Ensembl / ensembl-vep

The Ensembl Variant Effect Predictor predicts the functional effects of genomic variants
https://www.ensembl.org/vep
Apache License 2.0

WARNING: Chromosome chr3_GL000221v1_random not found in annotation sources or synonyms; chromosome chr3_GL000221v1_random does not overlap any features #1640

Closed bobojin46 closed 2 weeks ago

bobojin46 commented 3 months ago

Describe the issue

I used the VEP Docker image to annotate 4 VCF files. The runs appeared to complete successfully, but each VCF produced a few warning messages. Should I be worried about these?

System

Full VEP command line

#!/bin/bash
> output.log
for vcf_file in /opt/vep/.vep/*.vcf; do
    filename=$(basename "$vcf_file")
    echo "Process \"$filename\" start"
    echo "Process \"$filename\" start" >> output.log
    ./vep -i "$vcf_file" -o "${vcf_file%.*}_annotated.vcf" --format vcf --vcf --symbol --terms SO --tsl --biotype --hgvs --fasta /opt/vep/.vep/database/hg38.fa --coding_only --af_gnomade --sift b --polyphen b --plugin Frameshift --plugin Wildtype -dir_plugins /opt/vep/.vep/VEP_plugins --offline --cache /opt/vep/.vep -pick --transcript_version >> output.log
    echo "Process \"$filename\" finish"
    echo "Process \"$filename\" finish" >> output.log
done

The cache directory is /opt/vep/.vep/homo_sapiens/111_GRCh38

Full error message

The log file is attached. output.log

Data files (a zip file is attached since the VCF file type is not supported)

nakib103 commented 3 months ago

Hello @bobojin46,

Thanks for your query! It seems some of the variants are being skipped when they should not be, with log messages like this:

WARNING: line 35 skipped (chrUn_KI270743v1 116744 . TTATAATGCAATCACA T ....): Chromosome chrUn_KI270743v1 not found in annotation sources or synonyms; chromosome chrUn_KI270743v1 does not overlap any features

Can you make sure you have the correct synonyms file in your cache directory? For example, it should contain the synonyms for the sequence regions that are getting skipped:

$ cat /opt/vep/.vep/homo_sapiens/111_GRCh38/chr_synonyms.txt | grep chrUn_KI270743v1
KI270743.1  chrUn_KI270743v1
NT_187498.1 chrUn_KI270743v1

If not, can you try re-downloading the cache and checking again?
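The same check can be run for every contig at once instead of one grep per name. This is only a rough sketch: it assumes the synonyms file is whitespace-separated with one name and one synonym per line, as in the output above, and the function name `contigs_without_synonym` is made up for this example (it is not part of VEP).

```bash
#!/bin/bash
# Print each contig used in the VCF data lines that appears nowhere
# in the synonyms file (neither as a name nor as a synonym).
# Usage: contigs_without_synonym input.vcf chr_synonyms.txt
contigs_without_synonym() {
    local vcf="$1" synonyms="$2"
    awk '
        # First file: remember every name and synonym.
        FILENAME == ARGV[1] { for (i = 1; i <= NF; i++) known[$i] = 1; next }
        # Second file: skip header lines and blank lines.
        /^#/ || NF == 0 { next }
        # Report each unknown contig once.
        !($1 in known) && !seen[$1]++ { print $1 }
    ' "$synonyms" "$vcf"
}
```

For example, `contigs_without_synonym "$vcf_file" /opt/vep/.vep/homo_sapiens/111_GRCh38/chr_synonyms.txt`. Empty output means every contig in the variant lines has an entry in the synonyms file, so the warning is probably coming from somewhere else.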

Best regards, Nakib

bobojin46 commented 3 months ago

Thanks for your quick response! I have checked the synonyms file for several of the skipped variants in my output.log. The synonyms file in the cache seems fine.

image

However, I noticed that the VCF was generated by aligning to the hg38 genome, while the cache directory is named 111_GRCh38. I'm fairly new to genome analysis, so could you please help me out? Should I worry about the warning if, for example, the skipped chrUn_KI270743v1 only occurs in the meta-information lines starting with “##” and not in the actual variant lines?

If these skipped chromosomes do not appear in the actual variant lines, is it OK to ignore the warnings, or should I keep working on this until no more warning messages occur?
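Whether a contig appears only in the “##” meta-information lines or also in real variant lines can be checked directly. The sketch below counts data lines (lines not starting with “#”) whose first column equals a given contig; `count_records_on_contig` is a hypothetical helper name, not a VEP command.

```bash
#!/bin/bash
# Count the variant records (non-header lines) on one contig of a VCF.
# Usage: count_records_on_contig input.vcf chrUn_KI270743v1
count_records_on_contig() {
    local vcf="$1" contig="$2"
    # "n + 0" makes awk print 0 instead of an empty string when nothing matched.
    awk -v c="$contig" '!/^#/ && $1 == c { n++ } END { print n + 0 }' "$vcf"
}
```

A result of 0 means the contig is only declared in the header. Any other number means there are real variants on it, and those are the lines VEP reports as skipped.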

nakib103 commented 3 months ago

Hello @bobojin46,

The hg38 genome is the same as GRCh38, so the name of the downloaded cache folder is not wrong. But I suspect the cache did not download properly. For example, you should see a separate folder for each sequence region (e.g. KI270743.1) under the cache directory. I think it is best to try re-downloading the cache if you want annotation for the variants that are getting skipped.

Many of the variants that are skipped are on unlocalised/unplaced contigs. Whether you can skip them depends on the project you are working on, and unfortunately that is not in my scope to comment on.

Best regards, Nakib

bobojin46 commented 3 months ago
image

The cache directory does contain such subfolders, and the downloaded archive is 23.34G, which seems right. Because of the large size and a limited internet connection, the cache was not downloaded in one go. I resumed the download with “nohup curl -C - -o homo_sapiens_vep_111_GRCh38.tar.gz https://ftp.ensembl.org/pub/release-111/variation/indexed_vep_cache/homo_sapiens_vep_111_GRCh38.tar.gz &”, so it should be fine. But I'll try to re-download. Thanks for your suggestion!
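A resumed download can be sanity-checked locally before extracting it. The sketch below only tests that the gzip stream is complete and uncorrupted. It cannot tell whether the archive on the server is itself missing content. `archive_is_intact` is a made-up helper name.

```bash
#!/bin/bash
# Report whether a .gz / .tar.gz file decompresses cleanly.
# Usage: archive_is_intact homo_sapiens_vep_111_GRCh38.tar.gz
archive_is_intact() {
    local archive="$1"
    # gzip -t decompresses without writing output and fails on
    # truncated or corrupted data.
    if gzip -t "$archive" 2>/dev/null; then
        echo "OK: $archive"
    else
        echo "BROKEN: $archive" >&2
        return 1
    fi
}
```

If this prints OK and folders are still missing after `tar xzf`, the download itself is unlikely to be the cause.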

(Also, I downloaded the indexed cache. Could that be the reason for the warning messages? Should I download the non-indexed version?)

nakib103 commented 3 months ago

Yes, it does seem like some folders are missing. For example, I do not see the KI270743.1 folder. Also, the indexed cache is the correct one.

bobojin46 commented 3 months ago

Hello @nakib, sorry to bother you again. So far I have re-downloaded the 111 cache twice on Linux, and also downloaded it once on my personal computer and transferred it to the server through Xftp, but got the same result. The archives are all 23.34G, but the same folders are always missing from the extracted homo_sapiens/111_GRCh38 folder.

I downloaded the 110 cache earlier, and it does have the extra subfolders such as KI270743.1, so the 110 cache seems fine. I decided to copy the missing subfolders into homo_sapiens/111_GRCh38 and re-run the VEP command to see whether the warning messages were caused by these missing subfolders. If so, how can I solve the missing subfolder issue, given that I have downloaded the cache several times with the same result? Or, if the above attempt works, can I use the combined cache, which contains the missing subfolders copied from the 110 cache?

bobojin46 commented 3 months ago

I have noticed other differences between the 110 and 111 caches. Besides the subfolders that are missing in 111 but present in 110, the subfolders in the 110 cache contain both 1-1000000.gz and 1-1000000_reg.gz, while the subfolders in the 111 cache contain only 1-1000000.gz. Is that right? It seems the problem is on my side. In the most recent download attempt, I used "nohup curl -o homo_sapiens_vep_111_GRCh38.tar.gz https://ftp.ensembl.org/pub/release-111/variation/indexed_vep_cache/homo_sapiens_vep_111_GRCh38.tar.gz &" and "tar xzf homo_sapiens_vep_111_GRCh38.tar.gz"
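The comparison between the two caches can be reproduced with a couple of small helpers. This is a sketch only: the function names are invented for this example, and the location of the 110 cache used in the note after the code is an assumption.

```bash
#!/bin/bash
# Print the names of subfolders that exist in the first cache directory
# but not in the second.
# Usage: subdirs_only_in_first <first_cache_dir> <second_cache_dir>
subdirs_only_in_first() {
    local first="$1" second="$2" path name
    for path in "$first"/*/; do
        [ -d "$path" ] || continue        # no subfolders at all
        name=$(basename "$path")
        [ -d "$second/$name" ] || printf '%s\n' "$name"
    done
}

# Count the *_reg.gz files anywhere below a cache directory.
# Usage: count_reg_files <cache_dir>
count_reg_files() {
    find "$1" -type f -name '*_reg.gz' | wc -l
}
```

Assuming the 110 cache sits next to the 111 one, `subdirs_only_in_first /opt/vep/.vep/homo_sapiens/110_GRCh38 /opt/vep/.vep/homo_sapiens/111_GRCh38` lists the folders missing from 111, and `count_reg_files` on each directory shows whether the _reg.gz files are there at all.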

nakib103 commented 3 months ago

Hi @bobojin46,

Sorry, I was looking at a local copy of the cache. I downloaded the cache from the FTP and you are actually right - it is missing some subfolders and _reg.gz files.

I will investigate what happened to the e111 cache and update here once I have more information.

Best regards, Nakib

nakib commented 3 months ago

Hey all,

I would like to stop getting tagged from this project :). I believe you want to tag @nakib103 (your coworker), not @nakib (me, not related to this project).

Best, Nakib

bobojin46 commented 3 months ago

Thank you so much. Looking forward to your reply! :)

nakib103 commented 3 months ago

Hello @bobojin46,

We have now updated the cache with the missing subfolders and _reg.gz files, and it is available on the FTP. Please download the cache files again and retry running VEP.

Also, to further answer your question, the missing subfolders (sequences like KI270743.1) were due to the missing _reg.gz files, which contain regulatory feature data. We only create a folder for a sequence if it has overlapping transcript, variant or regulatory features. The subfolders that were missing did not have any transcripts or variants, only regulatory features (and thus were not created, as regulatory features were omitted previously). So they will be important to you if you are interested in the regulatory features associated with your variants.
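To see which sequence folders fall into this category in a downloaded cache, something like the following could be used. It is a rough sketch that treats a folder as regulatory-only when every .gz file directly inside it ends in _reg.gz. That rule and the name `regulatory_only_dirs` are assumptions made for illustration, not a description of how the cache is built.

```bash
#!/bin/bash
# Print the sequence folders whose .gz files are all *_reg.gz files.
# Usage: regulatory_only_dirs /opt/vep/.vep/homo_sapiens/111_GRCh38
regulatory_only_dirs() {
    local cache_dir="$1" path total reg
    for path in "$cache_dir"/*/; do
        [ -d "$path" ] || continue
        # $(( ... )) also strips any padding that wc may add.
        total=$(( $(find "$path" -maxdepth 1 -type f -name '*.gz' | wc -l) ))
        reg=$(( $(find "$path" -maxdepth 1 -type f -name '*_reg.gz' | wc -l) ))
        if [ "$total" -gt 0 ] && [ "$total" -eq "$reg" ]; then
            basename "$path"
        fi
    done
}
```

Folders printed by this function would not exist in a cache built without regulatory data, which matches the set of subfolders that was missing.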

Best regards, Nakib

nakib103 commented 2 weeks ago

Hello @bobojin46, I will close this issue. If you face further problems, please feel free to open a new one.