Ensembl / ensembl-vep

The Ensembl Variant Effect Predictor predicts the functional effects of genomic variants
https://www.ensembl.org/vep
Apache License 2.0

Synonyms file does not work in the offline mode #1705

Closed XinmengLiao closed 2 months ago

XinmengLiao commented 4 months ago

Describe the issue

Hi. I am currently using VEP v111.0 to annotate my VCF files. When I use --offline mode and provide the synonyms file, warnings show up and variants on some chromosomes are skipped because the chromosome "does not overlap any features".

System

Full VEP command line

vep --cache --dir_cache $VEP_CACHE \
--offline \
--fork 128 \
--format vcf \
--dir_plugins $VEP_plugins111/ \
-i Sample1_PASS.vcf.gz \
-o Sample1_vep_annotated.vcf.gz \
--force_overwrite \
--compress_output bgzip \
--assembly GRCh38 \
--symbol --vcf --check_existing --variant_class \
--sift b --polyphen b \
--synonyms $VEP_CACHE/homo_sapiens/111_GRCh38/chr_synonyms.txt \
--hgvs \
--fasta Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz \
--canonical \
--af --af_gnomade --af_gnomadg --max_af \
--custom clinvar_20240611_PLPC.vcf.gz,ClinVar,vcf,exact,0,ID,CLNSIG,CLNDN,CLNHGVS,CLNSIGINCL,CLNVC,GENEINFO,CLNDISDB,CLNSIGCONF,CLNREVSTAT,CLNDNINCL,CLNREVSTAT 

Full error message

WARNING: Line 249593 skipped (chr10_GL383545v1_alt 3485 . C CT 9.03 PASS AC=...): Chromosome chr10_GL383545v1_alt not found in annotation sources or synonyms; chromosome chr10_GL383545v1_alt does not overlap any features
WARNING: Line 249594 skipped (chr10_GL383545v1_alt 5295 . G GA 11.56 PASS AC...): Chromosome chr10_GL383545v1_alt not found in annotation sources or synonyms; chromosome chr10_GL383545v1_alt does not overlap any features
WARNING: Line 249710 skipped (chr10_KI270824v1_alt 73879 . T C 47.19 PASS AC...): Chromosome chr10_KI270824v1_alt not found in annotation sources or synonyms; chromosome chr10_KI270824v1_alt does not overlap any features
WARNING: Line 249711 skipped (chr10_KI270824v1_alt 74581 . CGCGGCTTTTTGCACCC...): Chromosome chr10_KI270824v1_alt not found in annotation sources or synonyms; chromosome chr10_KI270824v1_alt does not overlap any features

Additional description

None of the chromosomes named by their synonym are annotated correctly, even though all of these synonyms are present in 'chr_synonyms.txt' (two screenshots of the file were attached to the original post).
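The claim that every contig in the VCF is listed in the synonyms file can also be checked from the command line instead of by eye. The following is a minimal sketch, assuming 'chr_synonyms.txt' is a non-empty, tab-separated file with two names per line; the function names are made up for this illustration and are not part of VEP.

```shell
# list_vcf_contigs VCF
# Print the unique contig names (column 1) of a VCF, reading either plain
# text or gzip/bgzip-compressed input (bgzip output is gzip-compatible).
list_vcf_contigs() {
    case $1 in
        *.gz) gzip -cd -- "$1" ;;
        *)    cat -- "$1" ;;
    esac | awk -F'\t' '!/^#/ { print $1 }' | sort -u
}

# contigs_missing_from_synonyms VCF SYNONYMS
# Print every contig of VCF that appears in neither column of SYNONYMS.
# Assumption: SYNONYMS is non-empty and tab-separated with two columns.
contigs_missing_from_synonyms() {
    list_vcf_contigs "$1" | awk -F'\t' '
        NR == FNR { known[$1] = 1; known[$2] = 1; next }  # first file: synonyms
        !($1 in known)                                    # stdin: VCF contigs
    ' "$2" -
}

# Example call, using the paths from the command above:
#   contigs_missing_from_synonyms Sample1_PASS.vcf.gz \
#       "$VEP_CACHE/homo_sapiens/111_GRCh38/chr_synonyms.txt"
```

An empty result means every contig is known to the synonyms file, so the skipped lines would have to be explained by something other than a missing synonym.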

XinmengLiao commented 4 months ago

Additionally, when I turn off --offline mode, the connection to the database fails as follows:

MSG: Could not connect to database homo_sapiens_core_110_38 as user anonymous using [DBI:mysql:database=homo_sapiens_core_110_38;host=ensembldb.ensembl.org;port=3306] as a locator: DBI connect('database=homo_sapiens_core_110_38;host=ensembldb.ensembl.org;port=3306','anonymous',...) failed: Can't connect to MySQL server on 'ensembldb.ensembl.org' (110 "Connection timed out") at /sw/bioinfo/vep/110.1/rackham/Bio/EnsEMBL/DBSQL/DBConnection.pm line 260.

nakib103 commented 4 months ago

Hello @XinmengLiao,

Thanks for your reply!

I can reproduce the issue with --offline mode. The variants are not annotated because the cache is missing some folders required for those regions. This has been happening since e110; I will investigate the cause further.
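To see which contigs are affected by the missing folders in a local cache, something like the following sketch could be used. It assumes that the cache keeps one subdirectory per sequence region name directly under the species/version directory (for example homo_sapiens/111_GRCh38) and that 'chr_synonyms.txt' is tab-separated with two names per line; the function name is hypothetical.

```shell
# contigs_without_cache_dir CACHE_DIR SYNONYMS
# Reads contig names on stdin, one per line, and prints those for which
# neither the name itself nor any name paired with it in SYNONYMS has a
# subdirectory in CACHE_DIR.
# Assumptions: one cache subdirectory per sequence region; contig names
# contain no whitespace or glob characters.
contigs_without_cache_dir() (
    cache_dir=$1
    synonyms=$2
    while IFS= read -r contig; do
        [ -n "$contig" ] || continue
        # Candidate names: the contig itself plus every synonym paired with it.
        candidates=$(
            printf '%s\n' "$contig"
            awk -F'\t' -v c="$contig" \
                '$1 == c { print $2 } $2 == c { print $1 }' "$synonyms"
        )
        found=0
        for name in $candidates; do
            if [ -d "$cache_dir/$name" ]; then
                found=1
                break
            fi
        done
        [ "$found" -eq 1 ] || printf '%s\n' "$contig"
    done
)

# Example call, using the paths from the original command:
#   gzip -cd Sample1_PASS.vcf.gz | grep -v '^#' | cut -f1 | sort -u |
#       contigs_without_cache_dir "$VEP_CACHE/homo_sapiens/111_GRCh38" \
#           "$VEP_CACHE/homo_sapiens/111_GRCh38/chr_synonyms.txt"
```

Contigs printed by this check are listed in the synonyms file (or not) but have no matching folder, which is the situation described here.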

For the database issue, can you make sure that the required port 3306 is open and that this is not a firewall issue?
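A quick way to test this from the machine that runs VEP is a plain TCP probe. The sketch below assumes that a netcat (nc) build supporting the -z and -w options is installed; the helper name port_open is made up for this example.

```shell
# port_open HOST PORT [TIMEOUT_SECONDS]
# Succeeds (exit status 0) if a TCP connection to HOST:PORT can be opened
# within the timeout (default 5 s). Assumes `nc` with -z and -w is available.
port_open() {
    nc -z -w "${3:-5}" "$1" "$2" >/dev/null 2>&1
}

# Example call, using the host and port from the error message:
#   if port_open ensembldb.ensembl.org 3306; then
#       echo "port 3306 is reachable"
#   else
#       echo "port 3306 is blocked or the host is unreachable"
#   fi
```

A failure here only shows that the connection could not be opened from this host; it does not distinguish between a local firewall, a cluster-level network policy, and a problem on the server side.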

Best regards, Nakib

nakib103 commented 4 months ago

Hello @XinmengLiao,

I have an update on this issue. The missing folders are caused by an update to the Ensembl core database that has affected our VEP cache pipeline since e110. I have added a fix, which I hope will get in soon, so that VEP caches in future releases will contain all of those sequence regions.

For now, if you want to annotate those variants, I would advise using VEP version e109.

Best regards, Nakib

XinmengLiao commented 4 months ago

Thank you so much! Can I use all the plugins with e109 too? For example AlphaMissense and LOFTEE, which are new plugins with e111.

nakib103 commented 4 months ago

You can use the e109 version of the cache while VEP itself stays on e111. It will throw a warning saying there is a mismatch between the API and cache versions, but it will not stop the run. If you need to download the cache manually, check this page for instructions: https://www.ensembl.org/info/docs/tools/vep/script/vep_cache.html#cache

It is advisable to use the same version of the code and data, otherwise they can be incompatible, but until we have a fix you can take this approach. Let me know if you face any issues running this way.
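As a rough sketch of what such a run could look like: VEP's --cache_version option selects a cache version other than the one matching the installed code. The wrapper below is illustrative only. It assumes the e109 cache has been unpacked under the same cache root, in a 109_GRCh38 directory that contains its own chr_synonyms.txt; the function name and the VEP_BIN variable are hypothetical.

```shell
# run_vep_with_e109_cache CACHE_ROOT [further VEP options...]
# Runs the installed VEP (e111 in this thread) against an e109 cache.
# Assumptions: the e109 cache lives in CACHE_ROOT/homo_sapiens/109_GRCh38 and
# ships a chr_synonyms.txt; VEP_BIN is a hypothetical override for the
# executable name, defaulting to `vep`.
run_vep_with_e109_cache() (
    cache_root=$1
    shift
    "${VEP_BIN:-vep}" --cache --offline \
        --dir_cache "$cache_root" \
        --cache_version 109 \
        --assembly GRCh38 \
        --synonyms "$cache_root/homo_sapiens/109_GRCh38/chr_synonyms.txt" \
        "$@"
)

# Example call; the remaining options from the original command can be
# appended in the same way:
#   run_vep_with_e109_cache "$VEP_CACHE" \
#       -i Sample1_PASS.vcf.gz -o Sample1_vep_annotated.vcf.gz \
#       --format vcf --vcf --compress_output bgzip --force_overwrite
```

The version-mismatch warning mentioned above is expected with this setup.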

nakib103 commented 4 months ago

@XinmengLiao, the PR has been merged, so caches generated from the next release (e113) onwards will have the fix and contain those missing regions.

nakib103 commented 2 months ago

Hi @XinmengLiao,

I will close this issue. If you face any other issues, feel free to open a new one.

Best regards, Nakib