Ensembl / ensembl-vep

The Ensembl Variant Effect Predictor predicts the functional effects of genomic variants
https://www.ensembl.org/vep
Apache License 2.0
446 stars 151 forks source link

Segmentation fault using cache and custom annotation #1059

Closed rsalz closed 2 years ago

rsalz commented 2 years ago

Describe the issue

Unable to combine --cache and --gff without getting a 'Segmentation fault (core dumped)'. Reducing the buffer size did not help.

Additional information

I am trying to annotate variants using the information in the cache for GRCh38 but would also like to annotate with my own transcript set. The goal is to get the most 'severe' effect per variant while taking both the reference and the custom transcript annotations into account. I also want to add a few extra things to the output such as allele frequency and polyphen predictions for GRCh38 (even if the most severe effect occurs in the custom annotation!), hence the use of the cache.

As an alternative, I used a gff combiner to combine the reference and custom annotation to do this VEP run without the cache, which worked fine but then I couldn't also annotate the variants with max AF. I even tried adding more '--custom' annotations from VEP's ftp to add the max af but was getting variant identifiers instead of the actually allele frequency... (‘--custom http://ftp.ensembl.org/pub/data_files/homo_sapiens/GRCh38/variation_genotype/ExAC.0.3.GRCh38.vcf.gz,exacAF,vcf,exact,,’)

I know that running GRCh38 and custom gff individually works, but I don't want to run VEP twice and I want the most severe effect while taking both into account. I'm open to any solution that gets me the output I want with 1 VEP run.

System

downloaded VEP using anaconda

Full VEP command line

vep --cache --merged --assembly GRCh38 --cache_version 104 --max_af --sift b --polyphen b --gff transdecoder.gff3.gz --pick_allele --pick_order rank,biotype,length --fasta hg38.fa -i all_patients.vcf -o cache_custom_output.tab

Full error message

Possible precedence issue with control flow operator at /mnt/home2/renees/anaconda3/envs/vep/lib/site_perl/5.26.2/Bio/DB/IndexedBase.pm line 805. 2021-09-30 11:13:27 - INFO: BAM-edited cache detected, enabling --use_transcript_ref; use --use_given_ref to override this Segmentation fault (core dumped)

Data files (if applicable)

gff3 file organized and verified using AGAT utility. I know that the gff3 is fine because VEP works properly when annotating using these files without adding the cache. Also VEP works properly when removing the gff file and leaving only the cache.

dglemos commented 2 years ago

Hi @rsalz, I cannot reproduce your error. I'm able to use a VEP command with annotations from both gff file and cache. Could you please:

rsalz commented 2 years ago

The gff file came from TransDecoder. I attached the bgzipped gff file that I use. The command I try to use is the following:

vep --cache --merged --assembly GRCh38 --cache_version 101 --max_af --sift b --polyphen b --protein --domains --gff transdecoder.gff3.gz --pick_allele --pick_order rank,biotype,length --fasta hg38.fa -i all_patients.vcf -o annotated_severe.tab

Just removing --gff transdecoder.gff3.gz is enough to get the command to work.

dglemos commented 2 years ago

I still cannot reproduce the issue. How big is your input file and which types of variants does it have? If you are allow to, could you please email the file to helpdesk@ensembl.org?

rsalz commented 2 years ago

I can't share the variants. So I tried the annotation again with a publicly available variant set (NA12878) and had the same error

dglemos commented 2 years ago

Are you using an indexed cache? http://ftp.ensembl.org/pub/release-104/variation/indexed_vep_cache/

rsalz commented 2 years ago

Yes, I installed the cache using the prompt with the vep_install conda wrapper

dglemos commented 2 years ago

I tested the command with the input file NA12878 and your gff file, still cannot reproduce the error. You could try to remove the cache downloaded with conda and download it manually from here. Documentation about cache can be found here. If that doesn't work, you could download the fasta file following these instructions. The files are available here.

rsalz commented 2 years ago

Neither of these actions fixed my error

dglemos commented 2 years ago

How much time does it take for the job to fail?

rsalz commented 2 years ago

2-3 minutes. There is an output, but never makes it past the variants in chromosome 1. I get about 50K variants annotated before vep suddenly stops

dglemos commented 2 years ago

Do you know how much memory the job uses? Using the file NA12878, which variant is the last one to be annotated before the job fails?

Did you follow these instructions to prepare the fasta file:

curl -O http://ftp.ensembl.org/pub/release-104/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
gzip -d Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
bgzip Homo_sapiens.GRCh38.dna.primary_assembly.fa

Could you please run the job again with flag --offline to make sure VEP doesn't connect to the database.

rsalz commented 2 years ago

I tried bgzip, no change. Then I ran it again with the --offline flag and now the error is gone. Thanks for your help!