bcbio / bcbio-nextgen

Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis
https://bcbio-nextgen.readthedocs.io
MIT License
991 stars 354 forks source link

further vep issues #1948

Closed matthdsm closed 7 years ago

matthdsm commented 7 years ago

Hi Brad,

We're almost there when it comes to integrating the new vep. I'm still having some issues though. I think those are mainly because we use an uncompressed fasta as reference for the HGVS notations.

I'm getting the following:

[2017-05-24T11:13Z] Ensembl variant effect predictor : GiaB
[2017-05-24T11:13Z] gzip: stdout: Broken pipe
[2017-05-24T11:29Z] -------------------- EXCEPTION --------------------
[2017-05-24T11:29Z] MSG:  
[2017-05-24T11:29Z] ERROR: Forked process(es) died
[2017-05-24T11:29Z] STACK Bio::EnsEMBL::VEP::Runner::_forked_buffer_to_output /home/galaxy/bcbio-dev/anaconda/share/ensembl-vep-88.9-2/modules/Bio/EnsEMBL/VEP/Runner.pm:362
[2017-05-24T11:29Z] STACK Bio::EnsEMBL::VEP::Runner::next_output_line /home/galaxy/bcbio-dev/anaconda/share/ensembl-vep-88.9-2/modules/Bio/EnsEMBL/VEP/Runner.pm:215
[2017-05-24T11:29Z] STACK Bio::EnsEMBL::VEP::Runner::run /home/galaxy/bcbio-dev/anaconda/share/ensembl-vep-88.9-2/modules/Bio/EnsEMBL/VEP/Runner.pm:117
[2017-05-24T11:29Z] STACK toplevel /home/galaxy/bcbio-dev/anaconda/bin/vep:209
[2017-05-24T11:29Z] Date (localtime)    = Wed May 24 11:29:58 2017
[2017-05-24T11:29Z] Ensembl API version = 88
[2017-05-24T11:29Z] ---------------------------------------------------

I have a feeling the broken pipe error and the fork timeout are correlated.

Also when I run this command outside of bcbio (copied from the commands log):

 unset PERL5LIB && export PATH=/home/galaxy/bcbio-dev/anaconda/bin:$PATH && /home/galaxy/bcbio-dev/anaconda/bin/vep --vcf -o stdout -i /home/projects/bcbio_annotation/test_runs/bcbio_dev/work/gemini/GiaB-joint-gatk-haplotype-joint-decompose.vcf.gz --fork 6 --species homo_sapiens --no_stats --cache --offline --dir /home/galaxy/bcbio-dev/genomes/Hsapiens/hg38/vep --symbol --numbers --biotype --total_length --canonical --gene_phenotype --ccds --uniprot --domains --regulatory --protein --tsl --appris --af --max_af --af_1kg --af_esp --af_exac --pubmed --variant_class --fasta /home/galaxy/bcbio-dev/genomes/Hsapiens/hg38/seq/hg38.fa --plugin LoF,human_ancestor_fa:false --plugin MaxEntScan,/home/galaxy/bcbio-dev/anaconda/share/maxentscan-0_2004.04.21-0 --plugin GeneSplicer,/home/galaxy/bcbio-dev/anaconda/share/genesplicer-1.0-1/genesplicer,/home/galaxy/bcbio-dev/genomes/Hsapiens/hg38/variation/genesplicer --sift b --polyphen b --merged | sed '/^#/! s/;;/;/g' | bgzip -c > /tmp/bcbio/tmpg1yN0o/GiaB-joint-gatk-haplotype-joint-decompose-vepeffects.vcf.gz
[2017-05-24T11:13Z] unset PERL5LIB && export PATH=/home/galaxy/bcbio-dev/anaconda/bin:$PATH && /home/galaxy/bcbio-dev/anaconda/bin/vep --vcf -o stdout -i /home/projects/bcbio_annotation/test_runs/bcbio_dev/work/gemini/GiaB-joint-gatk-haplotype-joint-decompose.vcf.gz --fork 6 --species homo_sapiens --no_stats --cache --offline --dir /home/galaxy/bcbio-dev/genomes/Hsapiens/hg38/vep --symbol --numbers --biotype --total_length --canonical --gene_phenotype --ccds --uniprot --domains --regulatory --protein --tsl --appris --af --max_af --af_1kg --af_esp --af_exac --pubmed --variant_class --fasta /home/galaxy/bcbio-dev/genomes/Hsapiens/hg38/seq/hg38.fa --plugin LoF,human_ancestor_fa:false --plugin MaxEntScan,/home/galaxy/bcbio-dev/anaconda/share/maxentscan-0_2004.04.21-0 --plugin GeneSplicer,/home/galaxy/bcbio-dev/anaconda/share/genesplicer-1.0-1/genesplicer,/home/galaxy/bcbio-dev/genomes/Hsapiens/hg38/variation/genesplicer --sift b --polyphen b --hgvs --shift_hgvs 1 --merged | sed '/^#/! s/;;/;/g' | bgzip -c > /tmp/bcbio/tmpvU1lI5/GiaB-joint-gatk-haplotype-joint-decompose-vepeffects.vcf.gz

I get the same error timeout issue, but without the broken pipe.

if I run the exact same command locally, but change the fasta file to an bgzipped and indexed file, everything runs as it should and the exome is processed in about 40 min.

If you have the time, could you run some tests and see if you get the same results?

Thanks M

chapmanb commented 7 years ago

Matthias; Thanks for following up on this. I see the same slow runtime issues with HGVS turned on and a non-compressed fasta file. I'm agreed that the compressed/non-compressed reference is the likely cause of the slow runtimes, and think the slow runtimes are causing the timeout issues. The broken pipe message is present independently and I believe a red herring unrelated to the issue.

It sounds like we'll have to explore using a compressed reference genome to be able to turn HGVS back on for VEP. Definitely happy to help support that if you have time to dig into it. Thanks again.

matthdsm commented 7 years ago

Hi Brad,

Where do I start? HGVS annotations are essential for us, so leaving them out is a no go. I've got a deadline ending at june 9th, so I'd like to get this to work if in possible in any way.

Cheers M

matthdsm commented 7 years ago

Hi Brad, Another (easier) option would be to let the vep installer download and install it's own reference fasta. Normally, it should be possible to then remove the --fasta argument completely, and let VEP detect the fasta file in it's cache directory. The only remaining issue with this is that this approach breaks the genesplicer and maxentscan plugins, as those rely on the the fasta file from the --fasta arg.

M

matthdsm commented 7 years ago

also, there are too many key tools that aren't compatible with compressed fasta files (e.g. most mappers, GATK, ...)

M

matthdsm commented 7 years ago

Hi Brad,

Would it be possible to change config_args = ["--fasta", dd.get_ref_file(data)] here to something along the lines of config_args = ["--fasta", vep_fasta], where vep_fasta represents the fasta file installed in the ensembl cache? I think the path can be parsed from the ensembl_name variable in de the data dictionary.

M

matthdsm commented 7 years ago

It's also possible to add an ensembl_fasta entry to the genome config with the correct filename. What's your take on this?

Cheers M

chapmanb commented 7 years ago

Matthias; Thanks for digging into this more. It's unfortunate a compressed reference won't work globally now. I think the best way to do this is to install the bgzip compressed and tabixed reference alongside the uncompressed reference in the seq directory for human samples. This will also make it available to other programs and hopefully be a migration path to using it more globally going forward.

What builds do you currently need this for? Is hg38 sufficient or do you also need hg19/GRCh37? Thanks again for all the background and digging with this.

matthdsm commented 7 years ago

Hi Brad

I think this currently is only relevant for the human builds, since the main issues are with the HGVS annotations. Perhaps this can be expanded as tools evolve.

Cheers M

chapmanb commented 7 years ago

Matthias -- thanks, this makes sense and we'll definitely only start with human. I was also hoping to add this only to hg38 as a first pass. Is that sufficient, or will you need hg19/GRCh37 as well?

matthdsm commented 7 years ago

Brad,

I'm don't mind having this for hg38 only, we're not using any other builds, so no objections here.

Thanks a lot! M

naumenko-sa commented 7 years ago

Hi Brad and Matthias! How the annotation supposed to work after the upgrade for a GRCh37 user? Just slow HGVS or no HGVS by default? Thanks! SN

naumenko-sa commented 7 years ago

It looks like no HGVS: old command:

unset PERL5LIB && export PATH=/hpf/largeprojects/ccmbio/naumenko/tools/bcbio/anaconda/bin:$PATH && /home/naumenko/work/tools/bcbio/anaconda/bin/vep --vcf -o stdout -i /hpf/largeprojects/ccmbio/naumenko/project_cheo/2017-05-25_235/235/work/samtools/235.vcf.gz --fork 7 --species homo_sapiens --no_stats --cache --merged --offline --dir /hpf/largeprojects/ccmbio/naumenko/tools/bcbio/genomes/Hsapiens/GRCh37/vep --symbol --numbers --biotype --total_length --canonical --gene_phenotype --ccds --uniprot --domains --regulatory --protein --tsl --appris --af --max_af --af_1kg --af_esp --af_exac --pubmed --variant_class --plugin LoF,human_ancestor_fa:/hpf/largeprojects/ccmbio/naumenko/tools/bcbio/genomes/Hsapiens/GRCh37/variation/human_ancestor.fa.gz --sift b --polyphen b --hgvs --shift_hgvs 1 --fasta /hpf/largeprojects/ccmbio/naumenko/tools/bcbio/genomes/Hsapiens/GRCh37/seq/GRCh37.fa | sed '/^#/! s/;;/;/g' | bgzip -c > /hpf/largeprojects/ccmbio/naumenko/project_cheo/2017-05-25_235/235/work/bcbiotx/tmpIdqoti/235-vepeffects.vcf.gz

new command:

unset PERL5LIB && export PATH=/hpf/largeprojects/ccmbio/naumenko/tools/bcbio/anaconda/bin:$PATH && /home/naumenko/work/tools/bcbio/anaconda/bin/vep --vcf -o stdout -i /hpf/largeprojects/ccmbio/naumenko/project_cheo/2017-05-25_235/235/work/samtools/235.vcf.gz --fork 7 --species homo_sapiens --no_stats --cache --offline --dir /hpf/largeprojects/ccmbio/naumenko/tools/bcbio/genomes/Hsapiens/GRCh37/vep --symbol --numbers --biotype --total_length --canonical --gene_phenotype --ccds --uniprot --domains --regulatory --protein --tsl --appris --af --max_af --af_1kg --af_esp --af_exac --pubmed --variant_class --fasta /hpf/largeprojects/ccmbio/naumenko/tools/bcbio/genomes/Hsapiens/GRCh37/seq/GRCh37.fa --plugin LoF,human_ancestor_fa:/hpf/largeprojects/ccmbio/naumenko/tools/bcbio/genomes/Hsapiens/GRCh37/variation/human_ancestor.fa.gz --sift b --polyphen b --merged | sed '/^#/! s/;;/;/g' | bgzip -c > /hpf/largeprojects/ccmbio/naumenko/project_cheo/2017-05-25_235/235/work/bcbiotx/tmp7EjQ3Z/235-vepeffects.vcf.gz

Is it possible to get HGVS back to where it once belonged (GRCh37)? I think many users are still on this reference?

Thanks! Sergey

matthdsm commented 7 years ago

Hi Sergey,

Currently, there's no HGVS for anyone (see here).

I do think it would be possible to add a gzipped reference for the other human build, or at least build in a check that uses the uncompressed fasta when no compressed reference is available. The hg38 only solution is only for testing purposes, that is unless Brad has an issue with adding more references.

Cheers M

matthdsm commented 7 years ago

Hi Brad,

How do you propose we tackle this? I suppose this can be as easy as adding "ref_file_compressed": {"keys": ["reference", "fasta", "compressed"]} to the LOOKUPS in pipeline/datadict.py, and referencing it variation/effects.py as dd.get_ref_file_compressed(data).

Does this sound correct to you? I'm pretty sure I'm still missing something somewhere.

Cheers M

naumenko-sa commented 7 years ago

Thanks Matthias for the clarification!

Guys, I am still a bit puzzled:

  1. For a long time HGVS was off by default, and we had a 'clinical_reporting' parameter to turn HGVS on. Those who wanted HGVS could turn it on at the price of speed.
  2. In 1.0.1 (January 2017) HGVS became on by default, with 'clinical_reporting' deprecated.
  3. In May 2017 HGVS becomes off by default without option to turn it on and documented only as a comment in the code.

Whatever tactical benefits of this decision, does it look like a consistent development strategy?

SN

chapmanb commented 7 years ago

Sergey -- sorry, this is not a development strategy for sure. We transitioned to use the new VEP tool and it was incredibly slow with HGVS so we had to disable it. We're hoping to stabilize and re-enable HGVS before the next release. Apologies about the pain during the transition.

Matthias -- I can start to tackle this. The main steps are including as part of the download, adding into the files we retrieve, then adding a lookup as you suggest above. I'll start with updating the downloads to include the compressed reference for hg19, GRCh37 and hg38 and proceed from there. Does that work?

matthdsm commented 7 years ago

Hi Brad,

Works just fine for me, I'm eager to start testing, and get this stable.

Cheers M

matthdsm commented 7 years ago

Hi Brad,

During some tests outside of bcbio I encountered another issue regarding the HGVS annotations. When using a bgzipped version of hg38.fa included in bcbio, I'm getting faulty HGVS notations, with lot's of N instead of the correct sequence from the fasta file. VEP docs say this can be caused by a faulty index, but no matter how much I reindex, same issue still arises.

I then tested the same command, but swapped out the bcbio included reference with the ensembl reference (downloaded with vep_install -a cf .... --CONVERT). This yielded correct HGVS notations.

So my question is, are you aware of some differences between the ensembl ref genome en the bcbio ref genome (which is NCBI if I'm not mistaken)?

ping @willmclaren, Is there any reason the HGVS annotations are off when using a different reference fasta from the one included?

Cheers M

willmclaren commented 7 years ago

You might find that if the chromosome names differ between FASTA, VEP's cache and your input this can cause this issue.

VEP has a chromosome synonyms functionality to deal with this; VEP caches are shipped with a synonyms file which should be autodetected and loaded on startup, but you may provide an additional synonyms file with --synonyms. See here and here for more details.

matthdsm commented 7 years ago

Hi Will,

Thanks for chiming in. The default fasta we use, uses the chr notations, which are already defined in the synonyms file

[login] matdsmet:~ $ cat /bcbio-dev/genomes/Hsapiens/hg38/seq/hg38.fa | grep chr | head -n 10                                                                                                                                                                                                                                                                                                                     [14:00:10]
>chr1  AC:CM000663.2  gi:568336023  LN:248956422  rl:Chromosome  M5:6aef897c3d6ff0c78aff06ac189178dd  AS:GRCh38
>chr2  AC:CM000664.2  gi:568336022  LN:242193529  rl:Chromosome  M5:f98db672eb0993dcfdabafe2a882905c  AS:GRCh38
>chr3  AC:CM000665.2  gi:568336021  LN:198295559  rl:Chromosome  M5:76635a41ea913a405ded820447d067b0  AS:GRCh38
>chr4  AC:CM000666.2  gi:568336020  LN:190214555  rl:Chromosome  M5:3210fecf1eb92d5489da4346b3fddc6e  AS:GRCh38
>chr5  AC:CM000667.2  gi:568336019  LN:181538259  rl:Chromosome  M5:a811b3dc9fe66af729dc0dddf7fa4f13  AS:GRCh38  hm:47309185-49591369
>chr6  AC:CM000668.2  gi:568336018  LN:170805979  rl:Chromosome  M5:5691468a67c7e7a7b5f2a3a683792c29  AS:GRCh38
>chr7  AC:CM000669.2  gi:568336017  LN:159345973  rl:Chromosome  M5:cc044cc2256a1141212660fb07b6171e  AS:GRCh38
>chr8  AC:CM000670.2  gi:568336016  LN:145138636  rl:Chromosome  M5:c67955b5f7815a9a1edfaa15893d3616  AS:GRCh38
>chr9  AC:CM000671.2  gi:568336015  LN:138394717  rl:Chromosome  M5:6c198acf68b5af7b9d676dfdd531b5de  AS:GRCh38
>chr10  AC:CM000672.2  gi:568336014  LN:133797422  rl:Chromosome  M5:c0eeee7acfdaf31b770a509bdaa6e51a  AS:GRCh38
...

Could this have something to do with sorting? I see in the Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz file, the chromosomes are sorted differently

[login] matdsmet:~ $ zcat /bcbio-dev/genomes/Hsapiens/hg38/vep/homo_sapiens_merged/88_GRCh38/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz | grep chr                                                                                                                                                                                                                                                            [14:02:41]
>1 dna:chromosome chromosome:GRCh38:1:1:248956422:1 REF
>10 dna:chromosome chromosome:GRCh38:10:1:133797422:1 REF
>11 dna:chromosome chromosome:GRCh38:11:1:135086622:1 REF
>12 dna:chromosome chromosome:GRCh38:12:1:133275309:1 REF
>13 dna:chromosome chromosome:GRCh38:13:1:114364328:1 REF
>14 dna:chromosome chromosome:GRCh38:14:1:107043718:1 REF
>15 dna:chromosome chromosome:GRCh38:15:1:101991189:1 REF
>16 dna:chromosome chromosome:GRCh38:16:1:90338345:1 REF
>17 dna:chromosome chromosome:GRCh38:17:1:83257441:1 REF
>18 dna:chromosome chromosome:GRCh38:18:1:80373285:1 REF
...

The fasta I'm testing this with is bgzipped and indexed with samtools faidx. Do you know of any other steps that should be taken?

Cheers M

willmclaren commented 7 years ago

I wonder if samtools faidx produces the exact same index as Bio::DB::HTS::Faidx does; my guess would be that they should as both are obviously htslib-based, but there's a chance they're not.

Move the index file(s) generated by samtools out of the way and run VEP to let it create the index, see if that works.

matthdsm commented 7 years ago

Thanks, I'll give that a shot.

M

chapmanb commented 7 years ago

Matthias; Thanks for sanity checking these. I've pushed and tested inclusion of bgzip fasta files and samtools faidx indexes with the standard bcbio upgrade. If you run bcbio_nextgen.py upgrade --data it should grab the human references with these files for any you're currently working on. The next step on the bcbio side will be to identity these when present and use for VEP, and I'll work on the wiring for that. It might be worth testing the bcbio prepared downloads to see if they have issues with missing bases as well, and happy to figure out how best to adjust our distribution to fix this.

matthdsm commented 7 years ago

Hi Brad,

Thanks for all the work. I'm upgrading right away. In the meantime, Will's suggestion of letting perl handle the index generation fixed the faulty HGVS notations for me. I'm still struggling with the forks dying though.

I'll be sure to test the newly dowloaded files as soon as they're in.

M

matthdsm commented 7 years ago

Hi Brad,

Sorry do disappoint, but after several tries, I'm still getting the same issue. Would you be open to leaving out the indices from the default download, and let them be generated on the fly by Bio::DB::HTS::Faidx? It does seem there's still a difference.

M

chapmanb commented 7 years ago

Matthias; Thanks for testing this more. Unfortunately letting VEP generate the index is not a great option since it assumes that the user running a job has permissions to write to the genomes seq directory. We'd like to ship everything needed and can generate this index however works best and provide that.

Could you provide some test locations where the shipped indexes (created with samtools faidx) return N and the Bio::DB::HTS::Faidx indexes do not? I'm trying to replicate the issue outside of VEP so I can be sure we've got it fixed. In my tests both work fine so I'm not sure in what cases it causes issues. Thanks much.

matthdsm commented 7 years ago

Hi Brad, We're testing with a GiaB exome sample and getting errors pretty much on every position. I've included a vcf with the faulty annotations. Hope this helps. GiaB-vepeffects.vcf.gz

matthdsm commented 7 years ago

Hi Brad,

Don't bother testing, I just did a diff of both your index and one created by the perl module. Both files are identical. Now I really have no clue what's the issue. It just seems to be so random..

@willmclaren Any more ideas?

M

chapmanb commented 7 years ago

Matthias; Thanks for following up with testing. I'm also confused as to what is going on. From reading the VEP and Ensembl code it doesn't look like anything special is going on with the faidx indexing and I can't manage to replicate by doing retrieval/index lookup directly using the Perl Bio::DB::HTS::Faidx module.

I finalized putting the infrastructure in place so now bcbio will use the compressed indexes for VEP. If present, this turns HGVS back on so should restore the previous functionality provided we can resolve the lookup issue. I'll work on testing this in the context to bcbio to see if I can replicate and debug further as well, but any ideas would be welcome.

chapmanb commented 7 years ago

Matthias and Will; In my testing with NA12878 and the updated bcbio infrastructure I think I've identified the difference in lookups. The problem N lookups appear when you run with multiple threads. I'll have to defer to Will on how best to debug from here.

For single core runs, the HGVS fasta lookups are working and reference bases get filled in correctly. This is on hg38 using the latest development version of bcbio and references downloaded with the ggd recipe we updated yesterday, using a single core:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  NA12878-1
chr1    13813   .       T       G       11.46   PASS    AB=0.666667;ABP=3.73412;AC=1;AF=0.5;AN=2;AO=2;CIGAR=1X;DP=3;DPB=3;DPRA=0;EPP=3.0103;EPPR=5.18177;GTI=0;LEN=1;MEANALT=1;MQM=19;MQMR=22;NS=1;NUMALT=1;ODDS=2.55439;PAIRED=1;PAIREDR=1;PAO=0;PQA=0;PQR=0;PRO=0;QA=82;QR=40;RO=1;RPL=1;RPP=3.0103;RPPR=5.18177;RPR=1;RUN=1;SAF=0;SAP=7.35324;SAR=2;SRF=0;SRP=5.18177;SRR=1;TYPE=snp;technology.illumina=1;CSQ=G|downstream_gene_variant|MODIFIER|DDX11L1|ENSG00000223972|Transcript|ENST00000450305|transcribed_unprocessed_pseudogene|||||||||||143|1||SNV|HGNC|HGNC:37102||||||||||Ensembl||||||||||||||||||||||||||||||||||||,G|non_coding_transcript_exon_variant|MODIFIER|DDX11L1|ENSG00000223972|Transcript|ENST00000456328|processed_transcript|3/3||ENST00000456328.2:n.1061T>G||1061/1657|||||||1||SNV|HGNC|HGNC:37102|YES|1||||||||Ensembl||||||||||||||||||||||||||||||||||||,G|downstream_gene_variant|MODIFIER|WASH7P|ENSG00000227232|Transcript|ENST00000488147|unprocessed_pseudogene|||||||||||591|-1||SNV|HGNC|HGNC:38034|YES|||||||||Ensembl||||||||||||||||||||||||||||||||||||,G|downstream_gene_variant|MODIFIER|MIR6859-1|ENSG00000278267|Transcript|ENST00000619216|miRNA|||||||||||3556|-1||SNV|HGNC|HGNC:50039|YES|||||||||Ensembl||||||||||||||||||||||||||||||||||||,G|downstream_gene_variant|MODIFIER|WASH7P|653635|Transcript|NR_024540.1|misc_RNA|||||||||||549|-1||SNV|EntrezGene|HGNC:38034|YES||||||||rseq_mrna_match|RefSeq||||||||||||||||||||||||||||||||||||,G|non_coding_transcript_exon_variant|MODIFIER|DDX11L1|100287102|Transcript|NR_046018.2|misc_RNA|3/3||NR_046018.2:n.1056T>G||1056/1652|||||||1||SNV|EntrezGene|HGNC:37102|YES||||||||rseq_mrna_match|RefSeq||||||||||||||||||||||||||||||||||||,G|downstream_gene_variant|MODIFIER|MIR6859-1|102466751|Transcript|NR_106918.1|miRNA|||||||||||3556|-1||SNV|EntrezGene|HGNC:50039|YES||||||||rseq_mrna_match|RefSeq||||||||||||||||||||||||||||||||||||    GT:GQ:DP:AD:RO:QR:AO:QA:GL      0/1:0:3:1,2:1:40:2:82:-2.70051,0,-1.29012

But if I use the same code and indexes with 2 cores (--fork 2), I see the problem you identified:

chr1    13813   .       T       G       11.46   PASS    AB=0.666667;ABP=3.73412;AC=1;AF=0.5;AN=2;AO=2;CIGAR=1X;DP=3;DPB=3;DPRA=0;EPP=3.0103;EPPR=5.18177;GTI=0;LEN=1;MEANALT=1;MQM=19;MQMR=22;NS=1;NUMALT=1;ODDS=2.55439;PAIRED=1;PAIREDR=1;PAO=0;PQA=0;PQR=0;PRO=0;QA=82;QR=40;RO=1;RPL=1;RPP=3.0103;RPPR=5.18177;RPR=1;RUN=1;SAF=0;SAP=7.35324;SAR=2;SRF=0;SRP=5.18177;SRR=1;TYPE=snp;technology.illumina=1;CSQ=G|downstream_gene_variant|MODIFIER|DDX11L1|ENSG00000223972|Transcript|ENST00000450305|transcribed_unprocessed_pseudogene|||||||||||143|1||SNV|HGNC|HGNC:37102||||||||||Ensembl||||||||||||||||||||||||||||||||||||,G|non_coding_transcript_exon_variant|MODIFIER|DDX11L1|ENSG00000223972|Transcript|ENST00000456328|processed_transcript|3/3||ENST00000456328.2:n.1061N>G||1061/1657|||||||1||SNV|HGNC|HGNC:37102|YES|1||||||||Ensembl||||||||||||||||||||||||||||||||||||,G|downstream_gene_variant|MODIFIER|WASH7P|ENSG00000227232|Transcript|ENST00000488147|unprocessed_pseudogene|||||||||||591|-1||SNV|HGNC|HGNC:38034|YES|||||||||Ensembl||||||||||||||||||||||||||||||||||||,G|downstream_gene_variant|MODIFIER|MIR6859-1|ENSG00000278267|Transcript|ENST00000619216|miRNA|||||||||||3556|-1||SNV|HGNC|HGNC:50039|YES|||||||||Ensembl||||||||||||||||||||||||||||||||||||,G|downstream_gene_variant|MODIFIER|WASH7P|653635|Transcript|NR_024540.1|misc_RNA|||||||||||549|-1||SNV|EntrezGene|HGNC:38034|YES||||||||rseq_mrna_match|RefSeq||||||||||||||||||||||||||||||||||||,G|non_coding_transcript_exon_variant|MODIFIER|DDX11L1|100287102|Transcript|NR_046018.2|misc_RNA|3/3||NR_046018.2:n.1056N>G||1056/1652|||||||1||SNV|EntrezGene|HGNC:37102|YES||||||||rseq_mrna_match|RefSeq||||||||||||||||||||||||||||||||||||,G|downstream_gene_variant|MODIFIER|MIR6859-1|102466751|Transcript|NR_106918.1|miRNA|||||||||||3556|-1||SNV|EntrezGene|HGNC:50039|YES||||||||rseq_mrna_match|RefSeq||||||||||||||||||||||||||||||||||||    GT:GQ:DP:AD:RO:QR:AO:QA:GL      0/1:0:3:1,2:1:40:2:82:-2.70051,0,-1.29012

So this does not appear to be index dependent, but rather something different happening during parallelization. Will, do you have any clues to help us dig further to resolve?

willmclaren commented 7 years ago

VEP has code in place specifically to address this (this if you're interested). Basically when the forked process is created the "connection" to the FASTA index needs to be renewed.

This functionality is not specifically tested in the test suite, but I'll write up some tests for it.

Here is what I've done to test manually, using the example file for GRCh38 provided in the VEP repo:

$ ./vep -i examples/homo_sapiens_GRCh38.vcf -force -cache -fasta ~/.vep/homo_sapiens/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz -hgvs -tab -fields HGVSc -pick -o stdout | grep -v "#" | sort -u > a
$ ./vep -i examples/homo_sapiens_GRCh38.vcf -force -cache -fasta ~/.vep/homo_sapiens/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz -hgvs -tab -fields HGVSc -pick -o stdout -fork 2 | grep -v "#" | sort -u > b
$ diff a b
$
$ head a
-
ENST00000215781.2:c.318C>G
ENST00000215781.2:c.648G>A
ENST00000215812.8:c.1005T>G
ENST00000215812.8:c.1090C>T
ENST00000215812.8:c.118C>T
ENST00000215812.8:c.133C>T
ENST00000215812.8:c.308T>C
ENST00000215812.8:c.389G>A
ENST00000215812.8:c.528C>T

Obviously you might have to change the path to the FASTA to get it to work in your environment.

chapmanb commented 7 years ago

Will; Thanks for the test script and helping debug this. I can confirm that I can replicate the problem here with this input. After running on this minimal test case a bunch of times I take back my hypothesis that it's due to --fork and will go with Matthias' claim that it's random. Here are 3 consecutive runs of the same script where I modified your test example to count the number of N> calls. The first two times I see a bunch of them (but different numbers) and the third everything is fine:

$ bash test_vep.sh
172
110
$ bash test_vep.sh
172
72
$ bash test_vep.sh
0

Here's the version we're running:

  ensembl              : 88.c3093a6
  ensembl-funcgen      : 88.3e63130
  ensembl-io           : 88.277fe7c
  ensembl-variation    : 88.85fee8f
  ensembl-vep          : 88.10

and the script I ran:

#!/bin/bash
set -eu -o pipefail
unset PERL5LIB
export PATH=/mnt/work/bcbio/anaconda/bin:$PATH

vep | grep ensembl
vep -i homo_sapiens_GRCh38.vcf  --species homo_sapiens --merged -force --offline -cache -fasta /mnt/work/bcbio/genomes/Hsapiens/hg38/seq/hg38.fa.gz -hgvs -tab -fields HGVSc --dir /mnt/work/bcbio/genomes/Hsapiens/hg38/vep -pick -o stdout | grep -v "#" | sort -u > a
grep -c 'N>' a
vep -i homo_sapiens_GRCh38.vcf --species homo_sapiens --merged -force --offline -cache -fasta /mnt/work/bcbio/genomes/Hsapiens/hg38/seq/hg38.fa.gz -hgvs -tab -fields HGVSc --dir /mnt/work/bcbio/genomes/Hsapiens/hg38/vep --fork 2 -pick -o stdout | grep -v "#" | sort -u > b
grep -c 'N>' b

So I'm thoroughly confused now as well. Thanks for any tips/tricks that might help debug this further.

matthdsm commented 7 years ago

Hi Brad, Will,

Thanks for helping look into this. When Brad said everything worked fine for him, I was starting to think the issue was specific for us. If there's anything I can do to help, please let me know.

M

willmclaren commented 7 years ago

Finally, tracked it down!

There was a bug in our module for retrieving sequence from the FASTA index; if you ask for the length of an absent chromosome name, it returns -1, not 0 or undef. This means that an incorrect synonym might be used when mapping between the input and FASTA chromosome names, and when you call keys() on a Perl hash the returned order is deliberately random, hence the random failures.

I've patched a fix to the ensembl-variation module where this is found. You'll need to re-run VEP's INSTALL.pl script to pick it up.

matthdsm commented 7 years ago

Allright! Great to hear! Thanks for tracking down the issue! @chapmanb, I suppose this will require a rebuild of the conda package, so the new perl modules are pulled in? I'll prepare a PR at bioconda.

Cheers M

Update: PR @ bioconda/bioconda-recipes#4881

chapmanb commented 7 years ago

Brilliant. Thank you Will for identifying the issue and the quick fix. Matthias, thanks for building the package. It looks like the fix went into ensembl-variation so if I understand correctly a re-install build with an updated bioconda package will grab the latest from release/89 and have the fix in it. Thank you for taking care of that.

naumenko-sa commented 7 years ago

Works for HGVS on grch37. Thanks!

matthdsm commented 7 years ago

Fixed the issue for me on hg38! Thanks everyone!

M

chapmanb commented 7 years ago

Thanks all for the help, development and testing work. It's great to have bcbio working with the new VEP.

martbro commented 4 years ago

Hi Brad,

We're almost there when it comes to integrating the new vep. I'm still having some issues though. I think those are mainly because we use an uncompressed fasta as reference for the HGVS notations.

I'm getting the following:

[2017-05-24T11:13Z] Ensembl variant effect predictor : GiaB
[2017-05-24T11:13Z] gzip: stdout: Broken pipe
[2017-05-24T11:29Z] -------------------- EXCEPTION --------------------
[2017-05-24T11:29Z] MSG:  
[2017-05-24T11:29Z] ERROR: Forked process(es) died
[2017-05-24T11:29Z] STACK Bio::EnsEMBL::VEP::Runner::_forked_buffer_to_output /home/galaxy/bcbio-dev/anaconda/share/ensembl-vep-88.9-2/modules/Bio/EnsEMBL/VEP/Runner.pm:362
[2017-05-24T11:29Z] STACK Bio::EnsEMBL::VEP::Runner::next_output_line /home/galaxy/bcbio-dev/anaconda/share/ensembl-vep-88.9-2/modules/Bio/EnsEMBL/VEP/Runner.pm:215
[2017-05-24T11:29Z] STACK Bio::EnsEMBL::VEP::Runner::run /home/galaxy/bcbio-dev/anaconda/share/ensembl-vep-88.9-2/modules/Bio/EnsEMBL/VEP/Runner.pm:117
[2017-05-24T11:29Z] STACK toplevel /home/galaxy/bcbio-dev/anaconda/bin/vep:209
[2017-05-24T11:29Z] Date (localtime)    = Wed May 24 11:29:58 2017
[2017-05-24T11:29Z] Ensembl API version = 88
[2017-05-24T11:29Z] ---------------------------------------------------

I have a feeling the broken pipe error and the fork timeout are correlated.

Also when I run this command outside of bcbio (copied from the commands log):

 unset PERL5LIB && export PATH=/home/galaxy/bcbio-dev/anaconda/bin:$PATH && /home/galaxy/bcbio-dev/anaconda/bin/vep --vcf -o stdout -i /home/projects/bcbio_annotation/test_runs/bcbio_dev/work/gemini/GiaB-joint-gatk-haplotype-joint-decompose.vcf.gz --fork 6 --species homo_sapiens --no_stats --cache --offline --dir /home/galaxy/bcbio-dev/genomes/Hsapiens/hg38/vep --symbol --numbers --biotype --total_length --canonical --gene_phenotype --ccds --uniprot --domains --regulatory --protein --tsl --appris --af --max_af --af_1kg --af_esp --af_exac --pubmed --variant_class --fasta /home/galaxy/bcbio-dev/genomes/Hsapiens/hg38/seq/hg38.fa --plugin LoF,human_ancestor_fa:false --plugin MaxEntScan,/home/galaxy/bcbio-dev/anaconda/share/maxentscan-0_2004.04.21-0 --plugin GeneSplicer,/home/galaxy/bcbio-dev/anaconda/share/genesplicer-1.0-1/genesplicer,/home/galaxy/bcbio-dev/genomes/Hsapiens/hg38/variation/genesplicer --sift b --polyphen b --merged | sed '/^#/! s/;;/;/g' | bgzip -c > /tmp/bcbio/tmpg1yN0o/GiaB-joint-gatk-haplotype-joint-decompose-vepeffects.vcf.gz
[2017-05-24T11:13Z] unset PERL5LIB && export PATH=/home/galaxy/bcbio-dev/anaconda/bin:$PATH && /home/galaxy/bcbio-dev/anaconda/bin/vep --vcf -o stdout -i /home/projects/bcbio_annotation/test_runs/bcbio_dev/work/gemini/GiaB-joint-gatk-haplotype-joint-decompose.vcf.gz --fork 6 --species homo_sapiens --no_stats --cache --offline --dir /home/galaxy/bcbio-dev/genomes/Hsapiens/hg38/vep --symbol --numbers --biotype --total_length --canonical --gene_phenotype --ccds --uniprot --domains --regulatory --protein --tsl --appris --af --max_af --af_1kg --af_esp --af_exac --pubmed --variant_class --fasta /home/galaxy/bcbio-dev/genomes/Hsapiens/hg38/seq/hg38.fa --plugin LoF,human_ancestor_fa:false --plugin MaxEntScan,/home/galaxy/bcbio-dev/anaconda/share/maxentscan-0_2004.04.21-0 --plugin GeneSplicer,/home/galaxy/bcbio-dev/anaconda/share/genesplicer-1.0-1/genesplicer,/home/galaxy/bcbio-dev/genomes/Hsapiens/hg38/variation/genesplicer --sift b --polyphen b --hgvs --shift_hgvs 1 --merged | sed '/^#/! s/;;/;/g' | bgzip -c > /tmp/bcbio/tmpvU1lI5/GiaB-joint-gatk-haplotype-joint-decompose-vepeffects.vcf.gz

I get the same error timeout issue, but without the broken pipe.

if I run the exact same command locally, but change the fasta file to an bgzipped and indexed file, everything runs as it should and the exome is processed in about 40 min.

If you have the time, could you run some tests and see if you get the same results?

Thanks M

Hi Matthias, I am running the following command: vep --cache --symbol --input_file $input --output_file $output

But I am getting your broken pipe error: gzip: stdout: Broken pipe

Everything seems to run but takes forever. I am quite new with vep, let me know if you have any idea / tests ?

Thanks a lot Martin

naumenko-sa commented 4 years ago

Hi Martin!

I've just annotated 1520 variants with 5 threads in under 2min using bcbio installation and https://github.com/naumenko-sa/bioscripts/blob/master/variants/vep.sh

Make sure your vep installation (--tools) updates matches with data installation (vep cache).

Sergey

martbro commented 4 years ago

Hi Martin!

I've just annotated 1520 variants with 5 threads in under 2min using bcbio installation and https://github.com/naumenko-sa/bioscripts/blob/master/variants/vep.sh

Make sure your vep installation (--tools) updates matches with data installation (vep cache).

Sergey

Thank you for your answer, I changed --cache to --offline and this fixed my gzip: stdout: Broken pipe error