Closed matthdsm closed 7 years ago
Matthias; Thanks for following up on this. I see the same slow runtime issues with HGVS turned on and a non-compressed fasta file. I'm agreed that the compressed/non-compressed reference is the likely cause of the slow runtimes, and think the slow runtimes are causing the timeout issues. The broken pipe message is present independently and I believe a red herring unrelated to the issue.
It sounds like we'll have to explore using a compressed reference genome to be able to turn HGVS back on for VEP. Definitely happy to help support that if you have time to dig into it. Thanks again.
Hi Brad,
Where do I start? HGVS annotations are essential for us, so leaving them out is a no go. I've got a deadline ending at june 9th, so I'd like to get this to work if in possible in any way.
Cheers M
Hi Brad,
Another (easier) option would be to let the vep installer download and install it's own reference fasta.
Normally, it should be possible to then remove the --fasta
argument completely, and let VEP detect the fasta file in it's cache directory.
The only remaining issue with this is that this approach breaks the genesplicer and maxentscan plugins, as those rely on the the fasta file from the --fasta
arg.
M
also, there are too many key tools that aren't compatible with compressed fasta files (e.g. most mappers, GATK, ...)
M
Hi Brad,
Would it be possible to change config_args = ["--fasta", dd.get_ref_file(data)]
here to something along the lines of config_args = ["--fasta", vep_fasta]
, where vep_fasta
represents the fasta file installed in the ensembl cache?
I think the path can be parsed from the ensembl_name
variable in de the data dictionary.
M
It's also possible to add an ensembl_fasta
entry to the genome config with the correct filename.
What's your take on this?
Cheers M
Matthias;
Thanks for digging into this more. It's unfortunate a compressed reference won't work globally now. I think the best way to do this is to install the bgzip compressed and tabixed reference alongside the uncompressed reference in the seq
directory for human samples. This will also make it available to other programs and hopefully be a migration path to using it more globally going forward.
What builds do you currently need this for? Is hg38 sufficient or do you also need hg19/GRCh37? Thanks again for all the background and digging with this.
Hi Brad
I think this currently is only relevant for the human builds, since the main issues are with the HGVS annotations. Perhaps this can be expanded as tools evolve.
Cheers M
Matthias -- thanks, this makes sense and we'll definitely only start with human. I was also hoping to add this only to hg38 as a first pass. Is that sufficient, or will you need hg19/GRCh37 as well?
Brad,
I'm don't mind having this for hg38 only, we're not using any other builds, so no objections here.
Thanks a lot! M
Hi Brad and Matthias! How the annotation supposed to work after the upgrade for a GRCh37 user? Just slow HGVS or no HGVS by default? Thanks! SN
It looks like no HGVS: old command:
unset PERL5LIB && export PATH=/hpf/largeprojects/ccmbio/naumenko/tools/bcbio/anaconda/bin:$PATH && /home/naumenko/work/tools/bcbio/anaconda/bin/vep --vcf -o stdout -i /hpf/largeprojects/ccmbio/naumenko/project_cheo/2017-05-25_235/235/work/samtools/235.vcf.gz --fork 7 --species homo_sapiens --no_stats --cache --merged --offline --dir /hpf/largeprojects/ccmbio/naumenko/tools/bcbio/genomes/Hsapiens/GRCh37/vep --symbol --numbers --biotype --total_length --canonical --gene_phenotype --ccds --uniprot --domains --regulatory --protein --tsl --appris --af --max_af --af_1kg --af_esp --af_exac --pubmed --variant_class --plugin LoF,human_ancestor_fa:/hpf/largeprojects/ccmbio/naumenko/tools/bcbio/genomes/Hsapiens/GRCh37/variation/human_ancestor.fa.gz --sift b --polyphen b --hgvs --shift_hgvs 1 --fasta /hpf/largeprojects/ccmbio/naumenko/tools/bcbio/genomes/Hsapiens/GRCh37/seq/GRCh37.fa | sed '/^#/! s/;;/;/g' | bgzip -c > /hpf/largeprojects/ccmbio/naumenko/project_cheo/2017-05-25_235/235/work/bcbiotx/tmpIdqoti/235-vepeffects.vcf.gz
new command:
unset PERL5LIB && export PATH=/hpf/largeprojects/ccmbio/naumenko/tools/bcbio/anaconda/bin:$PATH && /home/naumenko/work/tools/bcbio/anaconda/bin/vep --vcf -o stdout -i /hpf/largeprojects/ccmbio/naumenko/project_cheo/2017-05-25_235/235/work/samtools/235.vcf.gz --fork 7 --species homo_sapiens --no_stats --cache --offline --dir /hpf/largeprojects/ccmbio/naumenko/tools/bcbio/genomes/Hsapiens/GRCh37/vep --symbol --numbers --biotype --total_length --canonical --gene_phenotype --ccds --uniprot --domains --regulatory --protein --tsl --appris --af --max_af --af_1kg --af_esp --af_exac --pubmed --variant_class --fasta /hpf/largeprojects/ccmbio/naumenko/tools/bcbio/genomes/Hsapiens/GRCh37/seq/GRCh37.fa --plugin LoF,human_ancestor_fa:/hpf/largeprojects/ccmbio/naumenko/tools/bcbio/genomes/Hsapiens/GRCh37/variation/human_ancestor.fa.gz --sift b --polyphen b --merged | sed '/^#/! s/;;/;/g' | bgzip -c > /hpf/largeprojects/ccmbio/naumenko/project_cheo/2017-05-25_235/235/work/bcbiotx/tmp7EjQ3Z/235-vepeffects.vcf.gz
Is it possible to get HGVS back to where it once belonged (GRCh37)? I think many users are still on this reference?
Thanks! Sergey
Hi Sergey,
Currently, there's no HGVS for anyone (see here).
I do think it would be possible to add a gzipped reference for the other human build, or at least build in a check that uses the uncompressed fasta when no compressed reference is available. The hg38 only solution is only for testing purposes, that is unless Brad has an issue with adding more references.
Cheers M
Hi Brad,
How do you propose we tackle this? I suppose this can be as easy as adding "ref_file_compressed": {"keys": ["reference", "fasta", "compressed"]}
to the LOOKUPS
in pipeline/datadict.py
, and referencing it variation/effects.py
as dd.get_ref_file_compressed(data)
.
Does this sound correct to you? I'm pretty sure I'm still missing something somewhere.
Cheers M
Thanks Matthias for the clarification!
Guys, I am still a bit puzzled:
Whatever tactical benefits of this decision, does it look like a consistent development strategy?
SN
Sergey -- sorry, this is not a development strategy for sure. We transitioned to use the new VEP tool and it was incredibly slow with HGVS so we had to disable it. We're hoping to stabilize and re-enable HGVS before the next release. Apologies about the pain during the transition.
Matthias -- I can start to tackle this. The main steps are including as part of the download, adding into the files we retrieve, then adding a lookup as you suggest above. I'll start with updating the downloads to include the compressed reference for hg19, GRCh37 and hg38 and proceed from there. Does that work?
Hi Brad,
Works just fine for me, I'm eager to start testing, and get this stable.
Cheers M
Hi Brad,
During some tests outside of bcbio I encountered another issue regarding the HGVS annotations.
When using a bgzipped version of hg38.fa included in bcbio, I'm getting faulty HGVS notations, with lot's of N
instead of the correct sequence from the fasta file.
VEP docs say this can be caused by a faulty index, but no matter how much I reindex, same issue still arises.
I then tested the same command, but swapped out the bcbio included reference with the ensembl reference (downloaded with vep_install -a cf .... --CONVERT
). This yielded correct HGVS notations.
So my question is, are you aware of some differences between the ensembl ref genome en the bcbio ref genome (which is NCBI if I'm not mistaken)?
ping @willmclaren, Is there any reason the HGVS annotations are off when using a different reference fasta from the one included?
Cheers M
You might find that if the chromosome names differ between FASTA, VEP's cache and your input this can cause this issue.
VEP has a chromosome synonyms functionality to deal with this; VEP caches are shipped with a synonyms file which should be autodetected and loaded on startup, but you may provide an additional synonyms file with --synonyms
. See here and here for more details.
Hi Will,
Thanks for chiming in. The default fasta we use, uses the chr
notations, which are already defined in the synonyms file
[login] matdsmet:~ $ cat /bcbio-dev/genomes/Hsapiens/hg38/seq/hg38.fa | grep chr | head -n 10 [14:00:10]
>chr1 AC:CM000663.2 gi:568336023 LN:248956422 rl:Chromosome M5:6aef897c3d6ff0c78aff06ac189178dd AS:GRCh38
>chr2 AC:CM000664.2 gi:568336022 LN:242193529 rl:Chromosome M5:f98db672eb0993dcfdabafe2a882905c AS:GRCh38
>chr3 AC:CM000665.2 gi:568336021 LN:198295559 rl:Chromosome M5:76635a41ea913a405ded820447d067b0 AS:GRCh38
>chr4 AC:CM000666.2 gi:568336020 LN:190214555 rl:Chromosome M5:3210fecf1eb92d5489da4346b3fddc6e AS:GRCh38
>chr5 AC:CM000667.2 gi:568336019 LN:181538259 rl:Chromosome M5:a811b3dc9fe66af729dc0dddf7fa4f13 AS:GRCh38 hm:47309185-49591369
>chr6 AC:CM000668.2 gi:568336018 LN:170805979 rl:Chromosome M5:5691468a67c7e7a7b5f2a3a683792c29 AS:GRCh38
>chr7 AC:CM000669.2 gi:568336017 LN:159345973 rl:Chromosome M5:cc044cc2256a1141212660fb07b6171e AS:GRCh38
>chr8 AC:CM000670.2 gi:568336016 LN:145138636 rl:Chromosome M5:c67955b5f7815a9a1edfaa15893d3616 AS:GRCh38
>chr9 AC:CM000671.2 gi:568336015 LN:138394717 rl:Chromosome M5:6c198acf68b5af7b9d676dfdd531b5de AS:GRCh38
>chr10 AC:CM000672.2 gi:568336014 LN:133797422 rl:Chromosome M5:c0eeee7acfdaf31b770a509bdaa6e51a AS:GRCh38
...
Could this have something to do with sorting? I see in the Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
file, the chromosomes are sorted differently
[login] matdsmet:~ $ zcat /bcbio-dev/genomes/Hsapiens/hg38/vep/homo_sapiens_merged/88_GRCh38/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz | grep chr [14:02:41]
>1 dna:chromosome chromosome:GRCh38:1:1:248956422:1 REF
>10 dna:chromosome chromosome:GRCh38:10:1:133797422:1 REF
>11 dna:chromosome chromosome:GRCh38:11:1:135086622:1 REF
>12 dna:chromosome chromosome:GRCh38:12:1:133275309:1 REF
>13 dna:chromosome chromosome:GRCh38:13:1:114364328:1 REF
>14 dna:chromosome chromosome:GRCh38:14:1:107043718:1 REF
>15 dna:chromosome chromosome:GRCh38:15:1:101991189:1 REF
>16 dna:chromosome chromosome:GRCh38:16:1:90338345:1 REF
>17 dna:chromosome chromosome:GRCh38:17:1:83257441:1 REF
>18 dna:chromosome chromosome:GRCh38:18:1:80373285:1 REF
...
The fasta I'm testing this with is bgzipped and indexed with samtools faidx
. Do you know of any other steps that should be taken?
Cheers M
I wonder if samtools faidx produces the exact same index as Bio::DB::HTS::Faidx does; my guess would be that they should as both are obviously htslib-based, but there's a chance they're not.
Move the index file(s) generated by samtools out of the way and run VEP to let it create the index, see if that works.
Thanks, I'll give that a shot.
M
Matthias;
Thanks for sanity checking these. I've pushed and tested inclusion of bgzip fasta files and samtools faidx indexes with the standard bcbio upgrade. If you run bcbio_nextgen.py upgrade --data
it should grab the human references with these files for any you're currently working on. The next step on the bcbio side will be to identity these when present and use for VEP, and I'll work on the wiring for that. It might be worth testing the bcbio prepared downloads to see if they have issues with missing bases as well, and happy to figure out how best to adjust our distribution to fix this.
Hi Brad,
Thanks for all the work. I'm upgrading right away. In the meantime, Will's suggestion of letting perl handle the index generation fixed the faulty HGVS notations for me. I'm still struggling with the forks dying though.
I'll be sure to test the newly dowloaded files as soon as they're in.
M
Hi Brad,
Sorry do disappoint, but after several tries, I'm still getting the same issue.
Would you be open to leaving out the indices from the default download, and let them be generated on the fly by Bio::DB::HTS::Faidx
? It does seem there's still a difference.
M
Matthias; Thanks for testing this more. Unfortunately letting VEP generate the index is not a great option since it assumes that the user running a job has permissions to write to the genomes seq directory. We'd like to ship everything needed and can generate this index however works best and provide that.
Could you provide some test locations where the shipped indexes (created with samtools faidx
) return N
and the Bio::DB::HTS::Faidx
indexes do not? I'm trying to replicate the issue outside of VEP so I can be sure we've got it fixed. In my tests both work fine so I'm not sure in what cases it causes issues. Thanks much.
Hi Brad, We're testing with a GiaB exome sample and getting errors pretty much on every position. I've included a vcf with the faulty annotations. Hope this helps. GiaB-vepeffects.vcf.gz
Hi Brad,
Don't bother testing, I just did a diff of both your index and one created by the perl module. Both files are identical. Now I really have no clue what's the issue. It just seems to be so random..
@willmclaren Any more ideas?
M
Matthias;
Thanks for following up with testing. I'm also confused as to what is going on. From reading the VEP and Ensembl code it doesn't look like anything special is going on with the faidx indexing and I can't manage to replicate by doing retrieval/index lookup directly using the Perl Bio::DB::HTS::Faidx
module.
I finalized putting the infrastructure in place so now bcbio will use the compressed indexes for VEP. If present, this turns HGVS back on so should restore the previous functionality provided we can resolve the lookup issue. I'll work on testing this in the context to bcbio to see if I can replicate and debug further as well, but any ideas would be welcome.
Matthias and Will;
In my testing with NA12878 and the updated bcbio infrastructure I think I've identified the difference in lookups. The problem N
lookups appear when you run with multiple threads. I'll have to defer to Will on how best to debug from here.
For single core runs, the HGVS fasta lookups are working and reference bases get filled in correctly. This is on hg38 using the latest development version of bcbio and references downloaded with the ggd recipe we updated yesterday, using a single core:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA12878-1
chr1 13813 . T G 11.46 PASS AB=0.666667;ABP=3.73412;AC=1;AF=0.5;AN=2;AO=2;CIGAR=1X;DP=3;DPB=3;DPRA=0;EPP=3.0103;EPPR=5.18177;GTI=0;LEN=1;MEANALT=1;MQM=19;MQMR=22;NS=1;NUMALT=1;ODDS=2.55439;PAIRED=1;PAIREDR=1;PAO=0;PQA=0;PQR=0;PRO=0;QA=82;QR=40;RO=1;RPL=1;RPP=3.0103;RPPR=5.18177;RPR=1;RUN=1;SAF=0;SAP=7.35324;SAR=2;SRF=0;SRP=5.18177;SRR=1;TYPE=snp;technology.illumina=1;CSQ=G|downstream_gene_variant|MODIFIER|DDX11L1|ENSG00000223972|Transcript|ENST00000450305|transcribed_unprocessed_pseudogene|||||||||||143|1||SNV|HGNC|HGNC:37102||||||||||Ensembl||||||||||||||||||||||||||||||||||||,G|non_coding_transcript_exon_variant|MODIFIER|DDX11L1|ENSG00000223972|Transcript|ENST00000456328|processed_transcript|3/3||ENST00000456328.2:n.1061T>G||1061/1657|||||||1||SNV|HGNC|HGNC:37102|YES|1||||||||Ensembl||||||||||||||||||||||||||||||||||||,G|downstream_gene_variant|MODIFIER|WASH7P|ENSG00000227232|Transcript|ENST00000488147|unprocessed_pseudogene|||||||||||591|-1||SNV|HGNC|HGNC:38034|YES|||||||||Ensembl||||||||||||||||||||||||||||||||||||,G|downstream_gene_variant|MODIFIER|MIR6859-1|ENSG00000278267|Transcript|ENST00000619216|miRNA|||||||||||3556|-1||SNV|HGNC|HGNC:50039|YES|||||||||Ensembl||||||||||||||||||||||||||||||||||||,G|downstream_gene_variant|MODIFIER|WASH7P|653635|Transcript|NR_024540.1|misc_RNA|||||||||||549|-1||SNV|EntrezGene|HGNC:38034|YES||||||||rseq_mrna_match|RefSeq||||||||||||||||||||||||||||||||||||,G|non_coding_transcript_exon_variant|MODIFIER|DDX11L1|100287102|Transcript|NR_046018.2|misc_RNA|3/3||NR_046018.2:n.1056T>G||1056/1652|||||||1||SNV|EntrezGene|HGNC:37102|YES||||||||rseq_mrna_match|RefSeq||||||||||||||||||||||||||||||||||||,G|downstream_gene_variant|MODIFIER|MIR6859-1|102466751|Transcript|NR_106918.1|miRNA|||||||||||3556|-1||SNV|EntrezGene|HGNC:50039|YES||||||||rseq_mrna_match|RefSeq|||||||||||||||||||||||||||||||||||| GT:GQ:DP:AD:RO:QR:AO:QA:GL 0/1:0:3:1,2:1:40:2:82:-2.70051,0,-1.29012
But if I use the same code and indexes with 2 cores (--fork 2
), I see the problem you identified:
chr1 13813 . T G 11.46 PASS AB=0.666667;ABP=3.73412;AC=1;AF=0.5;AN=2;AO=2;CIGAR=1X;DP=3;DPB=3;DPRA=0;EPP=3.0103;EPPR=5.18177;GTI=0;LEN=1;MEANALT=1;MQM=19;MQMR=22;NS=1;NUMALT=1;ODDS=2.55439;PAIRED=1;PAIREDR=1;PAO=0;PQA=0;PQR=0;PRO=0;QA=82;QR=40;RO=1;RPL=1;RPP=3.0103;RPPR=5.18177;RPR=1;RUN=1;SAF=0;SAP=7.35324;SAR=2;SRF=0;SRP=5.18177;SRR=1;TYPE=snp;technology.illumina=1;CSQ=G|downstream_gene_variant|MODIFIER|DDX11L1|ENSG00000223972|Transcript|ENST00000450305|transcribed_unprocessed_pseudogene|||||||||||143|1||SNV|HGNC|HGNC:37102||||||||||Ensembl||||||||||||||||||||||||||||||||||||,G|non_coding_transcript_exon_variant|MODIFIER|DDX11L1|ENSG00000223972|Transcript|ENST00000456328|processed_transcript|3/3||ENST00000456328.2:n.1061N>G||1061/1657|||||||1||SNV|HGNC|HGNC:37102|YES|1||||||||Ensembl||||||||||||||||||||||||||||||||||||,G|downstream_gene_variant|MODIFIER|WASH7P|ENSG00000227232|Transcript|ENST00000488147|unprocessed_pseudogene|||||||||||591|-1||SNV|HGNC|HGNC:38034|YES|||||||||Ensembl||||||||||||||||||||||||||||||||||||,G|downstream_gene_variant|MODIFIER|MIR6859-1|ENSG00000278267|Transcript|ENST00000619216|miRNA|||||||||||3556|-1||SNV|HGNC|HGNC:50039|YES|||||||||Ensembl||||||||||||||||||||||||||||||||||||,G|downstream_gene_variant|MODIFIER|WASH7P|653635|Transcript|NR_024540.1|misc_RNA|||||||||||549|-1||SNV|EntrezGene|HGNC:38034|YES||||||||rseq_mrna_match|RefSeq||||||||||||||||||||||||||||||||||||,G|non_coding_transcript_exon_variant|MODIFIER|DDX11L1|100287102|Transcript|NR_046018.2|misc_RNA|3/3||NR_046018.2:n.1056N>G||1056/1652|||||||1||SNV|EntrezGene|HGNC:37102|YES||||||||rseq_mrna_match|RefSeq||||||||||||||||||||||||||||||||||||,G|downstream_gene_variant|MODIFIER|MIR6859-1|102466751|Transcript|NR_106918.1|miRNA|||||||||||3556|-1||SNV|EntrezGene|HGNC:50039|YES||||||||rseq_mrna_match|RefSeq|||||||||||||||||||||||||||||||||||| GT:GQ:DP:AD:RO:QR:AO:QA:GL 0/1:0:3:1,2:1:40:2:82:-2.70051,0,-1.29012
So this does not appear to be index dependent, but rather something different happening during parallelization. Will, do you have any clues to help us dig further to resolve?
VEP has code in place specifically to address this (this if you're interested). Basically when the forked process is created the "connection" to the FASTA index needs to be renewed.
This functionality is not specifically tested in the test suite, but I'll write up some tests for it.
Here is what I've done to test manually, using the example file for GRCh38 provided in the VEP repo:
$ ./vep -i examples/homo_sapiens_GRCh38.vcf -force -cache -fasta ~/.vep/homo_sapiens/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz -hgvs -tab -fields HGVSc -pick -o stdout | grep -v "#" | sort -u > a
$ ./vep -i examples/homo_sapiens_GRCh38.vcf -force -cache -fasta ~/.vep/homo_sapiens/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz -hgvs -tab -fields HGVSc -pick -o stdout -fork 2 | grep -v "#" | sort -u > b
$ diff a b
$
$ head a
-
ENST00000215781.2:c.318C>G
ENST00000215781.2:c.648G>A
ENST00000215812.8:c.1005T>G
ENST00000215812.8:c.1090C>T
ENST00000215812.8:c.118C>T
ENST00000215812.8:c.133C>T
ENST00000215812.8:c.308T>C
ENST00000215812.8:c.389G>A
ENST00000215812.8:c.528C>T
Obviously you might have to change the path to the FASTA to get it to work in your environment.
Will;
Thanks for the test script and helping debug this. I can confirm that I can replicate the problem here with this input. After running on this minimal test case a bunch of times I take back my hypothesis that it's due to --fork
and will go with Matthias' claim that it's random. Here are 3 consecutive runs of the same script where I modified your test example to count the number of N>
calls. The first two times I see a bunch of them (but different numbers) and the third everything is fine:
$ bash test_vep.sh
172
110
$ bash test_vep.sh
172
72
$ bash test_vep.sh
0
Here's the version we're running:
ensembl : 88.c3093a6
ensembl-funcgen : 88.3e63130
ensembl-io : 88.277fe7c
ensembl-variation : 88.85fee8f
ensembl-vep : 88.10
and the script I ran:
#!/bin/bash
set -eu -o pipefail
unset PERL5LIB
export PATH=/mnt/work/bcbio/anaconda/bin:$PATH
vep | grep ensembl
vep -i homo_sapiens_GRCh38.vcf --species homo_sapiens --merged -force --offline -cache -fasta /mnt/work/bcbio/genomes/Hsapiens/hg38/seq/hg38.fa.gz -hgvs -tab -fields HGVSc --dir /mnt/work/bcbio/genomes/Hsapiens/hg38/vep -pick -o stdout | grep -v "#" | sort -u > a
grep -c 'N>' a
vep -i homo_sapiens_GRCh38.vcf --species homo_sapiens --merged -force --offline -cache -fasta /mnt/work/bcbio/genomes/Hsapiens/hg38/seq/hg38.fa.gz -hgvs -tab -fields HGVSc --dir /mnt/work/bcbio/genomes/Hsapiens/hg38/vep --fork 2 -pick -o stdout | grep -v "#" | sort -u > b
grep -c 'N>' b
So I'm thoroughly confused now as well. Thanks for any tips/tricks that might help debug this further.
Hi Brad, Will,
Thanks for helping look into this. When Brad said everything worked fine for him, I was starting to think the issue was specific for us. If there's anything I can do to help, please let me know.
M
Finally, tracked it down!
There was a bug in our module for retrieving sequence from the FASTA index; if you ask for the length of an absent chromosome name, it returns -1, not 0 or undef. This means that an incorrect synonym might be used when mapping between the input and FASTA chromosome names, and when you call keys() on a Perl hash the returned order is deliberately random, hence the random failures.
I've patched a fix to the ensembl-variation module where this is found. You'll need to re-run VEP's INSTALL.pl script to pick it up.
Allright! Great to hear! Thanks for tracking down the issue! @chapmanb, I suppose this will require a rebuild of the conda package, so the new perl modules are pulled in? I'll prepare a PR at bioconda.
Cheers M
Update: PR @ bioconda/bioconda-recipes#4881
Brilliant. Thank you Will for identifying the issue and the quick fix. Matthias, thanks for building the package. It looks like the fix went into ensembl-variation
so if I understand correctly a re-install build with an updated bioconda package will grab the latest from release/89
and have the fix in it. Thank you for taking care of that.
Works for HGVS on grch37. Thanks!
Fixed the issue for me on hg38! Thanks everyone!
M
Thanks all for the help, development and testing work. It's great to have bcbio working with the new VEP.
Hi Brad,
We're almost there when it comes to integrating the new vep. I'm still having some issues though. I think those are mainly because we use an uncompressed fasta as reference for the HGVS notations.
I'm getting the following:
[2017-05-24T11:13Z] Ensembl variant effect predictor : GiaB [2017-05-24T11:13Z] gzip: stdout: Broken pipe [2017-05-24T11:29Z] -------------------- EXCEPTION -------------------- [2017-05-24T11:29Z] MSG: [2017-05-24T11:29Z] ERROR: Forked process(es) died [2017-05-24T11:29Z] STACK Bio::EnsEMBL::VEP::Runner::_forked_buffer_to_output /home/galaxy/bcbio-dev/anaconda/share/ensembl-vep-88.9-2/modules/Bio/EnsEMBL/VEP/Runner.pm:362 [2017-05-24T11:29Z] STACK Bio::EnsEMBL::VEP::Runner::next_output_line /home/galaxy/bcbio-dev/anaconda/share/ensembl-vep-88.9-2/modules/Bio/EnsEMBL/VEP/Runner.pm:215 [2017-05-24T11:29Z] STACK Bio::EnsEMBL::VEP::Runner::run /home/galaxy/bcbio-dev/anaconda/share/ensembl-vep-88.9-2/modules/Bio/EnsEMBL/VEP/Runner.pm:117 [2017-05-24T11:29Z] STACK toplevel /home/galaxy/bcbio-dev/anaconda/bin/vep:209 [2017-05-24T11:29Z] Date (localtime) = Wed May 24 11:29:58 2017 [2017-05-24T11:29Z] Ensembl API version = 88 [2017-05-24T11:29Z] ---------------------------------------------------
I have a feeling the
broken pipe
error and the fork timeout are correlated.Also when I run this command outside of bcbio (copied from the commands log):
unset PERL5LIB && export PATH=/home/galaxy/bcbio-dev/anaconda/bin:$PATH && /home/galaxy/bcbio-dev/anaconda/bin/vep --vcf -o stdout -i /home/projects/bcbio_annotation/test_runs/bcbio_dev/work/gemini/GiaB-joint-gatk-haplotype-joint-decompose.vcf.gz --fork 6 --species homo_sapiens --no_stats --cache --offline --dir /home/galaxy/bcbio-dev/genomes/Hsapiens/hg38/vep --symbol --numbers --biotype --total_length --canonical --gene_phenotype --ccds --uniprot --domains --regulatory --protein --tsl --appris --af --max_af --af_1kg --af_esp --af_exac --pubmed --variant_class --fasta /home/galaxy/bcbio-dev/genomes/Hsapiens/hg38/seq/hg38.fa --plugin LoF,human_ancestor_fa:false --plugin MaxEntScan,/home/galaxy/bcbio-dev/anaconda/share/maxentscan-0_2004.04.21-0 --plugin GeneSplicer,/home/galaxy/bcbio-dev/anaconda/share/genesplicer-1.0-1/genesplicer,/home/galaxy/bcbio-dev/genomes/Hsapiens/hg38/variation/genesplicer --sift b --polyphen b --merged | sed '/^#/! s/;;/;/g' | bgzip -c > /tmp/bcbio/tmpg1yN0o/GiaB-joint-gatk-haplotype-joint-decompose-vepeffects.vcf.gz [2017-05-24T11:13Z] unset PERL5LIB && export PATH=/home/galaxy/bcbio-dev/anaconda/bin:$PATH && /home/galaxy/bcbio-dev/anaconda/bin/vep --vcf -o stdout -i /home/projects/bcbio_annotation/test_runs/bcbio_dev/work/gemini/GiaB-joint-gatk-haplotype-joint-decompose.vcf.gz --fork 6 --species homo_sapiens --no_stats --cache --offline --dir /home/galaxy/bcbio-dev/genomes/Hsapiens/hg38/vep --symbol --numbers --biotype --total_length --canonical --gene_phenotype --ccds --uniprot --domains --regulatory --protein --tsl --appris --af --max_af --af_1kg --af_esp --af_exac --pubmed --variant_class --fasta /home/galaxy/bcbio-dev/genomes/Hsapiens/hg38/seq/hg38.fa --plugin LoF,human_ancestor_fa:false --plugin MaxEntScan,/home/galaxy/bcbio-dev/anaconda/share/maxentscan-0_2004.04.21-0 --plugin GeneSplicer,/home/galaxy/bcbio-dev/anaconda/share/genesplicer-1.0-1/genesplicer,/home/galaxy/bcbio-dev/genomes/Hsapiens/hg38/variation/genesplicer --sift b --polyphen b --hgvs --shift_hgvs 1 --merged | sed '/^#/! s/;;/;/g' | bgzip -c > /tmp/bcbio/tmpvU1lI5/GiaB-joint-gatk-haplotype-joint-decompose-vepeffects.vcf.gz
I get the same error timeout issue, but without the
broken pipe
.if I run the exact same command locally, but change the fasta file to an bgzipped and indexed file, everything runs as it should and the exome is processed in about 40 min.
If you have the time, could you run some tests and see if you get the same results?
Thanks M
Hi Matthias, I am running the following command: vep --cache --symbol --input_file $input --output_file $output
But I am getting your broken pipe error: gzip: stdout: Broken pipe
Everything seems to run but takes forever. I am quite new with vep, let me know if you have any idea / tests ?
Thanks a lot Martin
Hi Martin!
I've just annotated 1520 variants with 5 threads in under 2min using bcbio installation and https://github.com/naumenko-sa/bioscripts/blob/master/variants/vep.sh
Make sure your vep installation (--tools) updates matches with data installation (vep cache).
Sergey
Hi Martin!
I've just annotated 1520 variants with 5 threads in under 2min using bcbio installation and https://github.com/naumenko-sa/bioscripts/blob/master/variants/vep.sh
Make sure your vep installation (--tools) updates matches with data installation (vep cache).
Sergey
Thank you for your answer, I changed --cache to --offline and this fixed my gzip: stdout: Broken pipe error
Hi Brad,
We're almost there when it comes to integrating the new vep. I'm still having some issues though. I think those are mainly because we use an uncompressed fasta as reference for the HGVS notations.
I'm getting the following:
I have a feeling the
broken pipe
error and the fork timeout are correlated.Also when I run this command outside of bcbio (copied from the commands log):
I get the same error timeout issue, but without the
broken pipe
.if I run the exact same command locally, but change the fasta file to an bgzipped and indexed file, everything runs as it should and the exome is processed in about 40 min.
If you have the time, could you run some tests and see if you get the same results?
Thanks M