Ensembl / ensembl-vep

The Ensembl Variant Effect Predictor predicts the functional effects of genomic variants
https://www.ensembl.org/vep
Apache License 2.0

Custom annotation with VCF files #23

Closed stefandiederich closed 7 years ago

stefandiederich commented 7 years ago

Good morning,

I use the custom annotation option to annotate my newly found variants with information from our in-house variant database (allele frequencies etc.). To do this, I export all variants from our in-house database to a VCF file that looks like this:

#CHROM POS ID REF ALT
chr1 12670 0.0151515151515152 G C
chr1 13061 0.0151515151515152 G C
chr1 13091 0.0151515151515152 G A
chr1 13273 0.166666666666667 G C
chr1 13302 0.196969696969697 C T

If one position has two or more different variants, for example

chr1 160009163 0.0151515151515152 GACACACACACACACACAC G
chr1 160009163 0.0151515151515152 GACACACACACACACAC G
chr1 160009163 0.0151515151515152 G GAC

then VEP does not annotate a variant found at this position. Is there a way to fix this?

Thanks Stefan
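
Before exporting a custom VCF like the one above, it can help to know which positions carry more than one record, since those are the entries affected by the behaviour described here. The following is a minimal sketch, not part of VEP; the function name `positions_with_multiple_alleles` is hypothetical, and it assumes the five-column layout from the example (CHROM, POS, ID, REF, ALT) separated by whitespace.

```python
from collections import defaultdict

def positions_with_multiple_alleles(vcf_lines):
    """Return {(chrom, pos): [(ref, alt), ...]} for positions listed more than once."""
    by_position = defaultdict(list)
    for line in vcf_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip header and blank lines
        # Only the first five columns are needed; split() handles tabs and spaces.
        chrom, pos, _id, ref, alt = line.split()[:5]
        by_position[(chrom, int(pos))].append((ref, alt))
    return {key: alleles for key, alleles in by_position.items() if len(alleles) > 1}

# Records taken from the example in the question above.
example = """\
#CHROM POS ID REF ALT
chr1 13302 0.196969696969697 C T
chr1 160009163 0.0151515151515152 GACACACACACACACACAC G
chr1 160009163 0.0151515151515152 GACACACACACACACAC G
chr1 160009163 0.0151515151515152 G GAC
""".splitlines()

multi = positions_with_multiple_alleles(example)
for (chrom, pos), alleles in multi.items():
    print(chrom, pos, len(alleles))  # prints: chr1 160009163 3
```

Running this over the full export gives a list of positions to check in the annotated output.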

willmclaren commented 7 years ago

This should be fixed by https://github.com/Ensembl/ensembl-vep/commit/b9c14f99e5a7bc538089e3a18b581d83c16fea1b

stefandiederich commented 7 years ago

Is this build already downloadable, and have you fixed it in the old release, variant_effect_predictor.pl, too?


Stefan Diederich, M.Sc. Bioinformatics

Universitätsmedizin der Johannes Gutenberg-Universität Mainz Langenbeckstraße 1, 55131 Mainz Tel.: 06131 17-5797


willmclaren commented 7 years ago

Yes, the fix is committed to branch release/87 and master. You should be able to get the fix by running git pull in the ensembl-vep directory.

It has not been fixed in the old ensembl-tools/variant_effect_predictor, though I haven't checked if it is also an issue there. We are phasing out support for that version, so unless specifically requested I won't be updating it.

stefandiederich commented 7 years ago

Okay thank you, I will download it.

Because I have not managed to install Bio::DB::BigFile and its dependencies, I am using the old version of Variant Effect Predictor in parallel at the moment. The same problem occurs there. If it is not too much work, it would be great if you could fix it in the old version too.

Thanks Stefan



willmclaren commented 7 years ago

This should resolve it for the previous version too: https://github.com/Ensembl/ensembl-variation/commit/4a45e24af78e27346c0f9b801af863ac38b9a290

You'll need to re-run INSTALL.pl to pick up the change to the ensembl-variation API module.

stefandiederich commented 7 years ago

That’s great! Thank you!



stefandiederich commented 7 years ago

Hi, I just tested it with the new fixed version, but the problem still occurs. Maybe I am doing something wrong. With the new VEP, all my data is annotated correctly except for one entry, but that may be due to some other problem. However, I could not get the phyloP bigWig annotation file to work with the new version of VEP because it requires Bio::DB::BigFile. Is there any other way to add a custom annotation from bigWig files? I am now trying to convert the bigWig to a BED file, then gzip and tabix it, but this does not work for the whole file either, so I am now splitting the file by chromosome.

Another thing I noticed is that the new VEP takes much longer than the old version. Is the --fork option no longer available? I do not get any status information from VEP until it has finished.

Kind regards Stefan
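
The per-chromosome split mentioned in this comment could be sketched roughly as follows. This is an illustration only, assuming a four-column bedGraph layout (chrom, start, end, score); the function name `split_by_chromosome` and the example scores are hypothetical. Each group is sorted by start position because tabix expects position-sorted, bgzip-compressed input.

```python
from collections import defaultdict

def split_by_chromosome(bedgraph_lines):
    """Group bedGraph records by chromosome, each group sorted by start position."""
    groups = defaultdict(list)
    for line in bedgraph_lines:
        fields = line.split()
        # Skip blank, malformed, comment, and track/browser header lines.
        if len(fields) < 4 or line.startswith("#") or fields[0] in ("track", "browser"):
            continue
        chrom, start, end, score = fields[:4]
        groups[chrom].append((int(start), int(end), float(score)))
    return {chrom: sorted(records) for chrom, records in groups.items()}

# Hypothetical records; the scores are made up for illustration.
example = [
    "chr2\t100\t101\t0.5",
    "chr1\t200\t201\t-1.2",
    "chr1\t100\t101\t3.0",
]
per_chrom = split_by_chromosome(example)
for chrom, records in per_chrom.items():
    print(chrom, records)
```

Each group would then be written to its own file, compressed with bgzip, and indexed with tabix.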



willmclaren commented 7 years ago

Can you provide details of the entry that does not get annotated? I'd need to see the input VCF line, as well as the one from the custom VCF that should get added to the input.

Currently the Bio::DB::BigFile route is the only way to add annotations from bigWig files. If enough users have issues installing it we may look into a way to bypass this by using binaries compiled from the kent source tree.

The fork option is still available, and in our tests the new version is significantly faster than the previous one. No status output is produced, as you've noticed and as documented in the README (https://github.com/Ensembl/ensembl-vep#vepdiffs). If you can provide evidence with a command that we can reproduce, then we can take a look to see if there are specific options or something else that might be causing VEP to run more slowly for you.

stefandiederich commented 7 years ago

Hi,

Thanks for reopening the issue. I looked into the data over the last few days and found a small bug in my script that exports the data. Now all variants are annotated with my custom VCF file!

Now I only need to get the phyloP score to work.

I also tried the old version (variant_effect_predictor.pl) after your bug fix, but the problem remains there: the duplicate entries are still not annotated. As soon as I get the new script running with phyloP, I will use that one for my annotations.

Concerning the speed of vep.pl compared to variant_effect_predictor.pl, I use the following command line. The old version needs less than one minute; the new version needs approximately 10 minutes for the same VCF file.

perl vep.pl --offline --dir '/media/Berechnungen/AnnotationDBs/vep' --fasta '/media/Berechnungen/Referenzgenom/HG19/HG19.karyo.fasta.VEP.fa' --everything --assembly GRCh37 -i '/media/Berechnungen/VEP_test/0823-16.vcf' -o '/media/Berechnungen/VEP_test/0823-16.annot.vcf' --plugin CADD,'/media/Berechnungen/AnnotationDBs/cadd/20160314/whole_genome_SNVs.tsv.gz','/media/Berechnungen/AnnotationDBs/cadd/20160314/cadd_InDels.tsv.gz' -custom '/media/Berechnungen/AnnotationDBs/ihdb/20170208/ihdb_af.vcf.gz',IHDB_AF,vcf,exact,0 -custom '/media/Berechnungen/AnnotationDBs/ihdb/20170208/ihdb_count.vcf.gz',IHDB_Count,vcf,exact,0 --fork 20 --buffer_size 10000 --vcf --pick --pick_order rank,canonical,appris,tsl,biotype,ccds,length --force_overwrite

If you need further information please let me know



willmclaren commented 7 years ago

stefandiederich commented 7 years ago

  1. There are 7367 variants in the VCF file

  2. It's a custom enrichment experiment, but the variants are distributed over the whole genome

  3. The VCF is sorted

  4. I removed all --plugin and --custom flags and reduced --fork to 4 => vep.pl runs 16 minutes while variant_effect_predictor.pl runs 1 minute



willmclaren commented 7 years ago

Can you please do a further comparison without --everything?

stefandiederich commented 7 years ago

I did the test and it is better now, but the old one is still a little faster (old 20 sec, new 34 sec). The command I ran was:

variant_effect_predictor87.pl/vep.pl --offline --dir '/media/Berechnungen/AnnotationDBs/vep' --fasta '/media/Berechnungen/Referenzgenom/HG19/HG19.karyo.fasta.VEP.fa' --assembly GRCh37 -i '/media/Berechnungen/VEP_test/0823-16.vcf' -o '/media/Berechnungen/VEP_test/0823-16.annot.test.vcf' --fork 4 --buffer_size 10000 --vcf --pick --pick_order rank,canonical,appris,tsl,biotype,ccds,length --force_overwrite

When I run the old vep I get some warnings “Negative repeat count does nothing…” (see picture). Up to now I ignored this because the data looks very good and I did not miss anything. But maybe this has an influence…

[Image: screenshot of the "Negative repeat count does nothing" warnings; not preserved]



willmclaren commented 7 years ago

Very odd. In all my tests apart from a few edge cases, ensembl-vep is much faster (usually 20-40%).

There are a few small enhancements brought by some external perl modules which you might like to try, but they should not represent anything like the difference you're seeing:

However, we'd like to make sure this is the case for all our users, so if you'd like to continue to help that would be great. There's a couple of options:

  1. You can send me your input file, by email if required. Obviously this is only OK if your data is not private in any way.
  2. You can run VEP with a profiler, the output from which will show me where the problems lie. You'll need to install Devel::NYTProf; shown here are instructions using cpanm assuming you have write access to Perl's default installation directories:
$ cpanm Devel::NYTProf

If you don't, then you'll need to install somewhere locally, e.g. $HOME/src:

$ cpanm -l $HOME/src Devel::NYTProf
$ export PERL5LIB=${PERL5LIB}:${HOME}/src/lib/perl5/[arch]
$ export PATH=${PATH}:${HOME}/src/bin

[arch] will be some string set according to the architecture of your machine; typically this is something like x86_64-linux-thread-multi-ld; you'd just have to check the contents of that path before running the export command.

Then you can run VEP with the profiler enabled as follows:

$ perl -d:NYTProf vep.pl --offline --dir '/media/Berechnungen/AnnotationDBs/vep' --fasta '/media/Berechnungen/Referenzgenom/HG19/HG19.karyo.fasta.VEP.fa' --assembly GRCh37 -i '/media/Berechnungen/VEP_test/0823-16.vcf' -o '/media/Berechnungen/VEP_test/0823-16.annot.test.vcf' --buffer_size 10000 --vcf --pick --pick_order rank,canonical,appris,tsl,biotype,ccds,length --force_overwrite --everything
$ nytprofhtml
$ tar cfz nytprof.tgz nytprof/

Note you will need to run VEP without forking for the profiling to work.

You'd then have to send me the nytprof.tgz file (or a link to it on e.g. DropBox).

stefandiederich commented 7 years ago

Of course I will continue to help you. I will send you the VCF file and try to run VEP with the profiler. Unfortunately I can't do this today because I am in meetings this afternoon. As soon as I have updated everything and done the run with the profiler, I will let you know. If you need anything else in the meantime, please let me know.



stefandiederich commented 7 years ago

Hi,

Sorry to keep you waiting so long. I followed all the steps you described and uploaded the NYTProf output to Dropbox (https://dl.dropboxusercontent.com/u/36851205/nytprof.tgz). I hope you can find the information you need. If you need anything further, please let me know.

Stefan



willmclaren commented 7 years ago

No problem. It's fairly clear from the profile that most of the runtime (>80%) is being spent fetching known variants from the cache.

You can speed this up hugely by tabix-converting the cache, see http://www.ensembl.org/info/docs/tools/vep/script/vep_cache.html#convert

Perhaps you did this for the cache you were using with variant_effect_predictor.pl but not for the vep.pl one?

stefandiederich commented 7 years ago

I converted the cache files and see a speed-up, but the old one is still a little bit faster ;-)

That’s what I did:

  1. Download the cache files for Homo sapiens GRCh37, version 87 (homo_sapiens_vep_87_GRCh37.tar.gz)

  2. Unzip it to /media/Berechnungen/AnnotationDBs/vep/

  3. Run “perl convert_cache.pl --dir /media/Berechnungen/AnnotationDBs/vep/ --species homo_sapiens --version all”

  4. Then run the old and the new VEP algorithm with the following parameters (same cache files):

variant_effect_predictor.pl/vep.pl --offline --dir '/media/Berechnungen/AnnotationDBs/vep' --fasta '/media/Berechnungen/Referenzgenom/HG19/HG19.karyo.fasta.VEP.fa' --assembly GRCh37 -i '/media/Berechnungen/VEP_test/0823-16.vcf' -o '/media/Berechnungen/VEP_test/0823-16.annot.2.vcf' --plugin CADD,'/media/Berechnungen/AnnotationDBs/cadd/20160314/whole_genome_SNVs.tsv.gz','/media/Berechnungen/AnnotationDBs/cadd/20160314/cadd_InDels.tsv.gz' -custom '/media/Berechnungen/AnnotationDBs/phylop/20160314/hg19.100way.phyloP100way.sorted.bedgraph.gz',PhyloP,bed,exact,0 -custom '/media/Berechnungen/AnnotationDBs/ihdb/20170208/ihdb_af.vcf.gz',IHDB_AF,vcf,exact,0 -custom '/media/Berechnungen/AnnotationDBs/ihdb/20170208/ihdb_count.vcf.gz',IHDB_Count,vcf,exact,0 --buffer_size 10000 --vcf --pick --pick_order rank,canonical,appris,tsl,biotype,ccds,length --force_overwrite --everything --fork 4/30

I looked at my watch and measured the time needed to process the vcf file.

                             Fork 4     Fork 30
vep.pl                       9:18 min   5:20 min
variant_effect_predictor.pl  8:18 min   1:23 min
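
To make the scaling in these timings easier to compare, the hand-measured values can be converted to seconds and expressed as a ratio between the two fork settings. This is only arithmetic on the numbers reported above; the helper names `to_seconds` and `speedup` are hypothetical.

```python
def to_seconds(mmss):
    """Convert a 'minutes:seconds' string such as '9:18' to seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)

# Wall-clock times as reported in the table above.
timings = {
    "vep.pl": {"fork 4": "9:18", "fork 30": "5:20"},
    "variant_effect_predictor.pl": {"fork 4": "8:18", "fork 30": "1:23"},
}

def speedup(script):
    """How many times faster the 30-fork run was than the 4-fork run."""
    t = timings[script]
    return to_seconds(t["fork 4"]) / to_seconds(t["fork 30"])

for script in timings:
    print(f"{script}: {speedup(script):.2f}x")
# vep.pl: 1.74x
# variant_effect_predictor.pl: 6.00x
```

On these numbers, going from 4 to 30 forks helps the old script far more than the new one, so most of the gap appears at the higher fork count.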

On closer inspection of the VCF file, I saw that the CADD annotations are missing in the output of the new VEP, whereas in the old one they are all there. Is the plugin not compatible with the new VEP?



willmclaren commented 7 years ago

Without your exact input file I can't say for sure, but my guess is that the tweaks to the forking parameters in vep.pl may not favour "sparse" input files, and instead are tuned to favour denser input files such as those from whole genome sequencing.

CADD should work fine with the new version, and does for me in my tests. Can you try with just CADD and the input file in examples/homo_sapiens_GRCh37.vcf?

stefandiederich commented 7 years ago

Hm, okay, I can send you my input file if you like.

Concerning CADD: it works with the test data set you mentioned, but it does not work with my real data generated with the GATK HaplotypeCaller. I have uploaded the input file to my Dropbox; maybe you can have a look at it: https://dl.dropboxusercontent.com/u/36851205/0823-16.vcf

I tried it with the full command and also with only the CADD plugin. Strange that it works with your files but not with mine…



willmclaren commented 7 years ago

Thanks for the updated info and the input file.

I've found why CADD wasn't working, this is due to different chromosome naming between your input and the CADD file index. This is fixed by a patch on the ensembl-variation repo (https://github.com/Ensembl/ensembl-variation/commit/d12da6894c174dbdd6731ed42645da15d172cbf4), you can pick it up by re-running INSTALL.pl.

Regarding the speed, it looks like a lot of the differential comes from VEP's handling of custom files when forking. When I remove the --custom flags, I find that the two versions are very similar, with vep.pl very marginally faster. I'll take a look into this in the near future.
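
The chromosome naming mismatch described above (an input that uses names like chr1 against an index that names the same sequence differently) can be illustrated with a small lookup that tries both spellings. This is a sketch of the general idea only and is not the code from the linked patch; the function names and the contents of `cadd_index` are assumptions made for the example.

```python
def candidate_names(chrom):
    """Return the name as given plus its 'chr'-prefixed or unprefixed counterpart."""
    if chrom.startswith("chr"):
        return [chrom, chrom[3:]]
    return [chrom, "chr" + chrom]

def resolve_chrom(chrom, indexed_names):
    """Pick the first candidate spelling present in the index, or None."""
    for name in candidate_names(chrom):
        if name in indexed_names:
            return name
    return None

# Hypothetical index contents: names without a "chr" prefix.
cadd_index = {"1", "2", "X"}

print(resolve_chrom("chr1", cadd_index))  # 1
print(resolve_chrom("chrM", cadd_index))  # None
```

Synonyms that differ by more than the prefix, such as chrM versus MT, are not handled by this sketch and would need an explicit mapping.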

stefandiederich commented 7 years ago

Thanks for investigating and solving the problem with CADD. I will pick up the newest patch by re-running INSTALL.pl. I am looking forward to the new versions, and until then I will keep my use of the custom option to a minimum.

Thanks again for your help!

