bigdatagenomics / adam

ADAM is a genomics analysis platform with specialized file formats built using Apache Avro, Apache Spark, and Apache Parquet. Apache 2 licensed.
Apache License 2.0
996 stars 309 forks source link

Ensembl Variant Effect Predictor annotations support #2150

Open znikasz opened 5 years ago

znikasz commented 5 years ago

I have already been using Adam with snpEff annotations and it worked great. However, when I've tried to use it with a VCF file annotated using Ensembl VEP it didn't work. I was expecting, that ANN format is standardized enough, so that transcriptEffects will be populated.

So I wanted to ask whether I do something wrong and there is a way to make Adam work with VEP or it's not yet supported?

Thank you.

heuermh commented 5 years ago

Hello @znikasz

Ensembl VEP only outputs in ANN format if you use the --vcf_info_field ANN --terms so output options

http://useast.ensembl.org/info/docs/tools/vep/script/vep_options.html#output

This is how we call Ensembl VEP in Cannoli

https://github.com/bigdatagenomics/cannoli/blob/master/core/src/main/scala/org/bdgenomics/cannoli/Vep.scala#L83

znikasz commented 5 years ago

Dear @heuermh ,

thank you for an answer. I checked cannoli way of running VEP and I've tried to do it myself.

I've used a very simple VCF file, containing one change:

chr1    2309937 .       C       CTT     281.04  .       AC=2;AF=1;AN=2   GT:AD:DP:GQ:PL  1/1:0,8:8:24:295,24,0

I've annotated it with VEP, release 96:

vep         --dir /opt/vep/.vep/         -i /data/input.vcf.gz -o /data/output2.vep.vcf --format vcf --vcf --vcf_info_field ANN --terms SO --no_stats --offline

Annotated VCF had lines:

##VEP="v96" time="2019-05-14 17:42:50" cache="/opt/vep/.vep/homo_sapiens/96_GRCh38" ensembl=96.7a35428 ensembl-funcgen=96.9c3a0cd ensembl-variation=96.70d2777 ensembl-io=96.6e65b30 1000genomes="phase3" COSMIC="87" ClinVar="201901" ESP="V2-SSA137" HGMD-PUBLIC="20184" assembly="GRCh38.p12" dbSNP="151" gencode="GENCODE 30" genebuild="2014-07" gnomAD="r2.1" polyphen="2.2.2" regbuild="1.0" sift="sift5.2.2"
##INFO=<ID=ANN,Number=.,Type=String,Description="Consequence annotations from Ensembl VEP. Format: Allele|Consequence|IMPACT|SYMBOL|Gene|Feature_type|Feature|BIOTYPE|EXON|INTRON|HGVSc|HGVSp|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|Existing_variation|DISTANCE|STRAND|FLAGS|SYMBOL_SOURCE|HGNC_ID">
(...)
chr1    2309937 .       C       CTT     281.04  .       AC=2;AF=1;AN=2;DP=18;ExcessHet=3.0103;FS=0;MLEAC=2;MLEAF=1;MQ=59.4;QD=28.73;SOR=1.863;ANN=TT|3_prime_UTR_variant|MODIFIER|SKI|ENSG00000157933|Transcript|ENST00000378536|protein_coding|7/7||||5807-5808|||||||1||HGNC|HGNC:10896       GT:AD:DP:GQ:PL  1/1:0,8:8:24:295,24,0

But now running:

ac.loadVcf("/tmp/output2.vep.vcf").toVariants.dataset.show

throws:

Caused by: java.lang.NumberFormatException: For input string: "5807-5808"
  at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
  at java.lang.Integer.parseInt(Integer.java:580)
  at java.lang.Integer.parseInt(Integer.java:615)
  at org.bdgenomics.adam.converters.TranscriptEffectConverter$.parseFraction(TranscriptEffectConverter.scala:99)
  at org.bdgenomics.adam.converters.TranscriptEffectConverter$.parseTranscriptEffect(TranscriptEffectConverter.scala:143)

as 5807-5808 is not a number.

Is there anything that I'm doing wrong?

heuermh commented 5 years ago

Thank you for the detailed reply. VEP must not be using the same ANN format string as SnpEff or are otherwise different from the ANN specification. I'll take a closer look.