Clinical-Genomics / scout

VCF visualization interface
https://clinical-genomics.github.io/scout
BSD 3-Clause "New" or "Revised" License
149 stars 45 forks source link

Parse frequencies from VEP #733

Closed moonso closed 6 years ago

moonso commented 6 years ago

VEP 87

GMAF - 1000G all popluations combined xxx_MAF - 1000G (or NHLBI-ESP) individual populations ExAC_MAF - ExAC all popluations ExAC_Adj_MAF- ExAC all popluations (adjusted) ExAC_xxx_MAF - ExAC individual populations http://dec2016.archive.ensembl.org/info/docs/tools/vep/vep_formats.html

VEP 90

AF - 1000G all populations combined xxx_AF - 1000G (or NHLBI-ESP) individual populations gnomAD_AF - gnomAD exomes, all populations combined gnomAD_xxx_AF - gnomAD exomes, individual populations MAX_AF - Max of all populations (1000G, gnomAD exomes, ESP) https://www.ensembl.org/info/docs/tools/vep/vep_formats.html

moonso commented 6 years ago

@PeterARBork @bjhall @dnil I will only support the 90 annotation since the code will be awful if we need to support both. I don't have a larger file with information on how the VEP entries may look like so I will assume the keys above and that the value is only a float. If you have an example VCF with more variants or any other information please inform me.

PeterARBork commented 6 years ago

@moonso you have my support for supporting vep 90 and disregarding the earlier versions.

moonso commented 6 years ago

Fixed in #734

bjhall commented 6 years ago

@moonso

I checked some larger VCFs and unfortunately spotted a problem... All of the *AF fields are strictly floats, except for "AF" (1000G).

In rare cases (always for INDELs as far as I can tell) this field will contain two values separated by an &. For example "0.0791&0.07905". The weird thing is that they appear to almost always be the same value, just with different precision (4 vs 5 decimal places). According to the documentation this should not happen, so I suppose it is a bug in VEP.

I will test if this is fixed in VEP 91, but at the moment this will most likely cause crashes when loading VCFs from VEP 90. I haven't had time to actually test, but float("0.0791&0.07905") will throw a ValueError.

moonso commented 6 years ago

Hmm ok, we should post an issue on VEP:s GitHub. I think I wrote a try catch for this but we would still loose information.

bjhall commented 6 years ago

I reported this to VEP last week. It should be safe to use the first value in these cases. And it should be fixed by version 92: https://github.com/Ensembl/ensembl-vep/issues/143#issuecomment-366748690

moonso commented 6 years ago

Great job!

moonso commented 6 years ago

Btw, did anyone try this?

PeterARBork commented 6 years ago

I just loaded a vcf with only vep annotations, but found no variants in Scout - probably because there's no Gene.refGene field in the vcf. Does that sound about right, since I didn't annotate with annovar or other tools?

Is Scout already supporting gene identifier annotation from vep or could it do so in the future?

moonso commented 6 years ago

Scout parses hgnc ids from VEP. Sounds weird. You can email me a reduced version of the file if you don't get any progress

PeterARBork commented 6 years ago

Here's a few of the lines in the vcf:

##VEP="v90" time="2018-02-22 14:29:25" cache="/home/databases/variant_effect_predictor/.vep/homo_sapiens/90_GRCh38" db="homo_sapiens_core_90_38@ensembldb.ensembl.org" ensembl-variation=90.58bf949 ensembl-io=90.9a148ea ensembl-funcgen=90.743f32b ensembl=90.4a44397 1000genomes="phase3" COSMIC="81" ClinVar="201706" ESP="V2-SSA137" HGMD-PUBLIC="20164" assembly="GRCh38.p10" dbSNP="150" gencode="GENCODE 27" genebuild="2014-07" gnomAD="170228" polyphen="2.2.2" regbuild="16" sift="sift5.2.2"
##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence annotations from Ensembl VEP. Format: Consequence|gnomadAD_AF|AF|MAX_AF|gnomadAD_NFE_AF|EUR_AF|Distance|Codons|Amino_acids|Gene|SYMBOL|CLIN_SIG|Feature|Feature_type|EXON|PolyPhen|SIFT|Protein_position|BIOTYPE">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  57PGF
chr1    17019382        rs12045097      A       G       530.77  PASS    AC=1;AF=0.5;AN=2;BaseQRankSum=2.497;ClippingRankSum=0;DB;DP=38;ExcessHet=3.0103;FS=1.323;MLEAC=1;MLEAF=0.5;MQ=60;MQRankSum=0;POSITIVE_TRAIN_SITE;QD=13.97;ReadPosRankSum=0.113;SOR=0.368;VQSLOD=19.34;culprit=MQRankSum;CSQ=intron_variant|||0.5288||0.2097||||ENSG00000117118|SDHB||ENST00000375499|Transcript|||||protein_coding,intron_variant&non_coding_transcript_variant|||0.5288||0.2097||||ENSG00000117118|SDHB||ENST00000475049|Transcript|||||processed_transcript,intron_variant&non_coding_transcript_variant|||0.5288||0.2097||||ENSG00000117118|SDHB||ENST00000485092|Transcript|||||retained_intron,downstream_gene_variant|||0.5288||0.2097||||ENSG00000117118|SDHB||ENST00000485515|Transcript|||||processed_transcript,downstream_gene_variant|||0.5288||0.2097||||ENSG00000117118|SDHB||ENST00000491274|Transcript|||||protein_coding       GT:AD:DP:GQ:PL  0/1:19,19:38:99:559,0,558
chr1    17019909        rs2746476       A       T       1308.77 PASS    AC=2;AF=1;AN=2;DB;DP=39;ExcessHet=3.0103;FS=0;MLEAC=2;MLEAF=1;MQ=60;POSITIVE_TRAIN_SITE;QD=33.56;SOR=1.387;VQSLOD=22.35;culprit=MQ;CSQ=intron_variant|||0.996||0.9722||||ENSG00000117118|SDHB||ENST00000375499|Transcript|||||protein_coding,intron_variant&non_coding_transcript_variant|||0.996||0.9722||||ENSG00000117118|SDHB||ENST00000475049|Transcript|||||processed_transcript,intron_variant&non_coding_transcript_variant|||0.996||0.9722||||ENSG00000117118|SDHB||ENST00000485092|Transcript|||||retained_intron,downstream_gene_variant|||0.996||0.9722||||ENSG00000117118|SDHB||ENST00000485515|Transcript|||||processed_transcript,downstream_gene_variant|||0.996||0.9722||||ENSG00000117118|SDHB||ENST00000491274|Transcript|||||protein_coding GT:AD:DP:GQ:PL  1/1:0,39:39:99:1337,117,0
chr1    17021461        rs2746475       G       A       913.77  PASS    AC=2;AF=1;AN=2;DB;DP=29;ExcessHet=3.0103;FS=0;MLEAC=2;MLEAF=1;MQ=60;POSITIVE_TRAIN_SITE;QD=33.84;SOR=1.101;VQSLOD=23.65;culprit=MQ;CSQ=intron_variant|||0.996||0.9722||||ENSG00000117118|SDHB||ENST00000375499|Transcript|||||protein_coding,intron_variant&non_coding_transcript_variant|||0.996||0.9722||||ENSG00000117118|SDHB||ENST00000475049|Transcript|||||processed_transcript,intron_variant&non_coding_transcript_variant|||0.996||0.9722||||ENSG00000117118|SDHB||ENST00000485092|Transcript|||||retained_intron,downstream_gene_variant|||0.996||0.9722||||ENSG00000117118|SDHB||ENST00000485515|Transcript|||||processed_transcript,downstream_gene_variant|||0.996||0.9722||||ENSG00000117118|SDHB||ENST00000491274|Transcript|||||protein_coding GT:AD:DP:GQ:PL  1/1:0,27:27:81:942,81,0
chr1    17022049        rs2746474       T       G       712.77  PASS    AC=1;AF=0.5;AN=2;BaseQRankSum=-1.742;ClippingRankSum=0;DB;DP=36;ExcessHet=3.0103;FS=8.043;MLEAC=1;MLEAF=0.5;MQ=60;MQRankSum=0;POSITIVE_TRAIN_SITE;QD=19.8;ReadPosRankSum=0.761;SOR=2.124;VQSLOD=19.9;culprit=MQRankSum;CSQ=intron_variant|||0.5166||0.1968||||ENSG00000117118|SDHB||ENST00000375499|Transcript|||||protein_coding,intron_variant&non_coding_transcript_variant|||0.5166||0.1968||||ENSG00000117118|SDHB||ENST00000475049|Transcript|||||processed_transcript,intron_variant&non_coding_transcript_variant|||0.5166||0.1968||||ENSG00000117118|SDHB||ENST00000485092|Transcript|||||retained_intron,downstream_gene_variant|||0.5166||0.1968||||ENSG00000117118|SDHB||ENST00000485515|Transcript|||||processed_transcript,downstream_gene_variant|||0.5166||0.1968||||ENSG00000117118|SDHB||ENST00000491274|Transcript|||||protein_coding        GT:AD:DP:GQ:PL  0/1:12,24:36:99:741,0,310
chr1    17022192        rs61769185      C       T       841.77  PASS    AC=1;AF=0.5;AN=2;BaseQRankSum=4.309;ClippingRankSum=0;DB;DP=43;ExcessHet=3.0103;FS=0;MLEAC=1;MLEAF=0.5;MQ=60;MQRankSum=0;POSITIVE_TRAIN_SITE;QD=19.58;ReadPosRankSum=-0.476;SOR=0.855;VQSLOD=19.39;culprit=MQRankSum;CSQ=intron_variant|||0.3033||0.0974||||ENSG00000117118|SDHB||ENST00000375499|Transcript|||||protein_coding,intron_variant&non_coding_transcript_variant|||0.3033||0.0974||||ENSG00000117118|SDHB||ENST00000475049|Transcript|||||processed_transcript,intron_variant&non_coding_transcript_variant|||0.3033||0.0974||||ENSG00000117118|SDHB||ENST00000485092|Transcript|||||retained_intron,downstream_gene_variant|||0.3033||0.0974||||ENSG00000117118|SDHB||ENST00000485515|Transcript|||||processed_transcript,downstream_gene_variant|||0.3033||0.0974||||ENSG00000117118|SDHB||ENST00000491274|Transcript|||||protein_coding  GT:AD:DP:GQ:PL  0/1:15,28:43:99:870,0,319
chr1    17023303        rs978528        C       T       1873.77 PASS    AC=2;AF=1;AN=2;DB;DP=54;ExcessHet=3.0103;FS=0;MLEAC=2;MLEAF=1;MQ=60;POSITIVE_TRAIN_SITE;QD=34.7;SOR=0.693;VQSLOD=21.78;culprit=MQ;CSQ=intron_variant|||0.9059||0.8867||||ENSG00000117118|SDHB||ENST00000375499|Transcript|||||protein_coding,downstream_gene_variant|||0.9059||0.8867||||ENSG00000117118|SDHB||ENST00000463045|Transcript|||||protein_coding,upstream_gene_variant|||0.9059||0.8867||||ENSG00000117118|SDHB||ENST00000475049|Transcript|||||processed_transcript,downstream_gene_variant|||0.9059||0.8867||||ENSG00000117118|SDHB||ENST00000475506|Transcript|||||retained_intron,upstream_gene_variant|||0.9059||0.8867||||ENSG00000117118|SDHB||ENST00000485092|Transcript|||||retained_intron,intron_variant&non_coding_transcript_variant|||0.9059||0.8867||||ENSG00000117118|SDHB||ENST00000485515|Transcript|||||processed_transcript,downstream_gene_variant|||0.9059||0.8867||||ENSG00000117118|SDHB||ENST00000491274|Transcript|||||protein_coding     GT:AD:DP:GQ:PL  1/1:0,54:54:99:1902,162,0

Here's the panel:

##panel_id=pb_panel
##institute=bonkolab
##version=1.0
##date=2010-01-02
##display_name=PB Test panel
#hgnc_id        hgnc_symbol     disease_associated_transcripts  reduced_penetrance      genetic_disease_models  mosaicism       database_entry_version  original_hgnc
7481    MT-TF                                           MT-TF
7495    MT-TQ                                           MT-TQ
37102   DDX11L1
14825   OR4F5
10681   SDHB
18674   DDX41

I've been playing with panels also, but in this case I thought the SDHB variants should appear in scout?

bjhall commented 6 years ago

Your VCF is missing a lot of important VEP fields. Consider using --everything

@moonso I will test this on our setup next week! I'm on paternity leave and travelling at the moment...

PeterARBork commented 6 years ago

Alright it works when using --everything.

It would be great with some specifications or an example of how you annotate the vcf's, to make them ready for Scout.

moonso commented 6 years ago

👍 sure @PeterARBork I'll make a new issue for that.