Closed buske closed 9 years ago
Hi Orion, we will take a look at this. Jules is working on exomiser 7 which will have a lot of fixes and also exac data, that might be of use.
Hi Orion,
What version was this run with? 6 or 7? Only 7 has the ExAC data in the database.
Can you send the two variants in VCF format and your settings file so I can run these against the current dev version.
Multi-allelic variants should have all the alleles looked-up and independently filtered, but I think that the old jannovar failed to create two VariantEvaluations for multi-allelic variants. The new one has solved this problem.
Anyway, I'll take a look tomorrow or Monday.
Jules
-----Original Message----- From: "Orion Buske" notifications@github.com Sent: 23/04/2015 20:15 To: "exomiser/Exomiser" Exomiser@noreply.github.com Subject: [Exomiser] Exomiser mis-annotating AF at multi-allelic sites? (#75)
I have a sample with two SNPs being ranked highly after filtering by AF < 1%. The reported MAX_FREQ is 0.0 and no rsid is reported, but they have allele frequencies of 30+% in ExAC. Both sites happen to be multi-allelic in ExAC, and the other SNP is rare, so perhaps Exomiser is erroneously looking up the wrong AF? Unfortunately, AF annotation issues seem to be causing widespread problems in PhenomeCentral. :( Here is the output in variants.tsv:
chr1 120611960 C T 44.0 PASS 0/1 67 MISSENSE NOTCH2:uc001eil.3:exon1:c.61G>A:p.A21T NOTCH2 6.41 0.006 . 0.59 . 0.0 . . . 0.41000003 0.72971 1.0 0.9701324 chr1 120611964 G C 168.0 PASS 0/1 69 MISSENSE NOTCH2:uc001eil.3:exon1:c.57C>G:p.C19W NOTCH2 6.292 . . 0.0 . 0.0 . . . 1.0 0.72971 1.0 0.9701324 http://exac.broadinstitute.org/variant/1-120611960-C-T http://exac.broadinstitute.org/variant/1-120611964-G-C — Reply to this email directly or view it on GitHub.
Thanks! This is Exomiser 7. I was trying to get a dev version running with the indel AF fixes, but wasn't able to build the database. Simply updating the jar and not the database didn't seem to fix things.
Sorry, here are the VCF-formatted variants.
chr1 120611960 . C T 44 . DP=67;VDB=1.508994e-01;RPB=-1.255199e+00;AF1=0.5;AC1=1;DP4=32,22,4,8;MQ=42;FQ=47;PV4=0.12,0.13,0.034,0.26 GT:PL:DP:SP:GQ 0/1:74,0,255:66:9:77
chr1 120611964 . G C 168 . DP=69;VDB=1.063292e-01;RPB=-2.582757e-01;AF1=0.5;AC1=1;DP4=19,26,19,5;MQ=43;FQ=171;PV4=0.0049,0.46,0.041,0.45 GT:PL:DP:SP:GQ 0/1:198,0,255:69:23:99
Yeah, the database is the tricksy bit of the build. Couldn't you just download it from the FTP site? The jars should be easy enough to build off development, although now you'll need to manually install the jannovar jar in your maven repository as it's no longer built by the Charité Husdon server. Either that or clone and build jannovar yourself.
-----Original Message----- From: "Orion Buske" notifications@github.com Sent: 23/04/2015 21:57 To: "exomiser/Exomiser" Exomiser@noreply.github.com Cc: "Jules Jacobsen" jules.jacobsen@gmail.com Subject: Re: [Exomiser] Exomiser mis-annotating AF at multi-allelic sites?(#75)
Thanks! This is Exomiser 7. I was trying to get a dev version running with the indel AF fixes, but wasn't able to build the database. Simply updating the jar and not the database didn't seem to fix things. Sorry, here are the VCF-formatted variants. chr1 120611960 . C T 44 . DP=67;VDB=1.508994e-01;RPB=-1.255199e+00;AF1=0.5;AC1=1;DP4=32,22,4,8;MQ=42;FQ=47;PV4=0.12,0.13,0.034,0.26 GT:PL:DP:SP:GQ 0/1:74,0,255:66:9:77 chr1 120611964 . G C 168 . DP=69;VDB=1.063292e-01;RPB=-2.582757e-01;AF1=0.5;AC1=1;DP4=19,26,19,5;MQ=43;FQ=171;PV4=0.0049,0.46,0.041,0.45 GT:PL:DP:SP:GQ 0/1:198,0,255:69:23:99 — Reply to this email directly or view it on GitHub.
This is the output from the current dev version:
#CHROM POS REF ALT QUAL FILTER GENOTYPE COVERAGE FUNCTIONAL_CLASS HGVS EXOMISER_GENE CADD(>0.483) POLYPHEN(>0.956|>0.446) MUTATIONTASTER(>0.94) SIFT(<0.06) DBSNP_ID MAX_FREQUENCY DBSNP_FREQUENCY EVS_EA_FREQUENCY EVS_AA_FREQUENCY EXAC_AFR_FREQ EXAC_AMR_FREQ EXAC_EAS_FREQ EXAC_FIN_FREQ EXAC_NFE_FREQ EXAC_SAS_FREQ EXAC_OTH_FREQ EXOMISER_VARIANT_SCORE EXOMISER_GENE_PHENO_SCORE EXOMISER_GENE_VARIANT_SCORE EXOMISER_GENE_COMBINED_SCORE
1 120611960 C T 44.0 Frequency 0/1 67 MISSENSE NOTCH2:uc001eik.3:exon1:c.61G>A:p.Ala21Thr NOTCH2 . . . . rs139076095 41.540516 15.469999 . . 41.540516 36.27728 29.13845 34.58122 36.47675 37.693577 33.974358 0.0 0.0 0.0 0.0
1 120611964 G C 168.0 Frequency 0/1 69 MISSENSE NOTCH2:uc001eik.3:exon1:c.57C>G:p.Cys19Trp NOTCH2 . . . . rs11810554 47.154045 31.91 . . 28.370878 43.54528 47.154045 46.629856 46.297115 45.897774 46.448086 0.0 0.0 0.0 0.0
These are pretty high frequency values. So looking in the database:
select * from EXOMISER.FREQUENCY where chromosome = 1 and `position` = 120611964;
CHROMOSOME position REF ALT RSID DBSNPMAF ESPEAMAF ESPAAMAF ESPALLMAF EXACAFRMAF EXACAMRMAF EXACEASMAF EXACFINMAF EXACNFEMAF EXACOTHMAF EXACSASMAF
1 120611964 G A 11810554 0.0 0.0 0.0 0.0 0.0 0.009634 0.0 0.0 0.014753 0.0 0.0
1 120611964 G C 11810554 31.91 0.0 0.0 0.0 28.370878 43.54528 47.154045 46.629856 46.297115 46.448086 45.897774
It's clear that the database is at fault with the frequency values. Dividing the by ExAC values by 100 gets us exactly the right frequency according to the links you gave, also the DbSNP value looks suspiciously high. I'll have a look at the DbSNP/ExAC parsers as I know Damian is a massive fan of unit testing so this will be well covered and likely faultless.
if (exACFreqs.containsKey("AN_AFR") && !exACFreqs.get("AN_AFR").equals("0")) {
float afr = 100f * Integer.parseInt(exACFreqs.get("AC_AFR").split(",")[minorAlleleCounter - 1]) / Integer.parseInt(exACFreqs.get("AN_AFR"));
freq.setExACFrequencyAfr(afr);
}
Hmmm....
I'll clean-up this code and probably switch over to the HTSJDK for parsing the files. It'll take a little while as there are three/four tangled files here parsing the dbSNP, ESP and ExAC frequencies and I really don't want to get this bit wrong. It'll be good conditioning though.
In the meantime you could alter the database values...
--double-check this needs dividing
update FREQUENCY set DBSNPMAF = DBSNPMAF / 100.0;
update FREQUENCY set EXACAFRMAF = EXACAFRMAF / 100.0;
update FREQUENCY set EXACAMRMAF = EXACAMRMAF / 100.0;
update FREQUENCY set EXACEASMAF = EXACEASMAF / 100.0;
update FREQUENCY set EXACFINMAF = EXACFINMAF / 100.0;
update FREQUENCY set EXACNFEMAF = EXACNFEMAF / 100.0;
update FREQUENCY set EXACOTHMAF = EXACOTHMAF / 100.0;
update FREQUENCY set EXACSASMAF = EXACSASMAF / 100.0;
So yeah, it's nothing to do with multi-allelic variants - it's simply that the allele frequencies have been mis-parsed.
Using a 1.0% frequency cut-off and having run the above database script I now get two passed variants:
#CHROM POS REF ALT QUAL FILTER GENOTYPE COVERAGE FUNCTIONAL_CLASS HGVS EXOMISER_GENE CADD(>0.483) POLYPHEN(>0.956|>0.446) MUTATIONTASTER(>0.94) SIFT(<0.06) DBSNP_ID MAX_FREQUENCY DBSNP_FREQUENCY EVS_EA_FREQUENCY EVS_AA_FREQUENCY EXAC_AFR_FREQ EXAC_AMR_FREQ EXAC_EAS_FREQ EXAC_FIN_FREQ EXAC_NFE_FREQ EXAC_SAS_FREQ EXAC_OTH_FREQ EXOMISER_VARIANT_SCORE EXOMISER_GENE_PHENO_SCORE EXOMISER_GENE_VARIANT_SCORE EXOMISER_GENE_COMBINED_SCORE
1 120611960 C T 44.0 PASS 0/1 67 MISSENSE NOTCH2:uc001eik.3:exon1:c.61G>A:p.Ala21Thr NOTCH2 6.41 0.006 . 0.59 rs139076095 0.41540515 0.1547 . . 0.41540515 0.3627728 0.29138452 0.3458122 0.3647675 0.37693578 0.33974358 0.32594067 0.45427817 0.783139 0.20198542
1 120611964 G C 168.0 PASS 0/1 69 MISSENSE NOTCH2:uc001eik.3:exon1:c.57C>G:p.Cys19Trp NOTCH2 6.292 . . 0.0 rs11810554 0.47154045 0.3191 . . 0.28370878 0.4354528 0.47154045 0.46629855 0.46297115 0.45897773 0.46448085 0.783139 0.45427817 0.783139 0.20198542
All good now?
Hey Jules,
It looks like your current dev version is correct, assuming you're intending to display allele frequencies as 0-100% values, since that's what you use for filtering. The variants should be filtered out. The issue I was having is that they weren't being, even when I built the HEAD of develoment. These variants are indeed quite common (allele frequencies around 50%). It's just that the way ExAC, dbSNP, and most databases store and report allele frequencies is a float fraction between 0-1.
Does that clarify things? Sorry for any confusion.
Sigh, looks like I wasn't quite on the head of development when I tested that. It was on a commit a few weeks back. On the latest latest version, this annotation is now giving me the correct AFs, as you showed in https://github.com/exomiser/Exomiser/issues/75#issuecomment-95730782. That said, they are still displayed at 100x, but that's appropriate given that Exomiser filters based on AF at 100x as well.
Gah! Nothing to see here. Orion please do a pull and fresh build before logging a bug - development is a very active branch. Also when you expect an outcome, can you please produce a mock-up of the expected outcome and provide the input VCF and relevant settings in exomiser format, This makes things quicker for me to test as I don't have to translate things from English to Exomish and less error-prone to debug. The given-when-then format is a good one to follow as it is reproducible. This is particularly important when we have an 8+ hour delay in communications.
Slightly confused here. Did this just turn out to be confusion between allele frequencies as a percentage vs 0-1?
On Tue, Apr 28, 2015 at 10:16 AM, Jules Jacobsen notifications@github.com wrote:
Gah! Nothing to see here. Orion please do a pull and fresh build before logging a bug - development is a very active branch. Also when you expect an outcome, can you please produce a mock-up of the expected outcome and provide the input VCF and relevant settings in exomiser format, This makes things quicker for me to test as I don't have to translate things from English to Exomish and less error-prone to debug. The given-when-then format is a good one to follow as it is reproducible. This is particularly important when we have an 8+ hour delay in communications.
— Reply to this email directly or view it on GitHub https://github.com/exomiser/Exomiser/issues/75#issuecomment-96984874.
Yeah, more or less - at least that was the confusion on my part. Orion didn't have the latest code so he was confused for different reasons about different things.
Will do, Jules. Sorry, guys. In my defense I think I was on the latest Exomiser code, but hadn't built Jannovar on the latest 0.13-SNAPSHOT, which caused that issue. Sorry for all the noise. Just trying to work back and forth with clinicians trying to analyze all the gene/variant suggestions we're providing for them, and I seem to be losing some things. Not that that justifies it. I'll be more diligent in the future.
No worries - this Jannovar switchover has caught me out as well!
On Tue, Apr 28, 2015 at 8:20 PM, Orion Buske notifications@github.com wrote:
Will do, Jules. Sorry, guys. In my defense I think I was on the latest Exomiser code, but hadn't built Jannovar on the latest 0.13-SNAPSHOT, which caused that issue. Sorry for all the noise. Just trying to work back and forth with clinicians trying to analyze all the gene/variant suggestions we're providing for them, and I seem to be losing some things.
— Reply to this email directly or view it on GitHub https://github.com/exomiser/Exomiser/issues/75#issuecomment-97174859.
I have a sample with two SNPs being ranked highly after filtering by AF < 1%. The reported MAX_FREQ is
0.0
and no rsid is reported, but they have allele frequencies of 30+% in ExAC. Both sites happen to be multi-allelic in ExAC, and the other SNP is rare, so perhaps Exomiser is erroneously looking up the wrong AF?Unfortunately, AF annotation issues seem to be causing widespread problems in PhenomeCentral. :(
Here is the output in variants.tsv:
http://exac.broadinstitute.org/variant/1-120611960-C-T http://exac.broadinstitute.org/variant/1-120611964-G-C