exomiser / Exomiser

A Tool to Annotate and Prioritize Exome Variants
https://exomiser.readthedocs.io
GNU Affero General Public License v3.0
195 stars 54 forks source link

Exomiser incorrectly parsing ExAC and dbSNP allele frequencies #75

Closed buske closed 9 years ago

buske commented 9 years ago

I have a sample with two SNPs being ranked highly after filtering by AF < 1%. The reported MAX_FREQ is 0.0 and no rsid is reported, but they have allele frequencies of 30+% in ExAC. Both sites happen to be multi-allelic in ExAC, and the other SNP is rare, so perhaps Exomiser is erroneously looking up the wrong AF?

Unfortunately, AF annotation issues seem to be causing widespread problems in PhenomeCentral. :(

Here is the output in variants.tsv:

#CHROM  POS REF ALT QUAL    FILTER  GENOTYPE    COVERAGE    FUNCTIONAL_CLASS    HGVS    EXOMISER_GENE   CADD(>0.483)    POLYPHEN(>0.956|>0.446) MUTATIONTASTER(>0.94)   SIFT(<0.06) DBSNP_ID    MAX_FREQUENCY   DBSNP_FREQUENCY EVS_EA_FREQUENCY    EVS_AA_FREQUENCY    EXOMISER_VARIANT_SCORE  EXOMISER_GENE_PHENO_SCORE   EXOMISER_GENE_VARIANT_SCORE EXOMISER_GENE_COMBINED_SCORE
chr1    120611960   C   T   44.0    PASS    0/1 67  MISSENSE    NOTCH2:uc001eil.3:exon1:c.61G>A:p.A21T  NOTCH2  6.41    0.006   .   0.59    .   0.0 .   .   .   0.41000003  0.72971 1.0 0.9701324
chr1    120611964   G   C   168.0   PASS    0/1 69  MISSENSE    NOTCH2:uc001eil.3:exon1:c.57C>G:p.C19W  NOTCH2  6.292   .   .   0.0 .   0.0 .   .   .   1.0 0.72971 1.0 0.9701324

http://exac.broadinstitute.org/variant/1-120611960-C-T http://exac.broadinstitute.org/variant/1-120611964-G-C

pnrobinson commented 9 years ago

Hi Orion, we will take a look at this. Jules is working on exomiser 7 which will have a lot of fixes and also exac data, that might be of use.

julesjacobsen commented 9 years ago

Hi Orion,

What version was this run with? 6 or 7? Only 7 has the ExAC data in the database.

Can you send the two variants in VCF format and your settings file so I can run these against the current dev version.

Multi-allelic variants should have all the alleles looked-up and independently filtered, but I think that the old jannovar failed to create two VariantEvaluations for multi-allelic variants. The new one has solved this problem.

Anyway, I'll take a look tomorrow or Monday.

Jules

-----Original Message----- From: "Orion Buske" notifications@github.com Sent: ‎23/‎04/‎2015 20:15 To: "exomiser/Exomiser" Exomiser@noreply.github.com Subject: [Exomiser] Exomiser mis-annotating AF at multi-allelic sites? (#75)

I have a sample with two SNPs being ranked highly after filtering by AF < 1%. The reported MAX_FREQ is 0.0 and no rsid is reported, but they have allele frequencies of 30+% in ExAC. Both sites happen to be multi-allelic in ExAC, and the other SNP is rare, so perhaps Exomiser is erroneously looking up the wrong AF? Unfortunately, AF annotation issues seem to be causing widespread problems in PhenomeCentral. :( Here is the output in variants.tsv:

CHROM POS REF ALT QUAL FILTER GENOTYPE COVERAGE FUNCTIONAL_CLASS HGVS EXOMISER_GENE CADD(>0.483) POLYPHEN(>0.956|>0.446) MUTATIONTASTER(>0.94) SIFT(<0.06) DBSNP_ID MAX_FREQUENCY DBSNP_FREQUENCY EVS_EA_FREQUENCY EVS_AA_FREQUENCY EXOMISER_VARIANT_SCORE EXOMISER_GENE_PHENO_SCORE EXOMISER_GENE_VARIANT_SCORE EXOMISER_GENE_COMBINED_SCORE

chr1 120611960 C T 44.0 PASS 0/1 67 MISSENSE NOTCH2:uc001eil.3:exon1:c.61G>A:p.A21T NOTCH2 6.41 0.006 . 0.59 . 0.0 . . . 0.41000003 0.72971 1.0 0.9701324 chr1 120611964 G C 168.0 PASS 0/1 69 MISSENSE NOTCH2:uc001eil.3:exon1:c.57C>G:p.C19W NOTCH2 6.292 . . 0.0 . 0.0 . . . 1.0 0.72971 1.0 0.9701324 http://exac.broadinstitute.org/variant/1-120611960-C-T http://exac.broadinstitute.org/variant/1-120611964-G-C — Reply to this email directly or view it on GitHub.

buske commented 9 years ago

Thanks! This is Exomiser 7. I was trying to get a dev version running with the indel AF fixes, but wasn't able to build the database. Simply updating the jar and not the database didn't seem to fix things.

Sorry, here are the VCF-formatted variants.

chr1    120611960   .   C   T   44  .   DP=67;VDB=1.508994e-01;RPB=-1.255199e+00;AF1=0.5;AC1=1;DP4=32,22,4,8;MQ=42;FQ=47;PV4=0.12,0.13,0.034,0.26   GT:PL:DP:SP:GQ  0/1:74,0,255:66:9:77
chr1    120611964   .   G   C   168 .   DP=69;VDB=1.063292e-01;RPB=-2.582757e-01;AF1=0.5;AC1=1;DP4=19,26,19,5;MQ=43;FQ=171;PV4=0.0049,0.46,0.041,0.45   GT:PL:DP:SP:GQ  0/1:198,0,255:69:23:99
julesjacobsen commented 9 years ago

Yeah, the database is the tricksy bit of the build. Couldn't you just download it from the FTP site? The jars should be easy enough to build off development, although now you'll need to manually install the jannovar jar in your maven repository as it's no longer built by the Charité Husdon server. Either that or clone and build jannovar yourself.

-----Original Message----- From: "Orion Buske" notifications@github.com Sent: ‎23/‎04/‎2015 21:57 To: "exomiser/Exomiser" Exomiser@noreply.github.com Cc: "Jules Jacobsen" jules.jacobsen@gmail.com Subject: Re: [Exomiser] Exomiser mis-annotating AF at multi-allelic sites?(#75)

Thanks! This is Exomiser 7. I was trying to get a dev version running with the indel AF fixes, but wasn't able to build the database. Simply updating the jar and not the database didn't seem to fix things. Sorry, here are the VCF-formatted variants. chr1 120611960 . C T 44 . DP=67;VDB=1.508994e-01;RPB=-1.255199e+00;AF1=0.5;AC1=1;DP4=32,22,4,8;MQ=42;FQ=47;PV4=0.12,0.13,0.034,0.26 GT:PL:DP:SP:GQ 0/1:74,0,255:66:9:77 chr1 120611964 . G C 168 . DP=69;VDB=1.063292e-01;RPB=-2.582757e-01;AF1=0.5;AC1=1;DP4=19,26,19,5;MQ=43;FQ=171;PV4=0.0049,0.46,0.041,0.45 GT:PL:DP:SP:GQ 0/1:198,0,255:69:23:99 — Reply to this email directly or view it on GitHub.

julesjacobsen commented 9 years ago

This is the output from the current dev version:

#CHROM  POS REF ALT QUAL    FILTER  GENOTYPE    COVERAGE    FUNCTIONAL_CLASS    HGVS    EXOMISER_GENE   CADD(>0.483)    POLYPHEN(>0.956|>0.446) MUTATIONTASTER(>0.94)   SIFT(<0.06) DBSNP_ID    MAX_FREQUENCY   DBSNP_FREQUENCY EVS_EA_FREQUENCY    EVS_AA_FREQUENCY    EXAC_AFR_FREQ   EXAC_AMR_FREQ   EXAC_EAS_FREQ   EXAC_FIN_FREQ   EXAC_NFE_FREQ   EXAC_SAS_FREQ   EXAC_OTH_FREQ   EXOMISER_VARIANT_SCORE  EXOMISER_GENE_PHENO_SCORE   EXOMISER_GENE_VARIANT_SCORE EXOMISER_GENE_COMBINED_SCORE
1   120611960   C   T   44.0    Frequency   0/1 67  MISSENSE    NOTCH2:uc001eik.3:exon1:c.61G>A:p.Ala21Thr  NOTCH2  .   .   .   .   rs139076095 41.540516   15.469999   .   .   41.540516   36.27728    29.13845    34.58122    36.47675    37.693577   33.974358   0.0 0.0 0.0 0.0
1   120611964   G   C   168.0   Frequency   0/1 69  MISSENSE    NOTCH2:uc001eik.3:exon1:c.57C>G:p.Cys19Trp  NOTCH2  .   .   .   .   rs11810554  47.154045   31.91   .   .   28.370878   43.54528    47.154045   46.629856   46.297115   45.897774   46.448086   0.0 0.0 0.0 0.0

These are pretty high frequency values. So looking in the database:

select * from EXOMISER.FREQUENCY where chromosome = 1 and `position` = 120611964;

CHROMOSOME  position    REF ALT RSID    DBSNPMAF    ESPEAMAF    ESPAAMAF    ESPALLMAF   EXACAFRMAF  EXACAMRMAF  EXACEASMAF  EXACFINMAF  EXACNFEMAF  EXACOTHMAF  EXACSASMAF
1   120611964   G   A   11810554    0.0 0.0 0.0 0.0 0.0 0.009634    0.0 0.0 0.014753    0.0 0.0
1   120611964   G   C   11810554    31.91   0.0 0.0 0.0 28.370878   43.54528    47.154045   46.629856   46.297115   46.448086   45.897774

It's clear that the database is at fault with the frequency values. Dividing the by ExAC values by 100 gets us exactly the right frequency according to the links you gave, also the DbSNP value looks suspiciously high. I'll have a look at the DbSNP/ExAC parsers as I know Damian is a massive fan of unit testing so this will be well covered and likely faultless.

if (exACFreqs.containsKey("AN_AFR") && !exACFreqs.get("AN_AFR").equals("0")) {
   float afr = 100f * Integer.parseInt(exACFreqs.get("AC_AFR").split(",")[minorAlleleCounter - 1]) / Integer.parseInt(exACFreqs.get("AN_AFR"));
    freq.setExACFrequencyAfr(afr);
}

Hmmm....

I'll clean-up this code and probably switch over to the HTSJDK for parsing the files. It'll take a little while as there are three/four tangled files here parsing the dbSNP, ESP and ExAC frequencies and I really don't want to get this bit wrong. It'll be good conditioning though.

In the meantime you could alter the database values...

--double-check this needs dividing
update FREQUENCY set DBSNPMAF = DBSNPMAF / 100.0;

update FREQUENCY set EXACAFRMAF = EXACAFRMAF / 100.0;
update FREQUENCY set EXACAMRMAF = EXACAMRMAF / 100.0;
update FREQUENCY set EXACEASMAF = EXACEASMAF / 100.0;
update FREQUENCY set EXACFINMAF = EXACFINMAF / 100.0;
update FREQUENCY set EXACNFEMAF = EXACNFEMAF / 100.0;
update FREQUENCY set EXACOTHMAF = EXACOTHMAF / 100.0;
update FREQUENCY set EXACSASMAF = EXACSASMAF / 100.0;
julesjacobsen commented 9 years ago

So yeah, it's nothing to do with multi-allelic variants - it's simply that the allele frequencies have been mis-parsed.

julesjacobsen commented 9 years ago

Using a 1.0% frequency cut-off and having run the above database script I now get two passed variants:

#CHROM  POS REF ALT QUAL    FILTER  GENOTYPE    COVERAGE    FUNCTIONAL_CLASS    HGVS    EXOMISER_GENE   CADD(>0.483)    POLYPHEN(>0.956|>0.446) MUTATIONTASTER(>0.94)   SIFT(<0.06) DBSNP_ID    MAX_FREQUENCY   DBSNP_FREQUENCY EVS_EA_FREQUENCY    EVS_AA_FREQUENCY    EXAC_AFR_FREQ   EXAC_AMR_FREQ   EXAC_EAS_FREQ   EXAC_FIN_FREQ   EXAC_NFE_FREQ   EXAC_SAS_FREQ   EXAC_OTH_FREQ   EXOMISER_VARIANT_SCORE  EXOMISER_GENE_PHENO_SCORE   EXOMISER_GENE_VARIANT_SCORE EXOMISER_GENE_COMBINED_SCORE
1   120611960   C   T   44.0    PASS    0/1 67  MISSENSE    NOTCH2:uc001eik.3:exon1:c.61G>A:p.Ala21Thr  NOTCH2  6.41    0.006   .   0.59    rs139076095 0.41540515  0.1547  .   .   0.41540515  0.3627728   0.29138452  0.3458122   0.3647675   0.37693578  0.33974358  0.32594067  0.45427817  0.783139    0.20198542
1   120611964   G   C   168.0   PASS    0/1 69  MISSENSE    NOTCH2:uc001eik.3:exon1:c.57C>G:p.Cys19Trp  NOTCH2  6.292   .   .   0.0 rs11810554  0.47154045  0.3191  .   .   0.28370878  0.4354528   0.47154045  0.46629855  0.46297115  0.45897773  0.46448085  0.783139    0.45427817  0.783139    0.20198542

All good now?

buske commented 9 years ago

Hey Jules,

It looks like your current dev version is correct, assuming you're intending to display allele frequencies as 0-100% values, since that's what you use for filtering. The variants should be filtered out. The issue I was having is that they weren't being, even when I built the HEAD of develoment. These variants are indeed quite common (allele frequencies around 50%). It's just that the way ExAC, dbSNP, and most databases store and report allele frequencies is a float fraction between 0-1.

Does that clarify things? Sorry for any confusion.

buske commented 9 years ago

Sigh, looks like I wasn't quite on the head of development when I tested that. It was on a commit a few weeks back. On the latest latest version, this annotation is now giving me the correct AFs, as you showed in https://github.com/exomiser/Exomiser/issues/75#issuecomment-95730782. That said, they are still displayed at 100x, but that's appropriate given that Exomiser filters based on AF at 100x as well.

julesjacobsen commented 9 years ago

Gah! Nothing to see here. Orion please do a pull and fresh build before logging a bug - development is a very active branch. Also when you expect an outcome, can you please produce a mock-up of the expected outcome and provide the input VCF and relevant settings in exomiser format, This makes things quicker for me to test as I don't have to translate things from English to Exomish and less error-prone to debug. The given-when-then format is a good one to follow as it is reproducible. This is particularly important when we have an 8+ hour delay in communications.

damiansm commented 9 years ago

Slightly confused here. Did this just turn out to be confusion between allele frequencies as a percentage vs 0-1?

On Tue, Apr 28, 2015 at 10:16 AM, Jules Jacobsen notifications@github.com wrote:

Gah! Nothing to see here. Orion please do a pull and fresh build before logging a bug - development is a very active branch. Also when you expect an outcome, can you please produce a mock-up of the expected outcome and provide the input VCF and relevant settings in exomiser format, This makes things quicker for me to test as I don't have to translate things from English to Exomish and less error-prone to debug. The given-when-then format is a good one to follow as it is reproducible. This is particularly important when we have an 8+ hour delay in communications.

— Reply to this email directly or view it on GitHub https://github.com/exomiser/Exomiser/issues/75#issuecomment-96984874.

julesjacobsen commented 9 years ago

Yeah, more or less - at least that was the confusion on my part. Orion didn't have the latest code so he was confused for different reasons about different things.

buske commented 9 years ago

Will do, Jules. Sorry, guys. In my defense I think I was on the latest Exomiser code, but hadn't built Jannovar on the latest 0.13-SNAPSHOT, which caused that issue. Sorry for all the noise. Just trying to work back and forth with clinicians trying to analyze all the gene/variant suggestions we're providing for them, and I seem to be losing some things. Not that that justifies it. I'll be more diligent in the future.

damiansm commented 9 years ago

No worries - this Jannovar switchover has caught me out as well!

On Tue, Apr 28, 2015 at 8:20 PM, Orion Buske notifications@github.com wrote:

Will do, Jules. Sorry, guys. In my defense I think I was on the latest Exomiser code, but hadn't built Jannovar on the latest 0.13-SNAPSHOT, which caused that issue. Sorry for all the noise. Just trying to work back and forth with clinicians trying to analyze all the gene/variant suggestions we're providing for them, and I seem to be losing some things.

— Reply to this email directly or view it on GitHub https://github.com/exomiser/Exomiser/issues/75#issuecomment-97174859.