Exomiser parsing of indels is incorrect?

buske commented 9 years ago

It seems that indels are not getting parsed correctly, resulting in the AF and dbSNP lookups to fail.

For example, the VCF file contains: chr11 61165731 . C CA

This results in the following annotation in the exomiser output: chr11:g.61165731->A, which is incorrect. It should be g.6116573_2_

The output lists there as being no frequency data, but this is actually rs11382548 with MAF 14%

Not sure how common this is? Can anyone confirm?

visze commented 9 years ago

I think it will be the best, if the Exomiser prints out the variant like in the initial VCF format. It makes no sense to remove the reference allele (and this leads to position errors, as Orion noticed).

The frequency of rs11382548 was added in dbSNP build 142 (snp present since version 120). So maybe Exomiser has not the newest dbSNP version.

Am 03.02.2015 um 16:09 schrieb Orion Buske notifications@github.com:

It seems that indels are not getting parsed correctly, resulting in the AF and dbSNP lookups to fail.

For example, the VCF file contains: chr11 61165731 . C CA

This results in the following annotation in the exomiser output: chr11:g.61165731->A, which is incorrect. It should be g.61165732

The output lists there as being no frequency data, but this is actually rs11382548 with MAF 14%

Not sure how common this is? Can anyone confirm?

— Reply to this email directly or view it on GitHub.

visze commented 9 years ago

To be correct:

it is an insertion of A between 61165731 and 61165732 of the reference sequence. Therefore also g.61165732 is incorrect, because the position is related to the reference sequence.

buske commented 9 years ago

I would actually tend to suggest using the stripped version for output and all database lookups as it is unambiguous. Otherwise, the same variant can occur with different numbers of prefix letters (and different positions), leading to further database lookup errors.

buske commented 9 years ago

@visze Okay, fair point on the position. So it sounds like it's not getting looked up because of the dbSNP version, and there is no underlying cause for concerns.

damiansm commented 9 years ago

Not entirely sure what is happening here but the db has the data:

nsfpalizer=# select * from frequency where chromosome = 11 and position = 61165731; chromosome | position | ref | alt | rsid | dbsnpmaf | espeamaf | espaamaf | espallmaf ------------+----------+-----+-----+----------+----------+----------+-----------+----------- 11 | 61165731 | C | CA | 11382548 | | 14.11 | 23.857599 | 17.2379

What version of Exomiser are you using Orion? I know Jannovar always chooses to strip off the ref allele - there was a long debate a while back about whether this was right or not. And at some point the code was fine i.e. it knew what Jannovar was doing and still managed to look up the indels.

This definitely needs investigating

On Tue, Feb 3, 2015 at 3:26 PM, Orion Buske notifications@github.com wrote:

@visze https://github.com/visze Okay, fair point on the position. So it sounds like it's not getting looked up because of the dbSNP version, and there is no underlying cause for concerns.

— Reply to this email directly or view it on GitHub https://github.com/exomiser/Exomiser/issues/36#issuecomment-72669605.

damiansm commented 9 years ago

and in the variant table that stores pathogenicity data it is:

nsfpalizer=# select * from variant where chromosome = 11 and position = 61165731; chromosome | position | ref | alt | aaref | aaalt | aapos | sift | polyphen | mut_taster | phylop | cadd | cadd_raw ------------+----------+-----+-----+-------+-------+-------+------+----------+------------+--------+-------+---------- 11 | 61165731 | - | A | | | 0 | -5 | -5 | -5 | -5 | 20.3 | 3.973177

I have a feeling we used to consistently store the position and alleles the same way Jannovar does but at some point the frequency table has got broken for indels?

On Tue, Feb 3, 2015 at 3:35 PM, Damian Smedley damiansmedley@gmail.com wrote:

Not entirely sure what is happening here but the db has the data:

nsfpalizer=# select * from frequency where chromosome = 11 and position = 61165731; chromosome | position | ref | alt | rsid | dbsnpmaf | espeamaf | espaamaf | espallmaf

------------+----------+-----+-----+----------+----------+----------+-----------+----------- 11 | 61165731 | C | CA | 11382548 | | 14.11 | 23.857599 | 17.2379

What version of Exomiser are you using Orion? I know Jannovar always chooses to strip off the ref allele - there was a long debate a while back about whether this was right or not. And at some point the code was fine i.e. it knew what Jannovar was doing and still managed to look up the indels.

This definitely needs investigating

On Tue, Feb 3, 2015 at 3:26 PM, Orion Buske notifications@github.com wrote:

@visze https://github.com/visze Okay, fair point on the position. So it sounds like it's not getting looked up because of the dbSNP version, and there is no underlying cause for concerns.

— Reply to this email directly or view it on GitHub https://github.com/exomiser/Exomiser/issues/36#issuecomment-72669605.

holtgrewe commented 9 years ago

The variant representation in Jannovar has changed in my updated versions. I was planning to integrate the new version with the Exomiser in this month.

If the fix is not easy it might be simpler to just integrate the new Jannovar version and make sure that works properly.

damiansm commented 9 years ago

I think the problem may be in VCF2FrequencyParser.transformVCF2AnnovarCoordinates.

It looks as though nothing is ever done with teh reset positions and ref and alt alleles

On Tue, Feb 3, 2015 at 3:40 PM, Damian Smedley damiansmedley@gmail.com wrote:

and in the variant table that stores pathogenicity data it is:

nsfpalizer=# select * from variant where chromosome = 11 and position = 61165731; chromosome | position | ref | alt | aaref | aaalt | aapos | sift | polyphen | mut_taster | phylop | cadd | cadd_raw

------------+----------+-----+-----+-------+-------+-------+------+----------+------------+--------+-------+---------- 11 | 61165731 | - | A | | | 0 | -5 | -5 | -5 | -5 | 20.3 | 3.973177

I have a feeling we used to consistently store the position and alleles the same way Jannovar does but at some point the frequency table has got broken for indels?

On Tue, Feb 3, 2015 at 3:35 PM, Damian Smedley damiansmedley@gmail.com wrote:

Not entirely sure what is happening here but the db has the data:

nsfpalizer=# select * from frequency where chromosome = 11 and position = 61165731; chromosome | position | ref | alt | rsid | dbsnpmaf | espeamaf | espaamaf | espallmaf

------------+----------+-----+-----+----------+----------+----------+-----------+----------- 11 | 61165731 | C | CA | 11382548 | | 14.11 | 23.857599 | 17.2379

What version of Exomiser are you using Orion? I know Jannovar always chooses to strip off the ref allele - there was a long debate a while back about whether this was right or not. And at some point the code was fine i.e. it knew what Jannovar was doing and still managed to look up the indels.

This definitely needs investigating

On Tue, Feb 3, 2015 at 3:26 PM, Orion Buske notifications@github.com wrote:

@visze https://github.com/visze Okay, fair point on the position. So it sounds like it's not getting looked up because of the dbSNP version, and there is no underlying cause for concerns.

— Reply to this email directly or view it on GitHub https://github.com/exomiser/Exomiser/issues/36#issuecomment-72669605.

damiansm commented 9 years ago

Does that mean it would be represented as chr11:g.61165731C->CA in the new version of Jannovar and we would no longer need to do any conversion of coordinates/alleles when we generate the database?

On Tue, Feb 3, 2015 at 3:46 PM, Manuel Holtgrewe notifications@github.com wrote:

The variant representation in Jannovar has changed in my updated versions. I was planning to integrate the new version with the Exomiser in this month.

If the fix is not easy it might be simpler to just integrate the new Jannovar version and make sure that works properly.

— Reply to this email directly or view it on GitHub https://github.com/exomiser/Exomiser/issues/36#issuecomment-72673554.

holtgrewe commented 9 years ago

Jannovar uses the HTSJDK internally, i.e. you can see the VCF file directly if you want to.

Each alternative allele of the HTSJDK VariantContext is converted into a Jannovar GenomeChange object. These objects describe changes in the sequence using zero-based coordinates, in pseudo-code:

GenomeChange(1, 123, "A", "C") describes a SNV of the 124-th base (in 1-based notation) on chromosome 1.
GenomeChange(1, 234, "", "CGT") describes the insertion of the string CGT before the 234-th base (1-based).
GenomeChange(1, 456, "CGT", "") describes the deletion of the 457th to the 459th base (1-based).

Of course, any conversion can be made on an update.

pnrobinson commented 9 years ago

Thanks for pointing this out, Orion. Another thing that I noticed recently that we need to do better is when there is a deletion or insertion right in the splice donor or acceptor site. Jannovar currently annotated this as c.123->C, for instance, which is incorrect. Manuel, can we discuss this? thanks Peter

Von: Orion Buske [notifications@github.com] Gesendet: Dienstag, 3. Februar 2015 16:09 An: exomiser/Exomiser Betreff: [Exomiser] Exomiser parsing of indels is incorrect? (#36)

It seems that indels are not getting parsed correctly, resulting in the AF and dbSNP lookups to fail.

For example, the VCF file contains: chr11 61165731 . C CA

This results in the following annotation in the exomiser output: chr11:g.61165731->A, which is incorrect. It should be g.61165732

The output lists there as being no frequency data, but this is actually rs11382548http://www.ncbi.nlm.nih.gov/SNP/snp_ref.cgi?searchType=adhoc_search&type=rs&rs=rs11382548 with MAF 14%

Not sure how common this is? Can anyone confirm?

— Reply to this email directly or view it on GitHubhttps://github.com/exomiser/Exomiser/issues/36.

holtgrewe commented 9 years ago

@buske the current Jannovar version (upcoming v0.13 definitely, v0.11 should give the same HGVS annotation albeit with the old effect names) yields the following:

# java -jar jannovar-cli/target/jannovar-cli-0.13-SNAPSHOT.jar annotate-pos data/hg19_ucsc.ser 'chr1:61165732>A'
[...]
#change effect  hgvs_annotation
chr1:61165732>A NON_CODING_TRANSCRIPT_INTRON_VARIANT    AK097193:uc001czt.1:n.463+29515_463+29516insT

@pnrobinson the current Jannovar version should handle this correctly:

# java -jar jannovar-cli/target/jannovar-cli-0.13-SNAPSHOT.jar annotate-pos data/hg19_ucsc.ser 'chr1:61127145C>A' 'chr1:61127145>A'
[...]
#change effect  hgvs_annotation
chr1:61127145C>A    SPLICE_REGION_VARIANT+NON_CODING_TRANSCRIPT_INTRON_VARIANT  AK097193:uc001czt.1:n.505-3G>T
chr1:61127145>A SPLICE_REGION_VARIANT+NON_CODING_TRANSCRIPT_INTRON_VARIANT  AK097193:uc001czt.1:n.505-3_505-2insT

buske commented 9 years ago

@damiansm, this happened on both the 6.0.0 binary downloaded directly, and the latest build of our phenomecentral branch (built off of development commit: adb7bb0d).

buske commented 9 years ago

Here are a few more examples we ran into. All of these appear are technically "fixed" in Jannovar v0.12 because it does not appear to do prefix/suffix stripping. If Exomiser continues to do variant normalization, I'd suggest doing something like: http://genome.sph.umich.edu/wiki/Variant_Normalization.

The variant is improperly reduced, resulting in both an incorrect position (+2 instead of +1) and an incorrect alt allele (TCCGCCGCC instead of CCTCCGCCG).

VCF:

chr1    120612040       .       Tcc     TCCTCCGCCGcc    217     .       INDEL;DP=58;VDB=0.0430 AF1=0.5;AC1=1;DP4=11,13,10,13;MQ=43;FQ=217;PV4=1,1,0.18,1      GT:PL:DP:SP:GQ  0/1:255,0,255:47:0:99

exomiser.variant.tsv:

chr1    120612042    -    TCCGCCGCC    217.0    Pathogenicity    0/1    58    UTR5    NOTCH2:uc001eil.3:c.-22_-21insGGCGGCGGA    NOTCH2    .    ..    .    .    0.0    .    .    .    0.0    0.6022212    1.0    0.8961792

The variant is not left-aligned, resulting in an incorrect position (+5 instead of +1)

VCF:

chr6    31242257        .       accccc  acccc   9.9     .       INDEL;DP=3;VDB=0.0191;AF1=1;AC1=2;DP4=0,0,0,3;MQ=60;FQ=-43.5    GT:PL:DP:SP:GQ  1/1:49,9,0:3:0:12

exomiser.variant.tsv:

chr6    31242262    C    -    9.9    Target    1/1    3    INTRONIC    HLA-B:uc003ntf.2:intron2:c.344-3137G>-    HLA-B    .    .    ..    .    .    .    .    .    0.0    0.5999689    1.0    0.8939807

holtgrewe commented 9 years ago

Hi @buske, thanks for the link.

The different representation of variants in the input VCF and the output TSV/VCF in Exomiser is something that Jules and I already talked about. Our proposal is to write out the same REF and ALT fields as is in the input VCF file.

The Exomiser and Jannovar do not know about the reference sequence (IIRC for the Exomiser) and thus proper normalization is currently not easily possible. However, this could be added later since the normalization is trivial given the proof and algorithm in your link.

Within the Exomiser, variant trimming is done by Jannovar that first does trimming (not following the algorithm in your link) and then 3' alignment for each transcript as required for the HGVS output. This is based on the mRNA sequence.

Do you think the following is a sensible approach for now?

the Exomiser itself is agnostic about variant normalization by default for now in its input and output and for the database lookup and requires normalized VCF files
Jannovar continues to perform its variant normalization for now based on the mRNA and in 3' direction but uses the algorithm in your link
we provide a little precompiled Java standalone tool for variant normalization if people know the pain about installing

buske commented 9 years ago

@holtgrewe Sorry for the slow response. Yes, I think that sounds good. Honestly, however you handle it is fine with me, I just wanted to make sure it was all being discussed so we don't run into erroneous results downstream.

I believe we would be very interested in a little standalone variant normalization tool, as well. :)

damiansm commented 9 years ago

Current status on 7.0.0.BETA with Jannovar 0.13 incorporated. (i) Seems Exomiser just outputs what was input i.e. the 11:61165731 C CA variant comes out as that. (ii) The internal representation within Exomiser/Jannovar seems to be pos=61165731, ref="-", alt="A"

Therefore it seems to me that the indels in the frequency table should be normalized before loading into the database so they are correctly loaded. This used to happen but at some stage got broken. I will fix this and then hopefully we are there.

The only gotcha is my normalization should correspond to Jannovar normalization.

damiansm commented 9 years ago

I have now tested this fix in 7.0.0.BETA and all seems fine

buske commented 9 years ago

Here's another variant for testing:

On Exomiser 6.0 this variant gets shifted to a point where it's called as novel, despite the allele frequency of around 30%.

Here's the line from the VCF file:

chr15   100252709       .       Ccagcagcagcagcagcagcagcagcagcagcagc     Ccagcagcagcagcagcagcagcagcagc   999     .       INDEL;DP=521;VDB=0.0390;AF1=0.25;AC1=1;DP4=66,106,47,95;MQ=50;FQ=999;PV4=0.35,1,1,1     GT:PL:GQ        0/1:255,0,255:99

And the line from the Exomiser 6.0 output tsv file:

chr15   100252738       AGCAGC  -       999.0   PASS    0/1     521     NON_FS_DELETION MEF2A:uc010bot.3:exon8:c.1052_1057del:p.Q351_Q352del

Which is the same as this variant: http://exac.broadinstitute.org/variant/15-100252709-CCAGCAG-C

exomiser / Exomiser

Exomiser parsing of indels is incorrect? #36