Closed buske closed 9 years ago
I think it will be the best, if the Exomiser prints out the variant like in the initial VCF format. It makes no sense to remove the reference allele (and this leads to position errors, as Orion noticed).
The frequency of rs11382548 was added in dbSNP build 142 (snp present since version 120). So maybe Exomiser has not the newest dbSNP version.
Am 03.02.2015 um 16:09 schrieb Orion Buske notifications@github.com:
It seems that indels are not getting parsed correctly, resulting in the AF and dbSNP lookups to fail.
For example, the VCF file contains: chr11 61165731 . C CA
This results in the following annotation in the exomiser output: chr11:g.61165731->A, which is incorrect. It should be g.61165732
The output lists there as being no frequency data, but this is actually rs11382548 with MAF 14%
Not sure how common this is? Can anyone confirm?
— Reply to this email directly or view it on GitHub.
To be correct:
it is an insertion of A between 61165731 and 61165732 of the reference sequence. Therefore also g.61165732 is incorrect, because the position is related to the reference sequence.
I would actually tend to suggest using the stripped version for output and all database lookups as it is unambiguous. Otherwise, the same variant can occur with different numbers of prefix letters (and different positions), leading to further database lookup errors.
@visze Okay, fair point on the position. So it sounds like it's not getting looked up because of the dbSNP version, and there is no underlying cause for concerns.
Not entirely sure what is happening here but the db has the data:
nsfpalizer=# select * from frequency where chromosome = 11 and position = 61165731; chromosome | position | ref | alt | rsid | dbsnpmaf | espeamaf | espaamaf | espallmaf ------------+----------+-----+-----+----------+----------+----------+-----------+----------- 11 | 61165731 | C | CA | 11382548 | | 14.11 | 23.857599 | 17.2379
What version of Exomiser are you using Orion? I know Jannovar always chooses to strip off the ref allele - there was a long debate a while back about whether this was right or not. And at some point the code was fine i.e. it knew what Jannovar was doing and still managed to look up the indels.
This definitely needs investigating
On Tue, Feb 3, 2015 at 3:26 PM, Orion Buske notifications@github.com wrote:
@visze https://github.com/visze Okay, fair point on the position. So it sounds like it's not getting looked up because of the dbSNP version, and there is no underlying cause for concerns.
— Reply to this email directly or view it on GitHub https://github.com/exomiser/Exomiser/issues/36#issuecomment-72669605.
and in the variant table that stores pathogenicity data it is:
nsfpalizer=# select * from variant where chromosome = 11 and position = 61165731; chromosome | position | ref | alt | aaref | aaalt | aapos | sift | polyphen | mut_taster | phylop | cadd | cadd_raw ------------+----------+-----+-----+-------+-------+-------+------+----------+------------+--------+-------+---------- 11 | 61165731 | - | A | | | 0 | -5 | -5 | -5 | -5 | 20.3 | 3.973177
I have a feeling we used to consistently store the position and alleles the same way Jannovar does but at some point the frequency table has got broken for indels?
On Tue, Feb 3, 2015 at 3:35 PM, Damian Smedley damiansmedley@gmail.com wrote:
Not entirely sure what is happening here but the db has the data:
nsfpalizer=# select * from frequency where chromosome = 11 and position = 61165731; chromosome | position | ref | alt | rsid | dbsnpmaf | espeamaf | espaamaf | espallmaf
------------+----------+-----+-----+----------+----------+----------+-----------+----------- 11 | 61165731 | C | CA | 11382548 | | 14.11 | 23.857599 | 17.2379
What version of Exomiser are you using Orion? I know Jannovar always chooses to strip off the ref allele - there was a long debate a while back about whether this was right or not. And at some point the code was fine i.e. it knew what Jannovar was doing and still managed to look up the indels.
This definitely needs investigating
On Tue, Feb 3, 2015 at 3:26 PM, Orion Buske notifications@github.com wrote:
@visze https://github.com/visze Okay, fair point on the position. So it sounds like it's not getting looked up because of the dbSNP version, and there is no underlying cause for concerns.
— Reply to this email directly or view it on GitHub https://github.com/exomiser/Exomiser/issues/36#issuecomment-72669605.
The variant representation in Jannovar has changed in my updated versions. I was planning to integrate the new version with the Exomiser in this month.
If the fix is not easy it might be simpler to just integrate the new Jannovar version and make sure that works properly.
I think the problem may be in VCF2FrequencyParser.transformVCF2AnnovarCoordinates.
It looks as though nothing is ever done with teh reset positions and ref and alt alleles
On Tue, Feb 3, 2015 at 3:40 PM, Damian Smedley damiansmedley@gmail.com wrote:
and in the variant table that stores pathogenicity data it is:
nsfpalizer=# select * from variant where chromosome = 11 and position = 61165731; chromosome | position | ref | alt | aaref | aaalt | aapos | sift | polyphen | mut_taster | phylop | cadd | cadd_raw
------------+----------+-----+-----+-------+-------+-------+------+----------+------------+--------+-------+---------- 11 | 61165731 | - | A | | | 0 | -5 | -5 | -5 | -5 | 20.3 | 3.973177
I have a feeling we used to consistently store the position and alleles the same way Jannovar does but at some point the frequency table has got broken for indels?
On Tue, Feb 3, 2015 at 3:35 PM, Damian Smedley damiansmedley@gmail.com wrote:
Not entirely sure what is happening here but the db has the data:
nsfpalizer=# select * from frequency where chromosome = 11 and position = 61165731; chromosome | position | ref | alt | rsid | dbsnpmaf | espeamaf | espaamaf | espallmaf
------------+----------+-----+-----+----------+----------+----------+-----------+----------- 11 | 61165731 | C | CA | 11382548 | | 14.11 | 23.857599 | 17.2379
What version of Exomiser are you using Orion? I know Jannovar always chooses to strip off the ref allele - there was a long debate a while back about whether this was right or not. And at some point the code was fine i.e. it knew what Jannovar was doing and still managed to look up the indels.
This definitely needs investigating
On Tue, Feb 3, 2015 at 3:26 PM, Orion Buske notifications@github.com wrote:
@visze https://github.com/visze Okay, fair point on the position. So it sounds like it's not getting looked up because of the dbSNP version, and there is no underlying cause for concerns.
— Reply to this email directly or view it on GitHub https://github.com/exomiser/Exomiser/issues/36#issuecomment-72669605.
Does that mean it would be represented as chr11:g.61165731C->CA in the new version of Jannovar and we would no longer need to do any conversion of coordinates/alleles when we generate the database?
On Tue, Feb 3, 2015 at 3:46 PM, Manuel Holtgrewe notifications@github.com wrote:
The variant representation in Jannovar has changed in my updated versions. I was planning to integrate the new version with the Exomiser in this month.
If the fix is not easy it might be simpler to just integrate the new Jannovar version and make sure that works properly.
— Reply to this email directly or view it on GitHub https://github.com/exomiser/Exomiser/issues/36#issuecomment-72673554.
Jannovar uses the HTSJDK internally, i.e. you can see the VCF file directly if you want to.
Each alternative allele of the HTSJDK VariantContext
is converted into a Jannovar GenomeChange
object. These objects describe changes in the sequence using zero-based coordinates, in pseudo-code:
GenomeChange(1, 123, "A", "C")
describes a SNV of the 124-th base (in 1-based notation) on chromosome 1.GenomeChange(1, 234, "", "CGT")
describes the insertion of the string CGT
before the 234-th base (1-based).GenomeChange(1, 456, "CGT", "")
describes the deletion of the 457th to the 459th base (1-based).Of course, any conversion can be made on an update.
Thanks for pointing this out, Orion. Another thing that I noticed recently that we need to do better is when there is a deletion or insertion right in the splice donor or acceptor site. Jannovar currently annotated this as c.123->C, for instance, which is incorrect. Manuel, can we discuss this? thanks Peter
Von: Orion Buske [notifications@github.com] Gesendet: Dienstag, 3. Februar 2015 16:09 An: exomiser/Exomiser Betreff: [Exomiser] Exomiser parsing of indels is incorrect? (#36)
It seems that indels are not getting parsed correctly, resulting in the AF and dbSNP lookups to fail.
For example, the VCF file contains: chr11 61165731 . C CA
This results in the following annotation in the exomiser output: chr11:g.61165731->A, which is incorrect. It should be g.61165732
The output lists there as being no frequency data, but this is actually rs11382548http://www.ncbi.nlm.nih.gov/SNP/snp_ref.cgi?searchType=adhoc_search&type=rs&rs=rs11382548 with MAF 14%
Not sure how common this is? Can anyone confirm?
— Reply to this email directly or view it on GitHubhttps://github.com/exomiser/Exomiser/issues/36.
@buske the current Jannovar version (upcoming v0.13
definitely, v0.11
should give the same HGVS annotation albeit with the old effect names) yields the following:
# java -jar jannovar-cli/target/jannovar-cli-0.13-SNAPSHOT.jar annotate-pos data/hg19_ucsc.ser 'chr1:61165732>A'
[...]
#change effect hgvs_annotation
chr1:61165732>A NON_CODING_TRANSCRIPT_INTRON_VARIANT AK097193:uc001czt.1:n.463+29515_463+29516insT
@pnrobinson the current Jannovar version should handle this correctly:
# java -jar jannovar-cli/target/jannovar-cli-0.13-SNAPSHOT.jar annotate-pos data/hg19_ucsc.ser 'chr1:61127145C>A' 'chr1:61127145>A'
[...]
#change effect hgvs_annotation
chr1:61127145C>A SPLICE_REGION_VARIANT+NON_CODING_TRANSCRIPT_INTRON_VARIANT AK097193:uc001czt.1:n.505-3G>T
chr1:61127145>A SPLICE_REGION_VARIANT+NON_CODING_TRANSCRIPT_INTRON_VARIANT AK097193:uc001czt.1:n.505-3_505-2insT
@damiansm, this happened on both the 6.0.0 binary downloaded directly, and the latest build of our phenomecentral branch (built off of development commit: adb7bb0d).
Here are a few more examples we ran into. All of these appear are technically "fixed" in Jannovar v0.12 because it does not appear to do prefix/suffix stripping. If Exomiser continues to do variant normalization, I'd suggest doing something like: http://genome.sph.umich.edu/wiki/Variant_Normalization.
The variant is improperly reduced, resulting in both an incorrect position (+2 instead of +1) and an incorrect alt allele (TCCGCCGCC instead of CCTCCGCCG).
chr1 120612040 . Tcc TCCTCCGCCGcc 217 . INDEL;DP=58;VDB=0.0430 AF1=0.5;AC1=1;DP4=11,13,10,13;MQ=43;FQ=217;PV4=1,1,0.18,1 GT:PL:DP:SP:GQ 0/1:255,0,255:47:0:99
chr1 120612042 - TCCGCCGCC 217.0 Pathogenicity 0/1 58 UTR5 NOTCH2:uc001eil.3:c.-22_-21insGGCGGCGGA NOTCH2 . .. . . 0.0 . . . 0.0 0.6022212 1.0 0.8961792
The variant is not left-aligned, resulting in an incorrect position (+5 instead of +1)
chr6 31242257 . accccc acccc 9.9 . INDEL;DP=3;VDB=0.0191;AF1=1;AC1=2;DP4=0,0,0,3;MQ=60;FQ=-43.5 GT:PL:DP:SP:GQ 1/1:49,9,0:3:0:12
chr6 31242262 C - 9.9 Target 1/1 3 INTRONIC HLA-B:uc003ntf.2:intron2:c.344-3137G>- HLA-B . . .. . . . . . 0.0 0.5999689 1.0 0.8939807
Hi @buske, thanks for the link.
The different representation of variants in the input VCF and the output TSV/VCF in Exomiser is something that Jules and I already talked about. Our proposal is to write out the same REF and ALT fields as is in the input VCF file.
The Exomiser and Jannovar do not know about the reference sequence (IIRC for the Exomiser) and thus proper normalization is currently not easily possible. However, this could be added later since the normalization is trivial given the proof and algorithm in your link.
Within the Exomiser, variant trimming is done by Jannovar that first does trimming (not following the algorithm in your link) and then 3' alignment for each transcript as required for the HGVS output. This is based on the mRNA sequence.
Do you think the following is a sensible approach for now?
@holtgrewe Sorry for the slow response. Yes, I think that sounds good. Honestly, however you handle it is fine with me, I just wanted to make sure it was all being discussed so we don't run into erroneous results downstream.
I believe we would be very interested in a little standalone variant normalization tool, as well. :)
Current status on 7.0.0.BETA with Jannovar 0.13 incorporated. (i) Seems Exomiser just outputs what was input i.e. the 11:61165731 C CA variant comes out as that. (ii) The internal representation within Exomiser/Jannovar seems to be pos=61165731, ref="-", alt="A"
Therefore it seems to me that the indels in the frequency table should be normalized before loading into the database so they are correctly loaded. This used to happen but at some stage got broken. I will fix this and then hopefully we are there.
The only gotcha is my normalization should correspond to Jannovar normalization.
I have now tested this fix in 7.0.0.BETA and all seems fine
Here's another variant for testing:
On Exomiser 6.0 this variant gets shifted to a point where it's called as novel, despite the allele frequency of around 30%.
Here's the line from the VCF file:
chr15 100252709 . Ccagcagcagcagcagcagcagcagcagcagcagc Ccagcagcagcagcagcagcagcagcagc 999 . INDEL;DP=521;VDB=0.0390;AF1=0.25;AC1=1;DP4=66,106,47,95;MQ=50;FQ=999;PV4=0.35,1,1,1 GT:PL:GQ 0/1:255,0,255:99
And the line from the Exomiser 6.0 output tsv file:
chr15 100252738 AGCAGC - 999.0 PASS 0/1 521 NON_FS_DELETION MEF2A:uc010bot.3:exon8:c.1052_1057del:p.Q351_Q352del
Which is the same as this variant: http://exac.broadinstitute.org/variant/15-100252709-CCAGCAG-C
It seems that indels are not getting parsed correctly, resulting in the AF and dbSNP lookups to fail.
For example, the VCF file contains:
chr11 61165731 . C CA
This results in the following annotation in the exomiser output:
chr11:g.61165731->A
, which is incorrect. It should be g.6116573_2_The output lists there as being no frequency data, but this is actually rs11382548 with MAF 14%
Not sure how common this is? Can anyone confirm?