exomiser / svart

API for use in representing and manipulating genomic variation
7 stars 1 forks source link

Issues with DefaultVariant and non-trimmed VCF records #29

Closed ielis closed 3 years ago

ielis commented 3 years ago

I noticed a bug when running svart-graded Squirls benchmark on a VCF file.

When annotating a VCF record:

chr1    225725424       .       CTT     C,CT    258.06  FreqFilter      AC=1,1;AF=0.500,0.500;AN=2;DP=7;ExcessHet=3.0103;FS=0.000;MAX_FREQ=3.1360424;MLEAC=1,1;MLEAF=0.500,0.500;MQ=60.00;QD=29.21;SOR=0.941    GT:AD:DP:GQ:PL  1/2:0,4,3:7:58:275,76,58,106,0,85

Svart threw an exception:

Caused by: java.lang.IllegalArgumentException: java.lang.IllegalArgumentException: Illegal DEL changeLength:-1. Does not match expected -2 given coordinates 1 225725424 225725426 CTT  CT
        at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
...

An easy way how to replicate the error is to create the following test in the DefaultVariantTest:

@Test
public void test() {
    Contig chr1 = TestContig.of(1, 249_250_621);
    DefaultVariant variant = DefaultVariant.oneBased(chr1, "", 225_725_424, "CTT", "CT");
    // throws an IllegalArgumentException
}

This looks like a problem with allele normalization/trimming and we might need rework the change length checks.

julesjacobsen commented 3 years ago

Exomiser and Jannovar both solve this problem by trimming the input first. Naturally they trim in different directions...

Exomiser shifts left (i.e. trims from the right) as per the VCF/VT spec and Jannovar shifts right as per HGVS recommendations. I recommend using the VCF/VT left shift as this is what's used by the McArthur lab for trimming gnomAD data too.

From Exomiser:

https://github.com/exomiser/Exomiser/blob/master/exomiser-core/src/main/java/org/monarchinitiative/exomiser/core/model/AllelePosition.java#L32-L34

@pnrobinson anything to recommend?

pnrobinson commented 3 years ago

Difficult issue, but probably for most use cases the VCF/VT shift is what we want, so I would vote for that. We should also examine some edge cases in Jannovar and maybe write a few tests?

julesjacobsen commented 3 years ago

Thanks @pnrobinson .

Here's a test case which means thought is required!

in svart the strand is used indicate on which strand the region/variant is located and the coordinates on that strand as opposed to absolute genomic coordinates. So for variants provided on the POSITIVE strand they need to be LEFT shifted and those provided on the NEGATIVE strand need to be RIGHT shifted. For example:

// chr1    225725424       .       CTT     C,CT    258.06  FreqFilter      .    GT:AD:DP:GQ:PL  1/2:0,4,3:7:58:275,76,58,106,0,85
// Given a position on the POSITIVE strand and trimmed as per VCF i.e. shifted to the LEFT
VariantTrimmer.AllelePosition trimmed = VariantTrimmer.leftShift(225725424, "CTT", "CT", VariantTrimmer.retainingCommonBase());
Variant variant = DefaultVariant.oneBased(TestContigs.chr1, trimmed.start(), trimmed.ref(), trimmed.alt());
System.out.println(variant);
System.out.println(variant.toNegativeStrand());

// If the position and sequence is provided on the NEGATIVE strand a RIGHT shift will be needed such that the result can be correctly 'flipped'
VariantTrimmer.AllelePosition trimmedNeg = VariantTrimmer.rightShift(23525196, "AAG", "AG", VariantTrimmer.retainingCommonBase());
Variant variantNeg = DefaultVariant.of(TestContigs.chr1, "", Strand.NEGATIVE, CoordinateSystem.ONE_BASED, Position.of(trimmedNeg.start()), trimmedNeg.ref(), trimmedNeg.alt());
System.out.println(variantNeg);
System.out.println(variantNeg.toOppositeStrand());

outputs

Variant{contig=1, strand=+, coordinateSystem=ONE_BASED, startPosition=225725424, endPosition=225725425, ref='CT', alt='C', variantType=DEL, length=2, changeLength=-1}
Variant{contig=1, strand=-, coordinateSystem=ONE_BASED, startPosition=23525197, endPosition=23525198, ref='AG', alt='G', variantType=DEL, length=2, changeLength=-1}

Variant{contig=1, strand=-, coordinateSystem=ONE_BASED, startPosition=23525197, endPosition=23525198, ref='AG', alt='G', variantType=DEL, length=2, changeLength=-1}
Variant{contig=1, strand=+, coordinateSystem=ONE_BASED, startPosition=225725424, endPosition=225725425, ref='CT', alt='C', variantType=DEL, length=2, changeLength=-1}
julesjacobsen commented 3 years ago

Update - @ielis, @pnrobinson this is something quite crucial which we've overlooked/over-constrained.

Jannovar uses zero-based coordinates, which is fine and we can handle, but crucially it allows empty regions (see - I knew this would come back to haunt us) and empty bases.

Here are two examples where things fall apart - insertions and deletions where the inserted/deleted base has been removed after trimming

VCF/VT trimming strategy (left shift, retain padding)

118887583, TCAAAA, TCAAAACAAAA -> 118887583, T, TCAAAA

HGVS/Jannovar trimming strategy (right shift, remove padding)

118887583, TCAAAA, TCAAAACAAAA -> 118887589, '', CAAAA

The problems really start when trying to convert from zero to one-based - what is a one-based empty base position? Is it simply equivalent to a zero based one?

julesjacobsen commented 3 years ago

Further thoughts on this based on the GFF3 spec.

Columns 4 & 5: "start" and "end"

The start and end coordinates of the feature are given in positive 1-based integer coordinates, relative to the landmark given in column one. Start is always less than or equal to end. For features that cross the origin of a circular feature (e.g. most bacterial genomes, plasmids, and some viral genomes), the requirement for start to be less than or equal to end is satisfied by making end = the position of the end + the length of the landmark feature.

For zero-length features, such as insertion sites, start equals end and the implied site is to the right of the indicated base in the direction of the landmark.

Using this definition would work - empty zero-based position 99 (left of base) == empty one-based 100 (right of base) so these should be able to be inter-converted, only it's not possible to add the padded base back again for example when going from Jannovar -> Exomiser. This is currently the case and not an issue as I keep the original padded variant.

ielis commented 3 years ago

I think that we are talking about 2 concepts here which we need to separate:

If deletion is the genomic event we want to represent, we might do that as GC>G, or C>.

We might be tempted to think that:

but I believe, that this is only possible because there are some conventions and not because the simplest set of rules would work that out. A shiny example is the statement from GFF3 specs: ... start equals end and the implied site is to the right of the indicated base in the direction of the landmark ....

So, in contrast with the conventions and standards, I believe that it is not possible to represent any genomic event in any coordinate system. I.e., coordinates of an event that span an empty region can be represented neither using 0-based nor 1-based coordinates, but only using coordinate system where both endpoints are open.

I am not sure how do you represent 1 bp long feature in GFF3?

If we cannot create a sexy model without special rules and assumptions, let's at least create a sexy model, and if we cannot avoid having special rules (e.g. to create an empty region in 1-based coordinates, you set start to ... and end to ...), let's encapsulate them into as few places as possible. I think we actually achieved this with the library.

So, with all this, I think we can drop the constraint that prevents us to have empty regions :crossed_fingers: