ga4gh / vrs

Extensible specification for representing and uniquely identifying biological sequence variation
https://vrs.ga4gh.org
Apache License 2.0
80 stars 32 forks source link

Problems with normalisation of unphased variants #505

Closed d-cameron closed 1 month ago

d-cameron commented 3 months ago

If I have two nearby unphased indel variants it is unclear how they are supposed to be normalised.

Following https://vrs.ga4gh.org/en/stable/impl-guide/normalization.html, I get to the step 3a: is equal to the base preceding

How is this defined for the Alternate Allele Sequence?

Defining it as preceding reference sequence is problematic as if If I have phased A-T and an A insertion in a poly-A region, normalisation is lossy as it removes the information about whether the T occurs before or after the A ins.

Defining it as the preceding alt haplotype base is also problematic as 1) if there's a nearby unphased variant then the alt haplotype sequence is not known at that position and 2) this forces nearby variants to be merged into a single variant or (if the flanking variant causes the extension to stop) 3) changes the isolated normalised representation of the variant (e.g. the A INS no longer covers the full poly-A sequence, but if you renormalise it in isolation it gets expanded to the full poly-A).

How should an implementation normalise such variants?

andreasprlic commented 3 months ago

Hi Daniel, I believe the intention behind normalization is to avoid being overly-precise when referring to variants. While it is not exactly stated, we should also not become lossy and drop information. Would it be possible to provide the coordinates and ref/alt for the example you are describing? Then it is easier to follow / dig into the details... Thanks!

d-cameron commented 2 months ago

REF: TAAAAAAAT Variant 1: A>T @ position 3 Variant 2: A>AA @ position 5

How should Variant 2 be normalised? The bounds would be pos1 but Variant 1 is in the way and may or may not change the alt sequence (depends on cis/trans phasing)

d-cameron commented 2 months ago

If the variants are cis, the normalisation algorithm results in loss of information as it can be reconstructed as; TATAAAAAAT (correct) or TAATAAAAAT (wrong)

andreasprlic commented 2 months ago

The way I would normalize this is to create the full ref and full alt, and then run through normalize. The problem with your example is that the two variants both go away and get replaced with a single insertion of a T (at 2 in interbase coords).

Would a variant-caller detect this as a single insT allele? The two alleles in cis might not be detectable? I assume there is a background story why you would like to report this as a haplotype of the two composite alleles?

Here a unit test that reproduces this. This is basically what vrs-python is doing at its core during normalization:

from bioutils.normalize import normalize

def test_shuffling():

    ref = 'TAAAAAAAT'
    alt = 'TATAAAAAAT'

    chrom_seq = ref
    start = 0
    end = len(ref)
    shuffle_direction = 'EXPAND'

    shuffled_interval, shuffled_alleles = normalize(
            chrom_seq, interval=(start, end), alleles=(None, alt), mode=shuffle_direction
        )

    assert shuffled_interval == (2,2)
    assert shuffled_alleles[0] == ''
    assert shuffled_alleles[1] == 'T'
d-cameron commented 2 months ago

If the variants are unphased the two haplotypes are:

TAAAAAAAT/TATAAAAAAT (ref/both) or TATAAAAAT/TAAAAAAAAT (first/second)

create the full ref and full alt

In this context, what do you mean by 'full ref' and 'full alt'? If the variants aren't phased, we don't know what the alt haplotype actually is.

d-cameron commented 2 months ago

My point is that the normalisation procedure doesn't state whether the variant should be extended with the ref or the alt. If it's the ref then normalisation loses information about the relative position of variants. If it's the alt then there's an implicit requirement to both report and merge all variants in potentially-normalising variant positions - something that many data sources do not do. There's also the problem of what to do with unphased variants.

Would a variant-caller detect this as a single insT allele? The two alleles in cis might not be detectable? I assume there is a background story why you would like to report this as a haplotype of the two composite alleles?

The typical use case I encounter this issue is is with VNTRs/STRs with imperfect repeats in the reference (e.g. ACACACATACAC) but an expanded/contracted repeat in sample (e.g. T>C, insAC), or vice versa. These get reported in VCF as separate variants. The imperfect reference version of this issue is interesting because making the repeat perfect (i.e. the T>C) actually widens interval over which the other variant could have biologically occurred (i.e. the AC could be have been inserted anywhere in the repeat) - a behaviour that complicates evolutionary tree reconstruction.

andreasprlic commented 2 months ago

Reading your comment about repeats reminds me of the challenges with over-precision in the representation of variants in repeats. By left or right shuffling them, we pick one of several possible alternative alignments of how reads can map to the reference. However it is not possible to identify what specific nucleotide was inserted or deleted and there are multiple alternative solutions. Referring to the whole region of ambiguity can help with several applications and I do wonder if a fully-justified representation of the STRs would help with the evolutionary tree reconstruction as well.

In this context, what do you mean by 'full ref' and 'full alt'?

I meant to represent the haplotype of the two variants, in a fully justified representation.

You are doing that by this:

TAAAAAAAT/TATAAAAAAT (ref/both) or TATAAAAAT/TAAAAAAAAT (first/second)

Now the problem is that a variant caller (and our normalization) would call the (both) haplotype as an insT, so it is no longer a haplotype and we are dealing with a single allele. That might make this representation less useful for evolutionary tree reconstruction. In this case, "expanding" the region around the alt again (going back to the the TATAAAAAAT description), and using the 4 alleles as you described them ^^^ might actually be what is most useful? (I note the that start/end coordinates for the ref representation would be identical for all 4 variants and could perhaps be an anchor for aligning the sequences? just a wild guess from my side here).