exomiser / svart

API for use in representing and manipulating genomic variation
7 stars 1 forks source link

Add BaseBreakendVariant #21

Closed julesjacobsen closed 3 years ago

julesjacobsen commented 3 years ago

@ielis didn't you need to implement your own version of a BreakendVariant? This would have been useful right?

ielis commented 3 years ago

Yes, I resorted to implementing BreakendedSvannaVariant from scratch. The class implements SvannaVariant, Breakended, where SvannaVariant interface extends Variant and some other interfaces that add bits specific to SvAnna.

I think having BaseBreakendedVariant would be useful.

julesjacobsen commented 3 years ago

OK will add one. Will need one for Exomiser too...

julesjacobsen commented 3 years ago

In progress on this one. So far I've made a few changes such as:

@ielis I have a question which I want to double-check something. Can you confirm that the trailingRef variable is never used? From what I can see it's only ever set as an empty string in the constructor, so its a bit of legacy code which has been replaced with the existence of the VcfBreakendFormatter and soon the VcfBreakendVariantParser (issue #36)

/**
 * When variant is converted to the opposite strand, then the reverse complemented ref allele is stored here.
 * We need to store the allele in order to reconstruct the original ref allele if variant is converted again to the
 * original strand.
 */
private final String trailingRef;

@Override
public DefaultBreakendVariant toOppositeStrand() {
    return new DefaultBreakendVariant(eventId, right.toOppositeStrand(), left.toOppositeStrand(),
            Seq.reverseComplement(trailingRef), Seq.reverseComplement(ref), Seq.reverseComplement(alt));
}

works like this:

org.monarchinitiative.variant.api.impl.DefaultBreakendVariantTest,toOppositeStrand

// 13  123457  bnd_X   A   [17:198983[A    6   PASS    SVTYPE=BND;MATEID=bnd_Z;EVENT=tra3

Original: BreakendVariant{eventId='tra3', left=Breakend{contig=13, id='bnd_X', strand=-, coordinateSystem=FULLY_CLOSED, startPosition=115046422, endPosition=115046422}, right=Breakend{contig=17, id='bnd_Z', strand=+, coordinateSystem=FULLY_CLOSED, startPosition=198983, endPosition=198983}, ref=A, alt=''}
Opposite: BreakendVariant{eventId='tra3', left=Breakend{contig=17, id='bnd_Z', strand=-, coordinateSystem=FULLY_CLOSED, startPosition=83058459, endPosition=83058459}, right=Breakend{contig=13, id='bnd_X', strand=+, coordinateSystem=FULLY_CLOSED, startPosition=123457, endPosition=123457}, ref=, alt=''}

Is it still needed, or could be replaced with a constant empty string? I suspect this probably isn't always the case, for example if a toOppositeStranded BreakendVariant is copied e.g.

BreakendVariant transformed = BreakendVariant.builder().with(original.toOppositeStrand()).build().toOppositeStrand()

transformed should equal original

julesjacobsen commented 3 years ago

Just a note- might be easier to add a sequence to the Breakends, so it is always attached to the correct end and store the inserted sequence, if any, on the BreakendVariant although this is what the alt field currently does.

julesjacobsen commented 3 years ago

See commit: 967ff30

ielis commented 3 years ago

Hi @julesjacobsen , the trailingRef field is used in toOppositeStrand() method, where it is passed instead of ref. It's not a legacy code.

When creating a breakend variant from a VCF record, we have base sequence for ref allele for left breakend only. In case we convert the breakend variant to the opposite strand, we must set ref to an empty string, since we do not have the base sequence for the opposite strand. If we want to convert the variant back to the original strand, we will miss the ref allele if we do not store it. That's why we store the ref allele into trailingRef field and use the field during the strand conversion.

julesjacobsen commented 3 years ago

@ielis, it's really just a bit of a mess which is why I dislike it. This is related to the question in issue #39. The problem arises due to the inclusion of the 'left' ref base, but not that on the 'right', so when flipping the strands you end up losing information.

The solutions are:

  1. Keep a reference to the original and only show it when in the original configuration (current, moderate wtf score)
  2. Trim away the ref allele and only include any inserted sequence if present leading to either two empty ends or an empty end and an inserted end. Downside here is that it depends on people doing this and is inconsistent with the presentation of all other Variant types)
  3. Just return this when trying to do a BreakendVariant.toOppositeStrand(). This way nothing changes, and nothing is screwy. The reason being if you're to assess whether a breakend variant overlaps a region (i.e. disrupts the region), you'll need to check each Breakend independently by calling breakend.toStrand(targetRegion.strand()). This is my preferred solution and I think has the lowest wtf score.
ielis commented 3 years ago

I agree that the solution 2. is not a good one as we should not expect the users to do much in order to make svart's classes work best.

Let's go for the solution 3 then, we can check the breakends independently.

julesjacobsen commented 3 years ago

Great - will do 3 is also easiest time implement ;)

I think adding a Variant.isBreakend() method will help with this. Will need to check that isSymbolic() != isBreakend()

julesjacobsen commented 3 years ago

Unfortunately BreakendVariant.isSymbolic() == true so will need to have a think about what defines 'symbolic' as opposed to breakend/non-symbolic. In VCF parlance I think a breakend is symbolic, so we'll need to think about wether to change the definition here.

It's possible to do a simple instanceof test, but these always feel a bit out of place.