exomiser / svart

API for use in representing and manipulating genomic variation
7 stars 1 forks source link

Define complete behaviour for `withStrand` with respect to Strand.UNSTRANDED and Strand.UNKNOWN #9

Closed ielis closed 4 years ago

ielis commented 4 years ago

We need to specify what happens when e.g.

GenomicPosition pos = GenomicPosition.oneBased(ctg1, Strand.POSITIVE, Position.of(4));
pos = pos.withStrand(Strand.UNKNOWN).withStrand(Strand.NEGATIVE);

I suggest that we specify a concept like conversion legality. If the conversion is legal, then a new object (position, region, variant, etc.) is returned. When attempting to perform an illegal conversion, this is returned.

In general, it is possible to convert a type from a more specific Strand to e.g. Strand.UNKNOWN.

The table specifies the conversion legality rules:

CurrentTargetLegal
POSITIVEPOSITIVEYes
NEGATIVEYes
UNSTRANDEDYes
UNKNOWNYes
NEGATIVEPOSITIVEYes
NEGATIVEYes
UNSTRANDEDYes
UNKNOWNYes
UNSTRANDEDPOSITIVENo
NEGATIVENo
UNSTRANDEDYes
UNKNOWNYes
UNKNOWNPOSITIVENo
NEGATIVENo
UNSTRANDEDNo
UNKNOWNYes

The #3 contains an example implementation of withStrand in GenomicPositionDefault.

julesjacobsen commented 4 years ago

Is there a good use case for wanting to be able to do this? The real intention of withStrand is to be able to normalise to a positive or negative strand (as in Jannovar which only has + or -). Converting a position on a POSITIVE strand of a chromosome to UNKNOWN or UNSTRANDED and then to NEGATIVE seems odd.

withStrand could simply return a new instance with the strand specified. In the case of + or - the new instances should be the reverse complement. Requesting a . or ? will result in a copy with just the strand changed to the requested one.

i.e.

    @Override
    public SequenceVariant withStrand(Strand other) {
        if (strand == other) {
            return this;
        }
        if (other== UNKNOWN || other== UNSTRANDED) {
            return new SequenceVariant(contig, id, other, coordinateSystem, start, end, ref, alt);
        }

        if (strand.isComplementOf(other)) {
            Position start = startPosition.invert(contig, coordinateSystem);
            Position end = endPosition.invert(contig, coordinateSystem);
            return new SequenceVariant(contig, id, other, coordinateSystem, end, start, Seq.reverseComplement(ref), Seq.reverseComplement(alt));
        }
    }
ielis commented 4 years ago

Converting a position on a POSITIVE strand of a chromosome to UNKNOWN or UNSTRANDED and then to NEGATIVE indeed seems odd, however, it is possible. I think we should write the code in a way, which would prevent mishaps. When using withStrand as defined above, then after running:

SequenceVariant variant  = ... // assume positive strand at the beginning
SequenceVariant converted = variant.withStrand(Strand.UNSTRANDED).withStrand(Strand.POSITIVE);

the converted.strand() == Strand.POSITIVE, while coordinates are in fact on Positive.NEGATIVE due to invert statements. This is wrong, the user has however no chance to find that the coordinates became invalid.

In the case of the code I proposed, converted.strand() == Strand.UNSTRANDED and coordinates remain as defined at the beginning. Converting from UNSTRANDED to POSITIVE was not possible. This is still not ideal, but at least there is a chance to see that something went wrong.

julesjacobsen commented 4 years ago

I was giving @pnrobinson a quick tour of the variant-api today and he was suggesting that it might not be sensible including UNKNOWN or UNSTRANDED. It would certainly simplify a lot of the current code to to with strandedness.

@pnrobinson can you explain your argument again?

julesjacobsen commented 4 years ago

OK will try to summarise - essentially the +/- is important and for our purposes the only relevant options. In cases where there are only single stranded contigs the strand can be considered + by convention and in cases where there is an insertion from somewhere it can be considered as being on the same strand it is inserted into. Features with ? or . strands should be ignored. For example these are not present in the gff/gtf of Ensembl / MANE transcript data.

pnrobinson commented 4 years ago

It seems there are two meanings of "strand"

  1. administrative -- basically, "we have agreed to count from this end or that end of the molecule
  2. biological -- this gene is transcribed in this direction or that direction.

It seems that we can always determine the administrative strand in the context of Jannovar/Exomiser -- by the time we get the data, somebody has decided how the sequence is oriented, and if we do not know this, we cannot actually proceed with the analysis.

There are many biological entities where the strand is not important (e.g., enhancer) or unknown (e.g., CHIP-seq peak). But this does not mean that the administrative strand is unknown.

I think there are a few possible exceptions to this in some of the BNDs, e.g. for insertions, where we just get a raw sequence. However, it seems easiest to say that the indicated sequence is PLUS strand, and the reverse complement of the sequence would be MINUS.

julesjacobsen commented 4 years ago

No longer an issue as we have removed . and ?