Closed ielis closed 4 years ago
Is there a good use case for wanting to be able to do this? The real intention of withStrand
is to be able to normalise to a positive or negative strand (as in Jannovar which only has + or -). Converting a position on a POSITIVE
strand of a chromosome to UNKNOWN
or UNSTRANDED
and then to NEGATIVE
seems odd.
withStrand
could simply return a new instance with the strand specified. In the case of + or - the new instances should be the reverse complement. Requesting a . or ? will result in a copy with just the strand changed to the requested one.
i.e.
@Override
public SequenceVariant withStrand(Strand other) {
if (strand == other) {
return this;
}
if (other== UNKNOWN || other== UNSTRANDED) {
return new SequenceVariant(contig, id, other, coordinateSystem, start, end, ref, alt);
}
if (strand.isComplementOf(other)) {
Position start = startPosition.invert(contig, coordinateSystem);
Position end = endPosition.invert(contig, coordinateSystem);
return new SequenceVariant(contig, id, other, coordinateSystem, end, start, Seq.reverseComplement(ref), Seq.reverseComplement(alt));
}
}
Converting a position on a POSITIVE strand of a chromosome to UNKNOWN or UNSTRANDED and then to NEGATIVE indeed seems odd, however, it is possible. I think we should write the code in a way, which would prevent mishaps.
When using withStrand
as defined above, then after running:
SequenceVariant variant = ... // assume positive strand at the beginning
SequenceVariant converted = variant.withStrand(Strand.UNSTRANDED).withStrand(Strand.POSITIVE);
the converted.strand() == Strand.POSITIVE
, while coordinates are in fact on Positive.NEGATIVE due to invert
statements. This is wrong, the user has however no chance to find that the coordinates became invalid.
In the case of the code I proposed, converted.strand() == Strand.UNSTRANDED
and coordinates remain as defined at the beginning. Converting from UNSTRANDED to POSITIVE was not possible. This is still not ideal, but at least there is a chance to see that something went wrong.
I was giving @pnrobinson a quick tour of the variant-api today and he was suggesting that it might not be sensible including UNKNOWN
or UNSTRANDED
. It would certainly simplify a lot of the current code to to with strandedness.
@pnrobinson can you explain your argument again?
OK will try to summarise - essentially the +/- is important and for our purposes the only relevant options. In cases where there are only single stranded contigs the strand can be considered + by convention and in cases where there is an insertion from somewhere it can be considered as being on the same strand it is inserted into. Features with ? or . strands should be ignored. For example these are not present in the gff/gtf of Ensembl / MANE transcript data.
It seems there are two meanings of "strand"
It seems that we can always determine the administrative strand in the context of Jannovar/Exomiser -- by the time we get the data, somebody has decided how the sequence is oriented, and if we do not know this, we cannot actually proceed with the analysis.
There are many biological entities where the strand is not important (e.g., enhancer) or unknown (e.g., CHIP-seq peak). But this does not mean that the administrative strand is unknown.
I think there are a few possible exceptions to this in some of the BNDs, e.g. for insertions, where we just get a raw sequence. However, it seems easiest to say that the indicated sequence is PLUS strand, and the reverse complement of the sequence would be MINUS.
No longer an issue as we have removed . and ?
We need to specify what happens when e.g.
I suggest that we specify a concept like conversion legality. If the conversion is legal, then a new object (position, region, variant, etc.) is returned. When attempting to perform an illegal conversion,
this
is returned.In general, it is possible to convert a type from a more specific
Strand
to e.g.Strand.UNKNOWN
.The table specifies the conversion legality rules:
The #3 contains an example implementation of
withStrand
inGenomicPositionDefault
.