Open jdidion opened 12 months ago
Why not use the sequence dictionary (.dict) file? We already perform a lookup if it exists to find annotated circular contigs using the @SQ.TP tag.
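For illustration, a minimal sketch of that kind of lookup, assuming the .dict file is plain SAM header text with tab-separated `TAG:VALUE` fields on each `@SQ` line. The function names `parse_sq_lines` and `circular_contigs` are hypothetical and not taken from the tool.

```python
def parse_sq_lines(text):
    """Parse the @SQ lines of SAM header text into a list of {tag: value} dicts."""
    records = []
    for line in text.splitlines():
        if not line.startswith("@SQ"):
            continue
        fields = line.split("\t")[1:]
        # Split on the first colon only, because values may contain colons
        # themselves (for example UR:file:/path/to/ref.fa).
        records.append(dict(f.split(":", 1) for f in fields if ":" in f))
    return records


def circular_contigs(records):
    """Return the names of contigs whose TP tag is 'circular'.

    A missing TP tag is treated as linear, which is the default in the SAM spec.
    """
    return {r["SN"] for r in records if r.get("TP", "linear") == "circular"}
```

Usage: `circular_contigs(parse_sq_lines(open(path).read()))`, where `path` points at the .dict file.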
I am tempted to do this by convention: for example, use the @SQ.AS tag (set to the canonical plasmid/vector name) to identify the contigs in a linear chain, with the @SQ.TP tag set to linear for all but the last contig in the chain (which is then set to circular). But this feels against the intent of SAM.
There's no reason we cannot instead have lower case tags to denote this information. Thoughts?
That could work.
I am okay with stipulating that at least one of the following must be true for entries in the dict with the same AS:
Circularity is a bit less clear. We could assume that if any contig in the construct has TP:circular then the entire construct is considered circular. Alternatively, we could introduce a custom tag to denote a circular construct, and/or provide an option to specify the AS of any constructs in the dict to be considered circular.
We could also just have “current index” and “next index”?
That works - takes care of both ordering and circularity.
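A rough sketch of how the index pair could be resolved, assuming two hypothetical lowercase tags on each @SQ entry: `ci` (current index) and `ni` (next index). Neither tag name was agreed on in this thread. The sketch also assumes a linear construct ends at a contig with no `ni`, and a circular construct's last contig points back to the first.

```python
def resolve_construct(contigs):
    """Order the contigs of one construct and decide whether it is circular.

    `contigs` is a list of dicts with keys 'SN', 'ci' (int) and, optionally,
    'ni' (int). 'ci' and 'ni' are hypothetical custom tags.
    Returns (ordered_contig_names, is_circular).
    """
    by_index = {c["ci"]: c for c in contigs}
    if len(by_index) != len(contigs):
        raise ValueError("duplicate current index within construct")

    targets = {c["ni"] for c in contigs if c.get("ni") is not None}
    # A linear chain starts at the one contig that nothing points to.
    # A ring has no such contig, so start at the lowest index instead.
    starts = [i for i in by_index if i not in targets]
    if len(starts) > 1:
        raise ValueError("construct has more than one starting contig")
    start = starts[0] if starts else min(by_index)

    order, seen, cur = [], set(), start
    while cur is not None and cur not in seen:
        if cur not in by_index:
            raise ValueError(f"next index {cur} does not match any contig")
        seen.add(cur)
        order.append(by_index[cur]["SN"])
        cur = by_index[cur].get("ni")

    if len(order) != len(contigs):
        raise ValueError("contigs do not form a single chain")
    if cur is None:
        return order, False  # ran off the end: linear construct
    if cur != start:
        raise ValueError("chain loops back to a contig other than the first")
    return order, True  # walked back to the start: circular construct
```

With this layout the ordering comes from following `ni`, and circularity falls out of whether the walk returns to its starting contig, so no separate circular flag is needed.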
I do think we should also have a default interpretation when there are multiple sequences with the same AS and there are no custom tags. I suggest assuming that the contigs are ordered as they appear, and that the construct is circular if any sequence is annotated as circular. Either that or show a warning that the sequences will be treated as independent despite having the same AS.
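The suggested default could look roughly like the sketch below. It assumes each @SQ entry has already been parsed into a dict of tag values, that "ordered" means the order of appearance in the dict file, and that entries without an AS tag are left as independent contigs. The function name `default_constructs` is hypothetical.

```python
def default_constructs(records):
    """Group @SQ records by AS, keeping file order within each group.

    `records` is a list of {tag: value} dicts, one per @SQ line.
    Returns {AS value: {"contigs": [names in file order], "circular": bool}}.
    A construct is marked circular if any of its contigs has TP:circular.
    """
    constructs = {}
    for rec in records:
        name = rec.get("AS")
        if name is None:
            continue  # no AS tag: treat the contig as independent
        entry = constructs.setdefault(name, {"contigs": [], "circular": False})
        entry["contigs"].append(rec["SN"])
        if rec.get("TP") == "circular":
            entry["circular"] = True
    return constructs
```

The warning alternative would use the same grouping and then report every AS value that has more than one contig instead of linking them.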
Often, multiple contigs in the reference fasta will belong to the same construct (such as a viral vector or plasmid). It would be nice if there were a way to describe the ordering of the component contigs in a construct, and annotate a construct as circular (so that jumps from the end of the last contig to the beginning of the first would have zero cost).
I can think of a number of ways to do this (comma-delimited list of contigs on the command line; custom format text or JSON file) but would prefer a standardized format if one exists. Maybe GFA?