fulcrumgenomics / stitch

Stitch is a toolkit for analysis of chimeric reads in sequencing data
MIT License

Ability to align against a multi-contig construct #56

Open jdidion opened 12 months ago

jdidion commented 12 months ago

Often, multiple contigs in the reference FASTA will belong to the same construct (such as a viral vector or plasmid). It would be nice if there were a way to describe the ordering of the component contigs in a construct, and to annotate a construct as circular (so that jumps from the end of the last contig to the beginning of the first would have zero cost).

I can think of a number of ways to do this (comma-delimited list of contigs on the command line; custom format text or JSON file) but would prefer a standardized format if one exists. Maybe GFA?

nh13 commented 12 months ago

Why not use the sequence dictionary (.dict) file? If it exists, we already read it to find contigs annotated as circular via the @SQ.TP tag.

I am tempted to do this by convention. For example, we could use the @SQ.AS tag to identify the contigs in a linear chain (its value being the canonical plasmid/vector name), with the @SQ.TP tag set to linear for every contig in the chain except the last, which would be set to circular. But this feels against the intent of SAM.
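
To make the convention concrete, here is a minimal Python sketch of how such a dictionary might be read. It is not taken from Stitch; the contig names (`vec_a`, `vec_b`), the construct name (`my_vector`), and the helper functions are all made up for illustration. Only the `@SQ` tags `SN`, `LN`, `AS`, and `TP` come from the SAM header format.

```python
def parse_sq_line(line):
    """Parse one @SQ header line into a dict of tag -> value."""
    fields = line.rstrip("\n").split("\t")
    if fields[0] != "@SQ":
        raise ValueError(f"not an @SQ line: {line!r}")
    # Each remaining field is TAG:VALUE; split on the first colon only.
    return dict(field.split(":", 1) for field in fields[1:])


def group_by_assembly(dict_text):
    """Group @SQ records by their AS tag, preserving file order."""
    groups = {}
    for line in dict_text.splitlines():
        if not line.startswith("@SQ"):
            continue
        sq = parse_sq_line(line)
        if "AS" in sq:  # contigs without AS are not part of any construct
            groups.setdefault(sq["AS"], []).append(sq)
    return groups


def chain_is_circular(records):
    """Convention sketched above: only the last contig carries TP:circular."""
    return records[-1].get("TP", "linear") == "circular"


# Hypothetical sequence dictionary with one two-contig construct.
EXAMPLE = "\n".join([
    "@HD\tVN:1.6",
    "@SQ\tSN:chr1\tLN:1000",
    "@SQ\tSN:vec_a\tLN:300\tAS:my_vector\tTP:linear",
    "@SQ\tSN:vec_b\tLN:200\tAS:my_vector\tTP:circular",
])
groups = group_by_assembly(EXAMPLE)
```

Here `groups["my_vector"]` holds the records for `vec_a` and `vec_b` in file order, and `chain_is_circular` inspects only the final record, which is exactly where the convention overloads the meaning of `TP`.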

There's no reason we cannot instead use lowercase (custom) tags to denote this information. Thoughts?

jdidion commented 12 months ago

That could work.

I am okay with stipulating that at least one of the following must be true for entries in the dict with the same AS:

  1. They appear in the dict in the same order as they appear in the construct
  2. They are all annotated with a custom tag indicating their index within the construct
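
A rough Python sketch of how these two options could be checked for one group of same-AS entries. The tag name `ci` is a placeholder, not an agreed-upon tag, and the function is hypothetical rather than Stitch code.

```python
def order_construct(records, index_tag="ci"):
    """Order the contigs of one construct.

    records: list of dicts, one per @SQ entry sharing an AS, in file order.
    index_tag: hypothetical custom tag holding the index within the construct.
    """
    tagged = [r for r in records if index_tag in r]
    if not tagged:
        # Option 1: no custom tag anywhere, so trust the file order.
        return list(records)
    if len(tagged) != len(records):
        raise ValueError("either all or none of the contigs must carry the index tag")
    indices = [int(r[index_tag]) for r in records]
    if len(set(indices)) != len(indices):
        raise ValueError("duplicate index within a construct")
    # Option 2: every contig is tagged, so sort by the declared index.
    return sorted(records, key=lambda r: int(r[index_tag]))
```

Rejecting a partially tagged group keeps the two options mutually exclusive, which seems to match "at least one of the following must be true" without having to guess where untagged contigs belong.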

Circularity is a bit less clear. We could assume that if any contig in the construct has TP:circular then the entire construct is considered circular. Alternatively, we could introduce a custom tag to denote a circular construct, and/or provide an option to specify the AS of any constructs in the dict to be considered circular.

nh13 commented 12 months ago

We could also just have “current index” and “next index”?

jdidion commented 12 months ago

That works - takes care of both ordering and circularity.
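
For illustration, a small Python sketch of how "current index" and "next index" could yield both the ordering and the circularity. The tag names `ci` and `ni` are placeholders invented here, and the sketch assumes the last contig of a linear construct simply omits `ni`.

```python
def resolve_chain(records, cur_tag="ci", next_tag="ni"):
    """Return (ordered contig names, is_circular) for one construct.

    records: list of dicts with SN, a current-index tag, and an optional
    next-index tag (both tag names are hypothetical).
    """
    by_index = {int(r[cur_tag]): r for r in records}
    if len(by_index) != len(records):
        raise ValueError("duplicate current index")
    next_of = {int(r[cur_tag]): int(r[next_tag]) for r in records if next_tag in r}

    # A linear chain has exactly one head: an index nobody points to.
    targets = set(next_of.values())
    heads = [i for i in by_index if i not in targets]
    if len(heads) > 1:
        raise ValueError("more than one chain start")
    # No head means every contig is pointed to, i.e. a cycle; start anywhere.
    start = heads[0] if heads else min(by_index)

    order, seen, i = [], set(), start
    while i is not None and i not in seen:
        if i not in by_index:
            raise ValueError(f"next index {i} does not exist")
        seen.add(i)
        order.append(by_index[i]["SN"])
        i = next_of.get(i)

    if i is not None and i != start:
        raise ValueError("chain loops back to the middle of the construct")
    if len(order) != len(records):
        raise ValueError("not all contigs are reachable from the chain start")
    # The walk ended either at a missing next index (linear) or back at start.
    return order, i == start
```

With this reading, a circular construct is just one whose last contig names the first as its next index, so no separate circularity tag is needed.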

I do think we should also have a default interpretation when there are multiple sequences with the same AS and there are no custom tags. I suggest assuming that the contigs are ordered, and that the construct is circular if any sequence is annotated as circular. Either that, or show a warning that the sequences will be treated as independent despite having the same AS.
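
The two fallback behaviours could look roughly like this in Python. This is only a sketch of the proposal; the function name, the `assume_ordered` switch, and the returned structure are invented for the example.

```python
import warnings


def default_constructs(records, assume_ordered=True):
    """Interpret same-AS contigs that carry no custom tags.

    records: list of dicts parsed from @SQ lines, in file order.
    Returns a dict mapping AS value -> {"contigs": [...], "circular": bool}.
    """
    groups = {}
    for r in records:
        if "AS" in r:
            groups.setdefault(r["AS"], []).append(r)

    constructs = {}
    for name, members in groups.items():
        if len(members) < 2:
            continue  # a single contig needs no construct-level handling
        if not assume_ordered:
            # Alternative behaviour: warn and leave the contigs independent.
            warnings.warn(
                f"{len(members)} sequences share AS:{name} but will be "
                "treated as independent contigs"
            )
            continue
        constructs[name] = {
            # Default: file order is the construct order.
            "contigs": [m["SN"] for m in members],
            # Default: any TP:circular member makes the construct circular.
            "circular": any(m.get("TP") == "circular" for m in members),
        }
    return constructs


# Hypothetical records, as they might be parsed from a .dict file.
RECORDS = [
    {"SN": "vec_a", "AS": "my_vector", "TP": "circular"},
    {"SN": "vec_b", "AS": "my_vector"},
    {"SN": "chr1"},
]
```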