biosails / pheniqs

Fast and accurate sequence demultiplexing

support for passthrough auxiliary tags #2

Closed moonwatcher closed 2 years ago

moonwatcher commented 6 years ago

Pheniqs manipulates some of the SAM auxiliary tags during demultiplexing, but there is still the question of how to handle pre-existing tags when the input is SAM.

Demultiplexing previously processed SAM files can benefit from carrying auxiliary tags over from input to output. During demultiplexing, input read segments are rearranged into new output read segments. This raises the question of which auxiliary tags from each input segment should be replicated on which output segment.

One crude option is to designate a leader segment and copy only its tags to all output segments.

A more subtle option is to extend the transform syntax to list which tags to copy, in the form <input segment index>[:<two character auxiliary tag code>]+. For example, 0:BC:QT would mean copy BC and QT from input segment 0.
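To make the proposed syntax concrete, here is a minimal sketch of a parser for one such token. The names TagCopyRule and parse_tag_copy_rule are hypothetical and not part of Pheniqs. The check that a tag code is a letter followed by a letter or digit follows the SAM specification's rule for tag names; everything else is an assumption about how the extended transform syntax might behave.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical type: which tags to copy from which input segment.
struct TagCopyRule {
    std::size_t segment_index;
    std::vector<std::string> tags;
};

// SAM tag names are two characters: a letter, then a letter or digit.
static bool is_tag_code(const std::string& code) {
    return code.size() == 2 &&
           std::isalpha(static_cast<unsigned char>(code[0])) &&
           std::isalnum(static_cast<unsigned char>(code[1]));
}

// Parse "<segment index>:<TAG>[:<TAG>...]", for example "0:BC:QT".
// Throws std::invalid_argument on malformed input.
TagCopyRule parse_tag_copy_rule(const std::string& token) {
    // Split on ':' and keep empty fields so "0:BC:" is rejected below.
    std::vector<std::string> fields;
    std::size_t start = 0;
    for (;;) {
        const std::size_t end = token.find(':', start);
        if (end == std::string::npos) {
            fields.push_back(token.substr(start));
            break;
        }
        fields.push_back(token.substr(start, end - start));
        start = end + 1;
    }
    if (fields.size() < 2) {
        throw std::invalid_argument("expected <index>:<tag>[:<tag>...]: " + token);
    }

    const std::string& index = fields[0];
    if (index.empty() || index.find_first_not_of("0123456789") != std::string::npos) {
        throw std::invalid_argument("bad segment index in: " + token);
    }

    TagCopyRule rule;
    rule.segment_index = std::stoul(index);
    for (std::size_t i = 1; i < fields.size(); ++i) {
        if (!is_tag_code(fields[i])) {
            throw std::invalid_argument("bad tag code in: " + token);
        }
        rule.tags.push_back(fields[i]);
    }
    assert(!rule.tags.empty());  // guaranteed by the field count check above
    return rule;
}
```

Calling parse_tag_copy_rule("0:BC:QT") yields segment index 0 with tags BC and QT, while a token such as "0:B" or "0:BC:" is rejected with an exception.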

In terms of implementation, Pheniqs does not need to decode tags it does not interact with. It is sufficient for an Auxiliary object to keep an unordered_map (a hash table) from the 2 character tag code to a byte array pointer. During decoding the pointer to the byte array is populated, and during encoding that byte array can be copied as is.
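The idea can be sketched as follows, assuming the tags arrive in the BAM binary auxiliary layout described in the SAM specification (2 byte code, 1 byte type, then the value). RawTag, RawTagIndex, index_raw_tags and copy_raw_tags are hypothetical names used for illustration, not the actual Pheniqs Auxiliary class. The sketch only measures each value to find where the next tag starts and never interprets it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical view of one still-encoded tag, pointing into the record buffer.
struct RawTag {
    const std::uint8_t* data;  // first byte of the tag (its 2 character code)
    std::size_t size;          // code + type + value, in bytes
};
using RawTagIndex = std::unordered_map<std::string, RawTag>;

// Width in bytes of a fixed-width BAM value type, or 0 if not fixed-width.
static std::size_t fixed_width(std::uint8_t type) {
    switch (type) {
        case 'A': case 'c': case 'C': return 1;
        case 's': case 'S': return 2;
        case 'i': case 'I': case 'f': return 4;
        default: return 0;
    }
}

// Measure, without decoding, the value of the given type starting at data.
static std::size_t value_size(std::uint8_t type, const std::uint8_t* data,
                              std::size_t available) {
    if (const std::size_t width = fixed_width(type)) return width;
    if (type == 'Z' || type == 'H') {  // NUL terminated string
        for (std::size_t i = 0; i < available; ++i) {
            if (data[i] == 0) return i + 1;
        }
        throw std::runtime_error("unterminated string tag");
    }
    if (type == 'B') {  // subtype byte, little-endian 32 bit count, elements
        if (available < 5) throw std::runtime_error("truncated array tag");
        const std::size_t width = fixed_width(data[0]);
        if (width == 0 || data[0] == 'A') throw std::runtime_error("bad array subtype");
        const std::uint32_t count = static_cast<std::uint32_t>(data[1]) |
                                    static_cast<std::uint32_t>(data[2]) << 8 |
                                    static_cast<std::uint32_t>(data[3]) << 16 |
                                    static_cast<std::uint32_t>(data[4]) << 24;
        return 5 + static_cast<std::size_t>(count) * width;
    }
    throw std::runtime_error("unknown tag type");
}

// "Decoding": record where each tag lives in the buffer; values stay encoded.
RawTagIndex index_raw_tags(const std::uint8_t* data, std::size_t size) {
    assert(data != nullptr || size == 0);
    RawTagIndex index;
    std::size_t offset = 0;
    while (offset < size) {
        if (size - offset < 3) throw std::runtime_error("truncated tag header");
        const std::string code(reinterpret_cast<const char*>(data + offset), 2);
        const std::size_t length =
            3 + value_size(data[offset + 2], data + offset + 3, size - offset - 3);
        if (length > size - offset) throw std::runtime_error("truncated tag value");
        index[code] = RawTag{data + offset, length};
        offset += length;
    }
    return index;
}

// "Encoding": append the requested tags to the output buffer byte for byte.
// Codes that are absent from the index are silently skipped.
void copy_raw_tags(const RawTagIndex& index, const std::vector<std::string>& codes,
                   std::vector<std::uint8_t>& out) {
    for (const std::string& code : codes) {
        const auto found = index.find(code);
        if (found != index.end()) {
            out.insert(out.end(), found->second.data,
                       found->second.data + found->second.size);
        }
    }
}
```

Because the index stores pointers into the input buffer, it is only valid while that buffer is alive and unmodified, so the copy has to happen before the input record is recycled.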

The marginal cost of this will probably be unnoticeable because it will take place while processing threads are waiting for IO.