My comment is primarily a concern for BAMs, but if it can be added to both code paths, great:
The input data fed to nimble should be unique pairs of sequence reads, where each readname is represented once per pair.
BAM is a format for storing alignments, which are not strictly the same as reads. A given read can in theory appear more than once in a BAM if it has two alignments. I don't think cellranger does this, but the BAM would be technically valid if it did.
Nimble already sorts the BAMs by UMI and read name prior to input. It then iterates over the BAM.
Would it be practical for the reader code of nimble to remember the name of the last read it encountered, and throw an exception if the next read has the same name? If simple to implement, this would provide cheap insurance against a category of issue that would be easy to have, and hard to identify.
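To make the idea concrete, here is a rough sketch in Python (chosen for brevity; this is not nimble's actual reader code). The names `check_adjacent_unique` and `DuplicateReadNameError` are hypothetical, and the sketch assumes the reader yields one name per read pair and that the existing sort has already placed any duplicates next to each other:

```python
from typing import Iterable, Iterator, Optional


class DuplicateReadNameError(ValueError):
    """Raised when two consecutive records share a read name (hypothetical)."""


def check_adjacent_unique(names: Iterable[str]) -> Iterator[str]:
    """Pass names through unchanged, failing on a consecutive repeat.

    Only the previous name is kept, so memory use is constant. This
    relies on the input being sorted so that duplicates are adjacent.
    """
    last: Optional[str] = None
    for index, name in enumerate(names):
        if name == last:
            raise DuplicateReadNameError(
                f"read name {name!r} repeated at record {index}"
            )
        last = name
        yield name
```

Because it only compares against the previous name, the check adds one string comparison per record and would not catch duplicates that are not adjacent after sorting.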