tiny-count: support for sequence-based read counting

AlexTate commented 1 year ago

tiny-count can now perform sequence-based counting as an alternative to feature-based counting. This is useful to users who don't have GFF annotations for their experiment, or users who want to count reads against a set of known sequences.

tiny-count automatically switches to this counting mode when a user's Paths File doesn't have any GFFs listed. tinyRNA no longer requires GFF files at pipeline startup.

Technical Details

In sequence-based counting mode, Stage 1 selectors (Select for..., with value..., Classify as..., Source Filter, and Type Filter) cannot be evaluated and are therefore ignored. Stage 2 and Stage 3 are evaluated as they would be for feature-based counting.

In sequence-based counting mode, "feature" intervals are defined by the @SQ headers of input SAM files. These headers only define a sequence identifier, which is used as the "Feature ID", and a length for each sequence. These headers correspond to the reference fasta that the reads were aligned against (in bowtie's case, this is the fasta input to bowtie-build). Reads are counted for alignments to each of these reference sequences on both strands. Unlike in feature-based counting, all rules are evaluated for all reference sequences in Stages 2 and 3.
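As a rough illustration of the paragraph above, here is a minimal sketch of how @SQ headers could be turned into per-strand "feature" intervals. It is not the code from this PR: the function names, the tuple layout, the 0-based half-open coordinates, and the sample sequence names are all assumptions made for the example.

```python
from typing import Dict, Iterable, List, Tuple


def parse_sq_headers(sam_lines: Iterable[str]) -> Dict[str, int]:
    """Collect {sequence ID: length} from the @SQ header lines of a SAM file."""
    sequences: Dict[str, int] = {}
    for line in sam_lines:
        if not line.startswith("@"):
            break  # the header section ends at the first alignment record
        if not line.startswith("@SQ"):
            continue  # skip @HD, @PG, @CO, etc.
        # Each header field after the record type has the form TAG:VALUE
        fields = dict(f.split(":", 1) for f in line.rstrip("\n").split("\t")[1:])
        sequences[fields["SN"]] = int(fields["LN"])
    return sequences


def sequence_intervals(sequences: Dict[str, int]) -> List[Tuple[str, int, int, str]]:
    """Build one interval per strand spanning each reference sequence.

    Hypothetical layout: (feature ID, start, end, strand), 0-based half-open.
    """
    return [
        (seq_id, 0, length, strand)
        for seq_id, length in sequences.items()
        for strand in ("+", "-")
    ]


# Made-up header lines for demonstration only
demo_sam = [
    "@HD\tVN:1.0\tSO:unsorted\n",
    "@SQ\tSN:seq_A\tLN:22\n",
    "@SQ\tSN:seq_B\tLN:21\n",
    "@PG\tID:bowtie\n",
    "read1\t0\tseq_A\t1\t255\t4M\t*\t0\t0\tACGT\tIIII\n",
]
demo_intervals = sequence_intervals(parse_sq_headers(demo_sam))
```

Each reference sequence yields two intervals here, one per strand, and the sequence identifier doubles as the Feature ID.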

SAM @SQ headers are validated to ensure that they are present in each file, that they contain the required fields, that no identifier appears more than once in a file, and that each identifier has the same length in the headers of all input SAM files.
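The checks listed above could be sketched roughly as follows. This is an illustrative approximation, not the validation code from this PR: the function names, error messages, and the choice of ValueError are assumptions, and the required fields are taken to be SN and LN, as the SAM specification requires for @SQ lines. The sketch only compares lengths for identifiers that appear in more than one file; it does not require every file to list the same set of sequences.

```python
from typing import Dict, Iterable, Mapping

REQUIRED_TAGS = ("SN", "LN")  # tags the SAM specification requires on @SQ lines


def read_sq_headers(sam_name: str, header_lines: Iterable[str]) -> Dict[str, int]:
    """Return {identifier: length} for one file, raising on malformed headers."""
    seen: Dict[str, int] = {}
    for line in header_lines:
        if not line.startswith("@SQ"):
            continue
        fields = dict(
            f.split(":", 1)
            for f in line.rstrip("\n").split("\t")[1:]
            if ":" in f
        )
        missing = [tag for tag in REQUIRED_TAGS if tag not in fields]
        if missing:
            raise ValueError(f"{sam_name}: @SQ header is missing {missing}")
        seq_id = fields["SN"]
        if seq_id in seen:
            raise ValueError(f"{sam_name}: duplicate identifier {seq_id!r}")
        seen[seq_id] = int(fields["LN"])
    if not seen:
        raise ValueError(f"{sam_name}: no @SQ headers found")
    return seen


def validate_across_files(headers_by_file: Mapping[str, Iterable[str]]) -> Dict[str, int]:
    """Check every file, then check that shared identifiers agree on length."""
    lengths: Dict[str, int] = {}
    for sam_name, header_lines in headers_by_file.items():
        for seq_id, length in read_sq_headers(sam_name, header_lines).items():
            # setdefault records the first length seen and returns it thereafter
            if lengths.setdefault(seq_id, length) != length:
                raise ValueError(
                    f"{sam_name}: {seq_id!r} has length {length}, "
                    f"but {lengths[seq_id]} in another file"
                )
    return lengths
```

Raising on the first problem keeps the sketch short; a real implementation might instead collect all problems and report them together.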

Closes #277

AlexTate commented 1 year ago

The ReferenceFeatures class (formerly ReferenceTables) has many changes in this PR but it looks worse than it is.

The only functional changes involved copying a few methods to the base class, ReferenceBase, and refactoring the method signature of __init__() so that the selector argument is instead passed via get() (see 15e2795). All other changes are the result of reordering the class's methods to group them more logically and improve readability (see 3247ea6).

taimontgomery commented 1 year ago

Tested successfully with the ram1 and lib303 datasets, using either full feature sets or no feature sets. The lib303 dataset was aligned to cel_miRNAs.fa, and the miRNA counts were consistent with those from the full genome alignment.

MontgomeryLab / tinyRNA

tiny-count: support for sequence-based read counting #279
