Pipeline: tagged counting repurposed as classifier

AlexTate commented 1 year ago

The Tag column has been renamed to Classify as... and will be used to apply a user-defined class to features that match the rule. The Class= attribute is no longer used to determine a feature's class. Tagged counting semantics still apply.

As a result, the counts table produced by tiny-count now has a MultiIndex of (Feature ID, Classifier). Backward compatibility is not offered for counts tables produced by earlier versions of tinyRNA. The Features Sheet is checked for a Tag column at pipeline/tiny-count startup; if one is found, an error is raised along with steps to fix it.
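For readers unfamiliar with this layout, here is a minimal pandas sketch of what a (Feature ID, Classifier) index could look like. It is not taken from tiny-count: the feature IDs, classifier names, sample columns, and counts are all made up for illustration.

```python
import pandas as pd

# Hypothetical feature IDs, classifiers, and sample columns; all counts are invented.
index = pd.MultiIndex.from_tuples(
    [
        ("gene_A", "miRNA"),
        ("gene_B", "piRNA"),
        ("gene_B", "siRNA"),  # same feature, matched by rules with different classifiers
    ],
    names=["Feature ID", "Classifier"],
)
counts = pd.DataFrame(
    {"sample_rep_1": [10.0, 4.0, 6.0], "sample_rep_2": [12.0, 5.0, 3.0]},
    index=index,
)

# Counts for a single feature, split by classifier
per_feature = counts.xs("gene_B", level="Feature ID")

# Counts pooled per classifier across all features
per_class = counts.groupby(level="Classifier").sum()
```

With this shape, a feature matched under two classifiers occupies two rows, which is one reason a table indexed by Feature ID alone cannot be read in as-is.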

These changes opened the door for some very satisfying code quality improvements in plotter.py. Two parameters have been added to pipeline/tiny-plot:

--unassigned-class: the label to use for unassigned counts in class_charts
--unknown-class: the label to use for counts assigned by rules lacking a Classify as... value. This is used in class_charts and scatter_dge_class.
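A rough sketch of how two such labels might be accepted and applied, using only argparse from the standard library. This is not the code in plotter.py: the default values, the `label_classes` helper, and the use of `None` to mark rules lacking a Classify as... value are assumptions made for illustration.

```python
import argparse

# Only the two flag names come from this PR; the defaults are placeholders.
parser = argparse.ArgumentParser(prog="plot-sketch")
parser.add_argument("--unassigned-class", default="UNASSIGNED_PLACEHOLDER",
                    help="label for unassigned counts")
parser.add_argument("--unknown-class", default="UNKNOWN_PLACEHOLDER",
                    help="label for counts from rules lacking a classifier")


def label_classes(class_counts, unassigned_count, unknown_label, unassigned_label):
    """Hypothetical helper: substitute display labels into a class -> count mapping.

    A key of None is assumed to mean "assigned by a rule with no classifier".
    """
    labeled = {}
    for cls, n in class_counts.items():
        key = unknown_label if cls is None else cls
        labeled[key] = labeled.get(key, 0) + n
    labeled[unassigned_label] = unassigned_count
    return labeled


args = parser.parse_args(["--unassigned-class", "Unassigned",
                          "--unknown-class", "Unknown"])
labeled = label_classes({"miRNA": 120, None: 30}, 50,
                        args.unknown_class, args.unassigned_class)
```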

Closes #240

AlexTate commented 1 year ago

Since this PR introduces backward-incompatible changes, I would like to make a release of the project in its current state before this one is merged.

taimontgomery commented 1 year ago

With this new, much improved approach to classification, won't the class and rule plots always be the same? If so, can we get rid of the rule plots? Perhaps we could also rename counts_by_rule.csv to counts_by_classification.csv and change the Rule String column to Classification?

AlexTate commented 1 year ago

No, class and rule plots will differ if any rules share a Classify as... value. In that case, rule plots can be used to see how much each rule contributed to the pooled classes. For this reason, I think the proposed changes to the output files would be incorrect.
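To make the distinction concrete, here is a small sketch with invented numbers. The rule indexes, classifier names, and counts are hypothetical, and this is not how tiny-count or tiny-plot compute their outputs. It only shows that pooling by classifier loses the per-rule breakdown when two rules share a value.

```python
from collections import defaultdict

# Hypothetical rules: rules 0 and 1 share the same "Classify as..." value.
rules = [
    {"rule": 0, "classify_as": "siRNA"},
    {"rule": 1, "classify_as": "siRNA"},
    {"rule": 2, "classify_as": "miRNA"},
]

# Invented per-rule counts (what a rule plot would show)
counts_by_rule = {0: 40, 1: 15, 2: 100}

# Pool per-rule counts by classifier (what a class plot would show)
counts_by_class = defaultdict(int)
for r in rules:
    counts_by_class[r["classify_as"]] += counts_by_rule[r["rule"]]

# Three rule bars collapse into two class bars; the 40/15 split
# between rules 0 and 1 is only visible at the rule level.
```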

taimontgomery commented 1 year ago

I see. In that case, perhaps we can add a counts_by_classification.csv table at some point.

taimontgomery commented 1 year ago

Tested successfully with ram1 data.

MontgomeryLab / tinyRNA

Pipeline: tagged counting repurposed as classifier #241