Streamlining HTML report code

standage commented 6 months ago

The purpose of this branch is to clean up the code responsible for collating and rendering the HTML report for the end-to-end analysis pipeline.

[x] Changes are clearly described above
[x] Any relevant issue threads are referenced in the description
[x] Any new features are tested (see the development manual for details)
[x] CLI documentation (see docs/cli.md) and Python API documentation (see microhapulator/api.py) are up-to-date and in sync
[x] Substantial changes are documented in CHANGELOG.md (see https://keepachangelog.com/en/1.0.0/)

standage commented 6 months ago

I've drafted some classes for aggregating the data for the first QC section of the report in qcstats.py and qcsummaries.py. The following demonstrates how to access the variables needed to populate the report.

>>> from microhapulator.qcsummary import PairedReadQCSummary, SingleEndReadQCSummary
>>> 
>>> qc = SingleEndReadQCSummary.collect(["SRM8398-1", "SRM8398-2", "SRM8398-3"], workdir="scratch/WD_clean2/")
>>> for sample, stats in qc.items():
...   print(sample, stats.total_reads, stats.filtered_ambig, stats.filtered_length, stats.retention, sep="\t")
... 
SRM8398-1       211,819 106,476 (50.3%) 6,092 (2.9%)    99,251 (46.9%)
SRM8398-2       328,618 180,940 (55.1%) 7,471 (2.3%)    140,207 (42.7%)
SRM8398-3       245,967 107,726 (43.8%) 8,225 (3.3%)    130,016 (52.9%)
>>> 
>>> 
>>> qc = PairedReadQCSummary.collect(["SRM8398-1", "SRM8398-2", "SRM8398-3"], workdir="scratch/WD_nimagen_testC")
>>> for sample, stats in qc.items():
...     print(sample, stats.ambig.total_reads, stats.ambig.excluded_r1, stats.ambig.excluded_r2, stats.ambig.excluded_both, stats.ambig.excluded, stats.ambig.retained, stats.ambig.retention_rate, sep="\t")
... 
SRM8398-1       134,086 827     477     42,698  44,002  90084   67.2%
SRM8398-2       208,872 1,152   1,767   77,869  80,788  128084  61.3%
SRM8398-3       167,704 883     462     45,617  46,962  120742  72.0%
>>> 
>>> for sample, stats in qc.items():
...   print(sample, stats.merge.total_reads, stats.merge.merged_reads, stats.merge.merge_rate, sep="\t")
... 
SRM8398-1       90,084  89,099  98.9%
SRM8398-2       128,084 126,633 98.9%
SRM8398-3       120,742 119,772 99.2%
>>> 
>>> for sample, stats in qc.items():
...   print(sample, stats.length.total_reads, stats.length.excluded, stats.length.kept, stats.length.retention_rate, sep="\t")
... 
SRM8398-1       89,099  2,452   86,647  97.2%
SRM8398-2       126,633 3,298   123,335 97.4%
SRM8398-3       119,772 3,546   116,226 97.0%
>>>
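
For reference, the columns in the single-end table above are related by simple arithmetic: the retained count is the total minus both filtered counts, and each percentage is relative to the sample's total reads. The sketch below reproduces the SRM8398-1 row from the printed numbers. The helper name `count_with_percent` is hypothetical and is not part of the MicroHapulator API; it only illustrates one way such "count (percent)" strings could be built, assuming a non-zero total.

```python
def count_with_percent(count: int, total: int) -> str:
    """Format a count with thousands separators and its share of the total.

    Hypothetical helper for illustration only; assumes total > 0.
    """
    return f"{count:,} ({count / total:.1%})"


# Values copied from the SRM8398-1 row of the single-end output above.
total_reads = 211_819
filtered_ambig = 106_476
filtered_length = 6_092

# Reads that survive both filters.
retained = total_reads - filtered_ambig - filtered_length

print(count_with_percent(filtered_ambig, total_reads))   # 106,476 (50.3%)
print(count_with_percent(filtered_length, total_reads))  # 6,092 (2.9%)
print(count_with_percent(retained, total_reads))         # 99,251 (46.9%)
```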

agshumate commented 6 months ago

The code definitely looks a lot cleaner, and the functional review looks good too. One note: if we implement other filters for paired reads in the future, we should generalize the PairedAmbiguityFilterStats class to just PairedFilterStats, since we will want to track the same stats for any filtering that we do. But I don't think we need to worry about that yet, since the only filter we currently have for paired-end data is ambiguity filtering.
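
A minimal sketch of what that generalization might look like, assuming the stats can be derived from four raw counts. The attribute names (`total_reads`, `excluded_r1`, `excluded_r2`, `excluded_both`, `excluded`, `retained`, `retention_rate`) mirror those printed in the session above. The constructor signature and the `filter_name` field are assumptions, and this version returns plain numbers rather than the formatted strings the current classes appear to produce.

```python
from dataclasses import dataclass


@dataclass
class PairedFilterStats:
    """Hypothetical filter-agnostic replacement for PairedAmbiguityFilterStats.

    Not the project's actual class: field layout and constructor are assumed.
    """

    filter_name: str    # assumed label, e.g. "ambiguity"
    total_reads: int    # read pairs entering the filter
    excluded_r1: int    # pairs excluded because only R1 failed
    excluded_r2: int    # pairs excluded because only R2 failed
    excluded_both: int  # pairs excluded because both mates failed

    @property
    def excluded(self) -> int:
        """Total pairs excluded by this filter."""
        return self.excluded_r1 + self.excluded_r2 + self.excluded_both

    @property
    def retained(self) -> int:
        """Pairs that pass the filter."""
        return self.total_reads - self.excluded

    @property
    def retention_rate(self) -> float:
        """Fraction of pairs retained; 0.0 when there are no input pairs."""
        return self.retained / self.total_reads if self.total_reads else 0.0


# Counts copied from the SRM8398-1 row of the paired-end ambiguity output.
ambig = PairedFilterStats("ambiguity", 134_086, 827, 477, 42_698)
print(ambig.excluded, ambig.retained, f"{ambig.retention_rate:.1%}")
# 44002 90084 67.2%
```

With this shape, a future paired-end filter would only need to supply its own four counts and a label, and the report code could treat every filter uniformly.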

bioforensics / MicroHapulator

Streamlining HTML report code #175