Fix double-counting of genes and transcripts in report

porchard commented 8 months ago

When generating the gene and transcript counts for the HTML report (section "Single cell sample summary"), the read_tags file is read in chunks, and total_genes and total_transcripts are incremented based on the number of unique genes / transcripts in each chunk. This results in double-counting of genes / transcripts that occur in multiple chunks. This PR fixes the problem by recording unique gene / transcript names in a set and simply taking the length of those sets after having processed the entire read_tags file.
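To illustrate the difference, here is a minimal sketch using only the standard library. It is not the workflow's actual code: the column names `gene` and `transcript`, the helper names, and the toy data are all assumptions made for the example. It contrasts summing per-chunk unique counts with accumulating names in a set across chunks.

```python
import csv
import io
from itertools import islice


def iter_chunks(handle, chunksize):
    """Yield lists of row dicts, at most `chunksize` rows each (hypothetical helper)."""
    reader = csv.DictReader(handle, delimiter="\t")
    while True:
        chunk = list(islice(reader, chunksize))
        if not chunk:
            return
        yield chunk


def count_per_chunk(chunks, column):
    """Old behaviour: add up the unique values seen in each chunk separately."""
    return sum(len({row[column] for row in chunk}) for chunk in chunks)


def count_unique(chunks, column):
    """Fixed behaviour: collect values in one set, take its length at the end."""
    seen = set()
    for chunk in chunks:
        seen.update(row[column] for row in chunk)
    return len(seen)


# Toy read_tags-like table; the layout is assumed for illustration only.
TOY = (
    "read_id\tgene\ttranscript\n"
    "r1\tG1\tT1\n"
    "r2\tG2\tT2\n"
    "r3\tG1\tT3\n"
    "r4\tG2\tT2\n"
)

# With a chunk size of 2, G1 and G2 each appear in both chunks.
buggy_genes = count_per_chunk(iter_chunks(io.StringIO(TOY), 2), "gene")
fixed_genes = count_unique(iter_chunks(io.StringIO(TOY), 2), "gene")
fixed_transcripts = count_unique(iter_chunks(io.StringIO(TOY), 2), "transcript")
```

In this toy input the per-chunk sum reports 4 genes although only 2 distinct genes exist, while the set-based count reports 2. The set only holds unique names, so its memory use is bounded by the number of distinct genes / transcripts rather than by the number of reads.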

In the library I'm currently looking at (reference genome hg38), before the fix total_genes and total_transcripts are 107209 and 91999, respectively, despite the fact that there are < 60k genes in the GTF file. After the fix, these values are 34521 and 77632, which seems more reasonable.

cjw85 commented 7 months ago

Thanks Peter,

I have pulled this branch into our internal repository and will let it run through our CI testing. (Unfortunately, we don't have public CI testing.)

cjw85 commented 7 months ago

This has been merged into our internal development branch. It's available in the prerelease branch on GitHub.

epi2me-labs / wf-single-cell

Fix double-counting of genes and transcripts in report #84