porchard closed this 7 months ago
Thanks Peter,
I have pulled this branch to our internal repository and will let it run through our CI testing. (We don't have public CI testing unfortunately).
This has been merged into our internal development branch. It's available in the prerelease branch on GitHub.
When generating the gene and transcript counts for the HTML report (section "Single cell sample summary"), the read_tags file is read in chunks, and total_genes and total_transcripts are incremented based on the number of unique genes / transcripts in each chunk. This results in double-counting of genes / transcripts that occur in multiple chunks. This PR fixes the problem by recording unique gene / transcript names in sets and simply taking the length of those sets after having processed the entire read_tags file.

In the library I'm currently looking at (reference genome hg38), before the fix total_genes and total_transcripts are 107209 and 91999, respectively, despite the fact that there are < 60k genes in the GTF file. After the fix, these values are 34521 and 77632, which seems more reasonable.
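
To illustrate the double-counting, here is a minimal sketch rather than the actual code from the repository. The function names (iter_chunks, count_per_chunk, count_with_sets), the (gene, transcript) row layout, and the toy data are all hypothetical; only the two counting strategies mirror what is described above.

```python
from itertools import islice


def iter_chunks(rows, chunk_size):
    """Yield successive lists of at most chunk_size rows (stand-in for chunked file reading)."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk


def count_per_chunk(rows, chunk_size):
    """Old behaviour: add up the unique names seen in each chunk separately."""
    total_genes = 0
    total_transcripts = 0
    for chunk in iter_chunks(rows, chunk_size):
        # A name that appears in several chunks is counted once per chunk.
        total_genes += len({gene for gene, _ in chunk})
        total_transcripts += len({tx for _, tx in chunk})
    return total_genes, total_transcripts


def count_with_sets(rows, chunk_size):
    """Fixed behaviour: accumulate names in sets, take the lengths at the end."""
    genes = set()
    transcripts = set()
    for chunk in iter_chunks(rows, chunk_size):
        genes.update(gene for gene, _ in chunk)
        transcripts.update(tx for _, tx in chunk)
    return len(genes), len(transcripts)


# Hypothetical (gene, transcript) rows; GENE_A and GENE_B recur across chunks.
rows = [
    ("GENE_A", "TX_A1"),
    ("GENE_B", "TX_B1"),
    ("GENE_A", "TX_A2"),
    ("GENE_A", "TX_A1"),
    ("GENE_B", "TX_B1"),
]

print(count_per_chunk(rows, chunk_size=2))   # inflated totals
print(count_with_sets(rows, chunk_size=2))   # true unique counts
```

With a chunk size of 2, the per-chunk version reports more genes than exist in the toy data, while the set-based version is independent of where the chunk boundaries fall. The trade-off is that the sets hold every unique name in memory for the whole pass.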