cgat-developers / cgat-flow

rnaseqqc - subsetRange produces invalid fastq files #145

Closed IanSudbery closed 2 years ago

IanSudbery commented 2 years ago

The RNAseqQC pipeline task "subsetRange" produces a set of subsets from the highest depth file in a run using mapper.SubsetHeads. This builds a statement by multiplying the number of required reads by 4 and asking for rows whose number is less than this:

https://github.com/cgat-developers/cgat-flow/blob/bc423e431cee5b76f2e8cdc6b8d8a44935b85a75/cgatpipelines/tasks/mapping.py#L1314-L1318

However, awk's NR counts from 1, not 0, so the output is one line short and the last read in every file has only 3 lines. Really, it's necessary to use NR<=%(limit)s.
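To make the off-by-one concrete, here is a minimal Python sketch that mimics the two awk conditions on a toy FASTQ. It is not the pipeline's code: `awk_filter` and `make_fastq` are hypothetical helpers written for this illustration, and the read names and sequences are made up.

```python
from itertools import islice


def awk_filter(lines, limit, inclusive):
    """Mimic awk 'NR < limit' or 'NR <= limit'; like awk, NR starts at 1."""
    kept = []
    for nr, line in enumerate(lines, start=1):
        if (nr <= limit) if inclusive else (nr < limit):
            kept.append(line)
    return kept


def make_fastq(n_reads):
    """Build a toy FASTQ as a list of lines, 4 lines per read (made-up data)."""
    lines = []
    for i in range(n_reads):
        lines += [f"@read{i}", "ACGT", "+", "IIII"]
    return lines


records = make_fastq(10)
n_reads = 3
limit = 4 * n_reads  # 12 lines are needed for 3 complete reads

strict = awk_filter(records, limit, inclusive=False)  # NR < limit
fixed = awk_filter(records, limit, inclusive=True)    # NR <= limit
head = list(islice(records, limit))                   # plain "first 12 lines"

print(len(strict))    # 11: the third read is missing its quality line
print(len(fixed))     # 12: three complete reads
print(head == fixed)  # True
```

With `NR < limit` the output has 4n - 1 lines, so the final record is truncated; `NR <= limit` gives the same result as simply taking the first 4n lines.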

Also, this is SUPER slow: processing my 40M-read sample took 6 hours.

I will send a PR when I've got the whole pipeline running through.