Closed IanSudbery closed 2 years ago
The RNAseqQC pipeline task "subsetRange" produces a set of subsets from the highest depth file in a run using mapper.SubsetHeads. This builds a statement by multiplying the number of required reads by 4 and asking for rows with a row number less than this:
https://github.com/cgat-developers/cgat-flow/blob/bc423e431cee5b76f2e8cdc6b8d8a44935b85a75/cgatpipelines/tasks/mapping.py#L1314-L1318
However, awk's NR counts from 1, not 0, so the strict comparison keeps one line too few and the last read in every file has only 3 lines. Really, it's necessary to use NR<=%(limit)s.
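To illustrate the off-by-one, here is a minimal Python sketch that emulates awk's 1-based NR on a toy FASTQ file. The helper `head_lines` and the toy records are hypothetical and are not part of cgat-flow; the sketch only assumes the statement compares NR against 4 times the number of reads, as described above.

```python
def head_lines(lines, limit, inclusive):
    """Emulate awk 'NR<limit' (inclusive=False) or 'NR<=limit' (inclusive=True).

    awk's NR is 1-based, so enumerate starts at 1.
    """
    return [
        line
        for nr, line in enumerate(lines, start=1)
        if (nr <= limit if inclusive else nr < limit)
    ]


# Three toy FASTQ records, 4 lines each (made-up data for illustration).
fastq = []
for i in range(1, 4):
    fastq += [f"@read{i}", "ACGT", "+", "IIII"]

n_reads = 2
limit = n_reads * 4  # 4 lines per FASTQ record

truncated = head_lines(fastq, limit, inclusive=False)  # like NR<%(limit)s
complete = head_lines(fastq, limit, inclusive=True)    # like NR<=%(limit)s

print(len(truncated))  # 7: the second read is missing its quality line
print(len(complete))   # 8: two complete reads
```

With the strict comparison the output ends on the "+" line of the last read, which matches the 3-line final record seen in the subset files.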
Also, this is SUPER slow: processing my 40M-read sample took 6 hours.
I will send a PR when I've got the whole pipeline running through.