Closed rpetit3 closed 1 year ago
When we validated our results, we did not see any inaccuracies. However, you are right that the counts can be wrong, especially when `@` is at the beginning of a quality score line. We are incorporating these changes in the upcoming version.
Describe the bug
The calculations for `READS_LEN` and `NUM_READS` in `downsample_rate.nf` rely on the `@` symbol at the start of a line. Unfortunately, the `@` symbol is also a quality score, which can occur at the start of a line.

Impact
As the number of quality lines that start with `@` goes up, `READS_LEN` becomes more undercounted and `NUM_READS` becomes more overcounted.

Example
Unaffected FASTQ
A FASTQ with an `@` at the start of a quality line
Using the calculations for `READS_LEN` and `NUM_READS`
With the unaffected FASTQ we get:
For the FASTQ with an @ symbol we get:
When a quality line starts with `@`, the calculation ends up using the length of the header instead of the length of the read. In the example above, this caused a difference of 136 base pairs between the two FASTQs. It also caused the count to increase.

Currently, whether or not `--rate` and `--coverage` are used, every sample goes through the `DOWNSAMPLE_RATE` process and is affected by this, making any sort of validation inaccurate.
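
To make the failure mode concrete, here is a minimal Python sketch. It is not the code from `downsample_rate.nf`: `naive_stats` is a hypothetical stand-in that assumes headers are detected by a leading `@` and that the line after each match is measured as the read, and `record_stats` is one possible alternative that assumes strict 4-line FASTQ records (no wrapped sequences). The two-read FASTQ is made up for illustration.

```python
import io


def naive_stats(handle):
    """Hypothetical fragile approach: any line starting with '@' is a header."""
    lines = [line.rstrip("\n") for line in handle]
    num_reads = 0
    total_len = 0
    for i, line in enumerate(lines):
        if line.startswith("@"):
            num_reads += 1
            # The following line is assumed to be the read sequence.
            if i + 1 < len(lines):
                total_len += len(lines[i + 1])
    return num_reads, total_len


def record_stats(handle):
    """Position-based approach: line 2 of every 4-line record is the read."""
    num_reads = 0
    total_len = 0
    for i, line in enumerate(handle):
        if i % 4 == 1:
            num_reads += 1
            total_len += len(line.rstrip("\n"))
    return num_reads, total_len


# Made-up FASTQ: the quality line of read1 starts with '@'.
FASTQ = (
    "@read1\n"
    "ACGTACGT\n"
    "+\n"
    "@IIIIIII\n"
    "@read2\n"
    "ACGTACGTAC\n"
    "+\n"
    "IIIIIIIIII\n"
)

print(naive_stats(io.StringIO(FASTQ)))   # (3, 24)
print(record_stats(io.StringIO(FASTQ)))  # (2, 18)
```

In this sketch the quality line `@IIIIIII` is counted as a third read, and the header `@read2` that follows it is measured as if it were a sequence. Counting by position within the record avoids both problems, at the cost of assuming every record is exactly four lines long.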