CDCgov / mycosnp-nf

CDCgov/mycosnp-nf
Apache License 2.0
36 stars 33 forks source link

Urgent: downsample_rate.nf incorrectly counts total basepairs and reads #90

Closed rpetit3 closed 1 year ago

rpetit3 commented 1 year ago

Describe the bug The calcualtions for READS_LEN and NUM_READS in downsample_rate.nf use the @ symbol at the start of the line. Unfortunately, the @ symbol is also a quality score which can occur at the start of line.

Impact As the number of quality lines that start with @ goes up, READS_LEN becomes more undercounted and NUM_READS becomes more overcounted.

Example

Unaffected FASTQ

@SRR13965548.14
GTCCTAGATGCAGCTCGCCGCCGTGATAATGGCGAAAACATTAGTACAGAAGCATCTTGCGCAAAAATGTTTGCGACTGAAATGTGCGGACGTGTTGCTGACCGCTGTGTACAGATTCACGGTGGTGCGGGCTATATCAGTGAATATGCCA
+
BBBBBFFBF5@45CGB22EEEEEECDHFBHHFBEGEGGHHHHHHHHHHHHHHHHHHHHHHGGGGGHHHHHHHHHGGGGGHHGHHHHHGGGGGGGGHHHHHHHHGGGGGHHHHHHHHHHHHHGGHGGGHGGGGGHHHHHGHGHHHHHGHGHG
@SRR13965548.15
TGGTTTCTCCTTGACTCTCTCTTAATCTAGCTTCAGATCAAGATAAAAAGGAATACTCCATGAAAAAGCCTATTGTTCTCGTACTTTCAACATTAATGTTGGGTATGTCGGCAACTGCGATGGCCGATTCAAATCATCGTTGGGATAACTA
+
AAAAAFFFFFFF11AFEGEGDGHBBGHHBBGHHHHHBHHHHGGHHHHHHGGHHGHHHHHHHHHHHHHGHHHHHHGHHHHHGGGHHHHHHHHHHHHHHHFHHHGGHHHHHHGGGGHHHHGGHGGHHGGGGHHHFHHHHHHHHGHGHHGHHHH

A FASTQ with a @ at the start of a quality line

@SRR13965548.14
GTCCTAGATGCAGCTCGCCGCCGTGATAATGGCGAAAACATTAGTACAGAAGCATCTTGCGCAAAAATGTTTGCGACTGAAATGTGCGGACGTGTTGCTGACCGCTGTGTACAGATTCACGGTGGTGCGGGCTATATCAGTGAATATGCCA
+
@BBBBFFBF5@45CGB22EEEEEECDHFBHHFBEGEGGHHHHHHHHHHHHHHHHHHHHHHGGGGGHHHHHHHHHGGGGGHHGHHHHHGGGGGGGGHHHHHHHHGGGGGHHHHHHHHHHHHHGGHGGGHGGGGGHHHHHGHGHHHHHGHGHG
@SRR13965548.15
TGGTTTCTCCTTGACTCTCTCTTAATCTAGCTTCAGATCAAGATAAAAAGGAATACTCCATGAAAAAGCCTATTGTTCTCGTACTTTCAACATTAATGTTGGGTATGTCGGCAACTGCGATGGCCGATTCAAATCATCGTTGGGATAACTA
+
AAAAAFFFFFFF11AFEGEGDGHBBGHHBBGHHHHHBHHHHGGHHHHHHGGHHGHHHHHHHHHHHHHGHHHHHHGHHHHHGGGHHHHHHHHHHHHHHHFHHHGGHHHHHHGGGGHHHHGGHGGHHGGGGHHHFHHHHHHHHGHGHHGHHHH

Using the calculations for READS_LEN and NUM_READS

READS_LEN=\$(zcat ${reads} | awk '/^@/ {getline; len+=length(\$0)} END {print len}')
NUM_READS=\$(zcat ${reads[0]} | awk '/^@/ {lines+=1} END {print lines}')

With the unaffected fastq we get:

# READS_LEN
cat good.fastq | awk '/^@/ {getline; len+=length($0)} END {print len}'
302

# NUM_READS
cat good.fastq | awk '/^@/ {lines+=1} END {print lines}'
2

For the FASTQ with an @ symbol we get:

# READS_LEN
cat good.fastq | awk '/^@/ {getline; len+=length($0)} END {print len}'
166

# NUM_READS
cat good.fastq | awk '/^@/ {lines+=1} END {print lines}'
3

When a quality line starts with @ instead of using the length of the read, it ends up using the length of the header. In the example above this caused a difference of 136 base pairs between the two FASTQs. This also caused count to increase.

Currently whether --rate and --coverage is used or not, every sample goes through the DOWNSAMPLE_RATE process and is affected by this, making any sort of validations inaccurate.

urbagal commented 1 year ago

When we validated our results, we did not see any inaccuracies. But, you are right there are chances of getting wrong counts especially when '@" is at the beginning of the line for quality score. We are incorporating these changes in the upcoming version.