PoonLab / gromstole

Quantifying SARS-CoV-2 VoCs from NGS data of wastewater samples
MIT License

Diagnose sawtooth coverage issue #16

Closed: ArtPoon closed this issue 2 years ago

ArtPoon commented 2 years ago

Many coverage plots are turning out like this:

This doesn't make sense because the tiled amplicons should be overlapping, which should result in a more even coverage plot, like this:

ArtPoon commented 2 years ago

I reprocessed the FASTQ files for the first example to capture the cutadapt message logs to check if the problem is due to trimming low quality bases from the ends of reads:

```
Processing reads on 4 cores in paired-end mode ...
Finished in 7.84 s (48 us/read; 1.24 M reads/minute).

=== Summary ===

Total read pairs processed:            162,416
  Read 1 with adapter:                   3,622 (2.2%)
  Read 2 with adapter:                   3,548 (2.2%)
Pairs that were too short:                   0 (0.0%)
Pairs written (passing filters):       162,416 (100.0%)

Total basepairs processed:    57,860,326 bp
  Read 1:    28,847,779 bp
  Read 2:    29,012,547 bp
Total written (filtered):     57,833,727 bp (100.0%)
  Read 1:    28,834,147 bp
  Read 2:    28,999,580 bp

=== First read: Adapter 1 ===

Sequence: AGATCGGAAGAGC; Type: regular 3'; Length: 13; Trimmed: 3622 times.

No. of allowed errors:
0-9 bp: 0; 10-13 bp: 1

Bases preceding removed adapters:
  A: 42.5%
  C: 20.9%
  G: 19.3%
  T: 17.3%
  none/other: 0.0%

Overview of removed sequences
length  count   expect  max.err  error counts
3       2140    2537.8  0        2140
4       832     634.4   0        832
5       580     158.6   0        580
6       9       39.7    0        9
7       6       9.9     0        6
10      52      0.2     1        0 52
72      1       0.0     1        0 1
122     1       0.0     1        0 1
174     1       0.0     1        0 1

=== Second read: Adapter 2 ===

Sequence: AGATCGGAAGAGC; Type: regular 3'; Length: 13; Trimmed: 3548 times.

No. of allowed errors:
0-9 bp: 0; 10-13 bp: 1

Bases preceding removed adapters:
  A: 40.5%
  C: 21.6%
  G: 19.0%
  T: 19.0%
  none/other: 0.0%

Overview of removed sequences
length  count   expect  max.err  error counts
3       2160    2537.8  0        2160
4       791     634.4   0        791
5       533     158.6   0        533
6       10      39.7    0        10
7       11      9.9     0        11
10      41      0.2     1        0 41
23      1       0.0     1        0 1
88      1       0.0     1        0 1
```

This pair of example FASTQs has 162,416 reads each.

ArtPoon commented 2 years ago

Only about 2% of reads had any adapter trimmed at all, and nearly all of those trims were just a few bases, so it's unlikely this is the problem.

ArtPoon commented 2 years ago

I guess the next step is a sanity check: grab a short substring from a region with no coverage and grep the original data files for it.
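A minimal sketch of that sanity check in Python (the k-mer and file path below are placeholders, not values from this dataset): scan each sequence line of a gzipped FASTQ for the substring, also checking its reverse complement since the read may come from the opposite strand.

```python
import gzip

def count_kmer(fastq_gz, kmer):
    """Count reads containing kmer or its reverse complement."""
    comp = str.maketrans('ACGT', 'TGCA')
    rc = kmer.translate(comp)[::-1]  # reverse complement
    hits = 0
    with gzip.open(fastq_gz, 'rt') as handle:
        for i, line in enumerate(handle):
            # FASTQ records are 4 lines; sequence is the second line
            if i % 4 == 1 and (kmer in line or rc in line):
                hits += 1
    return hits

# e.g.: count_kmer('sample_R1.fastq.gz', 'ACTGACTGACTGACTGACTG')
```

If this returns zero across the raw files, the gap is real in the data rather than an artifact of our pipeline.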

ArtPoon commented 2 years ago

Is it possible that we're somehow systematically throwing out the second read of every pair?

ArtPoon commented 2 years ago

I just dug into the original FASTQ files - it looks like the problem is introduced upstream of us. Using the following script:

```python
from pathlib import Path
from collections import Counter
import gzip
import itertools
import os

outfile = open('lengths.csv', 'w')
outfile.write("filename,length,count\n")

for path in Path('/data/wastewater/uploads/').rglob('*.fastq.gz'):
    f = os.path.basename(path)
    print(f)
    # tally of read length -> number of reads in this file
    lengths = Counter()
    with gzip.open(path, 'rt') as fs:
        # step through FASTQ records, four lines at a time
        for header, seq, plus, qual in itertools.zip_longest(fs, fs, fs, fs):
            lengths[len(seq.strip())] += 1

    for l, count in lengths.items():
        outfile.write('{},{},{}\n'.format(f, l, count))

outfile.close()
```
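A quick way to summarize the resulting lengths.csv, e.g. the fraction of reads shorter than 100 nt per file (a hypothetical helper, not part of the repo):

```python
import csv
from collections import defaultdict

def short_read_fraction(csv_path, cutoff=100):
    """Per-file fraction of reads with length below cutoff."""
    total = defaultdict(int)
    short = defaultdict(int)
    with open(csv_path) as handle:
        for row in csv.DictReader(handle):
            n = int(row['count'])
            total[row['filename']] += n
            if int(row['length']) < cutoff:
                short[row['filename']] += n
    return {f: short[f] / total[f] for f in total}
```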
ArtPoon commented 2 years ago

[plot of read-length distributions per FASTQ file]

ArtPoon commented 2 years ago

Over half of the reads in the above plot are below 100 nt in length.

ArtPoon commented 2 years ago

Isaac in Trevor's lab reminded us that the amplicons are in two pools, so if there is a severe imbalance in quantification between pools then we'd get a sawtooth pattern.
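A toy simulation of that explanation (the amplicon coordinates and read counts are made up, not the real ARTIC primer scheme): tiled amplicons alternate between pools A and B, so if pool B is badly under-represented, regions covered only by pool B amplicons collapse while overlap regions keep pool A depth, producing the sawtooth.

```python
def coverage(amplicons, genome_len, reads):
    """Per-position depth from (start, end) amplicons with given read counts."""
    depth = [0] * genome_len
    for (start, end), n in zip(amplicons, reads):
        for pos in range(start, min(end, genome_len)):
            depth[pos] += n
    return depth

# eight 400 nt amplicons tiled every 300 nt (~100 nt overlaps) - illustrative only
amplicons = [(i * 300, i * 300 + 400) for i in range(8)]

balanced = coverage(amplicons, 2500, [100] * 8)
# pool A = even-indexed amplicons, pool B = odd-indexed; B at 1% of A
skewed = coverage(amplicons, 2500, [100 if i % 2 == 0 else 1 for i in range(8)])
# skewed depth oscillates between ~100 (pool A regions) and ~1 (pool-B-only regions)
```

Regions served only by the depleted pool drop to near zero between the surviving amplicons, which matches the periodic troughs in the coverage plots above.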