I reprocessed the FASTQ files for the first example to capture the cutadapt logs and check whether the problem is due to trimming low-quality bases from the ends of reads:
```
Processing reads on 4 cores in paired-end mode ...
Finished in 7.84 s (48 us/read; 1.24 M reads/minute).

=== Summary ===

Total read pairs processed:        162,416
  Read 1 with adapter:               3,622 (2.2%)
  Read 2 with adapter:               3,548 (2.2%)
Pairs that were too short:               0 (0.0%)
Pairs written (passing filters):   162,416 (100.0%)

Total basepairs processed: 57,860,326 bp
  Read 1:  28,847,779 bp
  Read 2:  29,012,547 bp
Total written (filtered):  57,833,727 bp (100.0%)
  Read 1:  28,834,147 bp
  Read 2:  28,999,580 bp

=== First read: Adapter 1 ===

Sequence: AGATCGGAAGAGC; Type: regular 3'; Length: 13; Trimmed: 3622 times.

No. of allowed errors:
0-9 bp: 0; 10-13 bp: 1

Bases preceding removed adapters:
  A: 42.5%
  C: 20.9%
  G: 19.3%
  T: 17.3%
  none/other: 0.0%

Overview of removed sequences
length  count  expect  max.err  error counts
3       2140   2537.8  0        2140
4       832    634.4   0        832
5       580    158.6   0        580
6       9      39.7    0        9
7       6      9.9     0        6
10      52     0.2     1        0 52
72      1      0.0     1        0 1
122     1      0.0     1        0 1
174     1      0.0     1        0 1

=== Second read: Adapter 2 ===

Sequence: AGATCGGAAGAGC; Type: regular 3'; Length: 13; Trimmed: 3548 times.

No. of allowed errors:
0-9 bp: 0; 10-13 bp: 1

Bases preceding removed adapters:
  A: 40.5%
  C: 21.6%
  G: 19.0%
  T: 19.0%
  none/other: 0.0%

Overview of removed sequences
length  count  expect  max.err  error counts
3       2160   2537.8  0        2160
4       791    634.4   0        791
5       533    158.6   0        533
6       10     39.7    0        10
7       11     9.9     0        11
10      41     0.2     1        0 41
23      1      0.0     1        0 1
88      1      0.0     1        0 1
```
This pair of example FASTQs has 162,416 reads each, and cutadapt removed only ~26,600 bp out of 57.9 Mbp (~0.05%), so it's unlikely that trimming is the problem.
I guess the next step is a sanity check: grab a short substring from a region with no coverage and grep the original data files for it.
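Something along these lines should work (a minimal sketch; the 20-mer below is a placeholder for a sequence taken from a zero-coverage region, and it would be worth checking its reverse complement too):

```python
import gzip
from pathlib import Path

query = "ACGTACGTACGTACGTACGT"  # placeholder: k-mer from a zero-coverage region

for path in Path('/data/wastewater/uploads/').rglob('*.fastq.gz'):
    with gzip.open(path, 'rt') as handle:
        # the sequence is the 2nd line of every 4-line FASTQ record
        hits = sum(query in line for i, line in enumerate(handle) if i % 4 == 1)
    if hits:
        print(f"{path}: {hits} reads contain the query")
```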
Is it possible that we're somehow systematically throwing out the second read of every pair?
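One quick way to rule that out is to compare record counts between mates in each pair of files (a sketch, assuming the usual _R1/_R2 naming convention):

```python
import gzip
from pathlib import Path

for r1 in Path('/data/wastewater/uploads/').rglob('*_R1*.fastq.gz'):
    r2 = Path(str(r1).replace('_R1', '_R2'))
    counts = []
    for path in (r1, r2):
        with gzip.open(path, 'rt') as handle:
            counts.append(sum(1 for _ in handle) // 4)  # 4 lines per record
    if counts[0] != counts[1]:
        print(f"{r1.name}: R1 has {counts[0]} reads, R2 has {counts[1]}")
```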
I just dug into the original FASTQ files - it looks like something is happening upstream of us. I tallied read lengths with the following script:
```python
from pathlib import Path
from collections import Counter
import gzip
import itertools

with open('lengths.csv', 'w') as outfile:
    outfile.write("filename,length,count\n")
    for path in Path('/data/wastewater/uploads/').rglob('*.fastq.gz'):
        print(path.name)
        lengths = Counter()  # read length -> number of reads
        with gzip.open(path, 'rt') as fs:
            # zip_longest(fs, fs, fs, fs) yields one 4-line FASTQ record per iteration
            for header, seq, _, qual in itertools.zip_longest(fs, fs, fs, fs):
                lengths[len(seq.strip())] += 1
        for l, count in lengths.items():
            outfile.write('{},{},{}\n'.format(path.name, l, count))
```
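For anyone who wants to reproduce the plot from lengths.csv, a minimal sketch (assuming pandas and matplotlib are available):

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('lengths.csv')
# pool counts across all files to get the overall read-length distribution
totals = df.groupby('length')['count'].sum()
plt.plot(totals.index, totals.values)
plt.xlabel('read length (nt)')
plt.ylabel('number of reads')
plt.savefig('lengths.png')
```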
Over half of the reads in the resulting distribution are below 100 nt in length.
Isaac in Trevor's lab reminded us that the amplicons are in two pools, so if there is a severe imbalance in quantification between pools then we'd get a sawtooth pattern.
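To illustrate the idea, here's a toy simulation (illustrative numbers only, not the real amplicon coordinates): alternating amplicons belong to pools 1 and 2, and when pool 2 is badly under-represented, coverage oscillates between the depths of the two pools.

```python
import numpy as np

genome_len = 3000
amplicon_len, step = 400, 300   # ~100 nt overlap between neighbouring amplicons
pool_depths = [1000, 50]        # severe imbalance between the two pools

coverage = np.zeros(genome_len, dtype=int)
for i, start in enumerate(range(0, genome_len - amplicon_len + 1, step)):
    # neighbouring amplicons alternate between pool 1 and pool 2
    coverage[start:start + amplicon_len] += pool_depths[i % 2]

print(coverage[::100])  # depth jumps between high and low plateaus: a sawtooth
```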
Many coverage plots are turning out like this:

[coverage plot with a sawtooth pattern of alternating high- and low-depth regions]

This doesn't make sense, because the tiled amplicons should be overlapping, which should result in a more even coverage plot, like this:

[coverage plot with roughly uniform depth across the genome]