benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0
460 stars 141 forks source link

only few merged reads remain #980

Closed HenrikEckermann closed 4 years ago

HenrikEckermann commented 4 years ago

Hi,

I have read the FAQ and saw this question is addressed but I do not know how to proceed.

I have 742 paired end samples from Ilumnina HiSeq runs that I demultiplexed. Barcodes and primers (515F - 806Rd) are removed and I have one fastq.gz file per sample as in the dada2 tutorial. This is the first time that I process raw data for taxonomic assignment. I walked through the exact steps of the tutorial once with the example data and the with my data.

I attach some quality plots as example.

I used two strategies (bases on the FAQ). For the first attempt, I used filterAndTrim with the settings as in the tutorial. Out of the e.g. 237219 reads there remain a little less after filtering/trimming etc. But after merging it drops to 0-50. Then I tried without truncating in the hope that this might be the issue. But again only few sequences are left after merging. What can I do to find the problem?

edit: I also attached the error plot.

Screenshot 2020-03-25 at 11 41 33 Screenshot 2020-03-25 at 11 41 18 Screenshot 2020-03-25 at 11 40 40 Screenshot 2020-03-25 at 11 40 19 Screenshot 2020-03-25 at 13 23 40
benjjneb commented 4 years ago

My strong suspicion is that the amplicon you have sequenced is not actually 515F/806R, but is something else.

Could you post the output of the following, on the raw (i.e. unfiltered/untrimmed) sequencing data from an example sample?

dada2:::pfasta(getSequences(derepFastq("path/to/example_R1.fastq.gz"))[1:5])
dada2:::pfasta(getSequences(derepFastq("path/to/example_R2.fastq.gz"))[1:5])
HenrikEckermann commented 4 years ago

Hi benjjneb,

thanks so much for support.

Here the output:

Forward library 1

>1
GATCGGAAGAGCACACGTCTGAACTCCAGTCACAGAAGCCTATCTCGTATGCCGTCTTCTGCTTGAAAAGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
>2
GATCGGAAGAGCACACGTCTGAACTCCAGTCACAGAAGCCTATCGCGTATGCCGTCTTCTGCTTGAAAAGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
>3
GGTAGAATGTGTGTCAGCCGCCGCGGTAATACGTATGGTGCAAGCGTTATCCGGATTTACTGGGTGTAAAGGGAGCGCAGGCGGTGCGGCAAGTCTGATGTGAAAGCCCGGGGCTCAACCCCGGTACTGCATTGGAAACTGTCGTACTAGA
>4
AAGGAGCGGTGTGCCAGCCGCCGCGGTAATACGTATGGTGCAAGCGTTATCCGGATTTACTGGGTGTAAAGGGAGCGCAGGCGGTGCGGCAAGTCTGATGTGAAAGCCCGGGGCTCAACCCCGGTACTGCATTGGAAACTGTCGTACTAGA
>5
GATCGGAAGAGCACACGTCTGAACTCCAGTCACAGAAGCCTATCTCGTATGCCGTCTTCTGCTTGAACAGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG

Reverse library 1:

>1
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
>2
CATAAGCGGTGTGCCAGCCGCCGCGGTAATACGTAGGGTGCAAGCGTTATCCGGAATTATTGGGCGTAAAGGGCTCGTAGGCGGTTCGTCGCGTCCGGTGTGAAAGTCCATCGCTTAACGGTGGATCCGCGCCGGGTACGGGCGGGCTTGA
>3
CATAAGCGGTGTGTCAGCCGCCGCGGTAATACGTAGGGTGCAAGCGTTATCCGGAATTATTGGGCGTAAAGGGCTCGTAGGCGGTTCGTCGCGTCCGGTGTGAAAGTCCATCGCTTAACGGTGGATCCGCGCCGGGTACGGGCGGGCTTGA
>4
CGCAAGCTGTGTGTCAGCCGCCGCGGTAATACGTAGGGTGCAAGCGTTATCCGGAATTATTGGGCGTAAAGGGCTCGTAGGCGGTTCGTCGCGTCCGGTGTGAAAGTCCATCGCTTAACGGTGGATCCGCGCCGGGTACGGGCGGGCTTGA
>5
AAGGAGCGGTGTGCCAGCAGCCGCGGTAATACGTATGGTGCAAGCGTTATCCGGATTTACTGGGTGTAAAGGGAGCGCAGGCGGTGCGGCAAGTCTGATGTGAAAGCCCGGGGCTCAACCCCGGTACTGCATTGGAAACTGTCGTACTAGA
benjjneb commented 4 years ago

A few things going on here. First, as is clear from the above, you have some significant low-complexity infiltration in your libraries. You can inspect this more using plotComplexity("path/to/example.fastq.gz").

Second, these forward and reverse sequences do not appear to be in consistent expected orientations. For example, forward read 3 and Reverse read 5 are the same sequence over the non-barcode-primer region. What orientations of reads did this library preparation produce?

Third, these are 2x151 reads, so 302 total nucleotides. The primers are sequenced, and there also appears to be an additional 10nts sequence in front of the primers (barcode? adapter?). So, you're "sequenced amplicon" is ~252nts (V4) + ~20nts (primers) + ~20 nts (nts before primers) = ~310 nts. That is, they are longer than your total read length, and your reads will not overlap.

You could try to just analyze the forward reads alone, assuming the low-complexity infiltration is not so bad it overwhelms the libraries entirely. Or you could redo the sequencing with a an appropriate sequencing/library method.

HenrikEckermann commented 4 years ago

Edit:

Sorry, I misunderstood what you wanted as output. Above the output is from the fastq.gz file that is not yet demultiplexed. This mean it was an entire library instead of one sample, which is what you asked for. See below the same output for an example sample. I also attached the complexity plots for both (sample fastq file and whole library fastq file)

sample 1 forward (primers and barcodes removed)

>1
TCCTGTTTGCTACCCATGCTTTCGAGCCTCAGCGTCAGTTGGTGCCCAGTAGACCGCCTTCGCCACTGGTGTTCCTCCCGATATCTACGCATTCCACCGCTACACCGGGAATTCCATCTACC
>2
AACGTAGGGTGCAAGCGTTGTCCGGAATTACTGGGTGTAAAGGGAGCGCAGGCGGGAAGACAAGTTGGAAGTGAAAACCATGGGCTCAACCCATGAATTGCTTTCAAAACTGTTTTTCTTGA
>3
TCCTGTTTGCTCCCCACGCTTTCGAGCCTCAACGTCAGTTACCGTCCAGTAAGCCGCCTTCGCCACTGGTGTTCCTCCTAATATCTACGCATTTCACCGCTACACTAGGAATTCCGCTTACC
>4
TCCTGTTTGCTACCCACACTTTCGAGCCTCAGCGTCAGTTGGTGCCCAGTAGGCCGCCTTCGCCACTGGTGTTCCTCCCGATATCTACGCATTCCACCGCTACACCGGGAATTCCGCCTACC
>5
TACGTAGGGTGCAAGCGTTATCCGGAATTATTGGGCGTAAAGGGCTCGTAGGCGGTTCGTCGCGTCCGGTGTGAAAGTCCATCGCTTAACGGTGGATCCGCGCCGGGTACGGGCGGGCTTGA

sample 1 reverse (primers and barcode removed)

>1
CCTGTTTGCTACCCATGCTTTCGAGCCTCAGCGTCAGTTGGTGCCCAGTAGACCGCCTTCGCCACTGGTGTTCCTCCCGATATCTACGCATTCCACCGCTACACCGGGAATTCCATCTACC
>2
ACGTAGGGTGCAAGCGTTGTCCGGAATTACTGGGTGTAAAGGGAGCGCAGGCGGGAAGACAAGTTGGAAGTGAAAACCATGGGCTCAACCCATGAATTGCTTTCAAAACTGTTTTTCTTGA
>3
CCTGTTTGCTCCCCACGCTTTCGAGCCTCAACGTCAGTTACCGTCCAGTAAGCCGCCTTCGCCACTGGTGTTCCTCCTAATATCTACGCATTTCACCGCTACACTAGGAATTCCGCTTACC
>4
CCTGTTTGCTACCCACACTTTCGAGCCTCAGCGTCAGTTGGTGCCCAGTAGGCCGCCTTCGCCACTGGTGTTCCTCCCGATATCTACGCATTCCACCGCTACACCGGGAATTCCGCCTACC
>5
ACGTAGGGTGCAAGCGTTATCCGGAATTATTGGGCGTAAAGGGCTCGTAGGCGGTTCGTCGCGTCCGGTGTGAAAGTCCATCGCTTAACGGTGGATCCGCGCCGGGTACGGGCGGGCTTGA

Complexity for this one sample:

plotComplexity(sample_1_F)
Screenshot 2020-03-26 at 16 39 13

Complexity for the entire first library (forward reads, primers and barcodes present)

plotComplexity(library_1_F)
Screenshot 2020-03-26 at 16 41 41

P.S.:

I will figure out the other questions with my collaborators who sequenced the samples.

benjjneb commented 4 years ago

Yes it is OK to look at the primer-removed files as well. There you see your forward reads are 121 nts and reverse reads are 122 nts, which is not enough to overlap with each other, since the targeted V4 amplicon is ~252nts, and 252 > 121+122.

Given that the demultiplexing seems to remove the low-complexity part of the data (i.e. the secondary mode in the library plot), you are probably fine to go forward with this data, but just using the forward reads.

HenrikEckermann commented 4 years ago

Thanks a lot for your time and effort benjjneb. With that information I can go on!