benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0

Large number of (spurious?) ASVs with NovaSeq #2017

Open msalamon2 opened 1 month ago

msalamon2 commented 1 month ago

Hello,

I have a dataset sequenced on a NovaSeq 6000 (250 bp paired-end) for 77 libraries and four primers: 12S, COI, ITS and 18S, which gave ~1-6 million paired reads per library, with a few exceptions below 1M reads. Before running DADA2, I used cutadapt to remove primers, the Nextera transposase sequence and polyG tails in sequential steps (this was probably neither necessary nor the best approach), using --discard-untrimmed for the primer-removal step, --nextseq-trim=20 for all steps as is recommended for NovaSeq, and discarding reads < 10 bp with -m 10 at the end. I am worried about the high number of ASVs output by DADA2 for each primer, which ranges from 33K (ITS, using the ITS workflow) to 80K (18S). I used the following parameters for filterAndTrim, changing truncLen for each primer (except ITS) based on the length at which quality drops in the post-cutadapt quality profiles and on the read length distribution checked with fastqc (not all reads seemed to be the same length post-cutadapt):

out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs, maxN = 0, maxEE = c(2,2), truncQ = 2, rm.phix = TRUE, truncLen = c(185,185), compress = TRUE, multithread = FALSE)

I also used the custom loess function (modified weights, span, and degree) for error learning to accommodate the binned quality scores, based on the solution proposed by @hhollandmoritz in issue #1307. I used pool="pseudo" for the denoising step, but not for any other step.
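
For readers following along, a minimal sketch of how a custom error-estimation function plugs into the error-learning step is below; filtFs/filtRs are the filtered files from filterAndTrim above, and the function body is only an adaptation of dada2's loessErrfun with the commonly suggested changes for binned quality scores, so the exact version from issue #1307 should be preferred:

```r
library(dada2)

# Sketch of a modified loess error function for binned (NovaSeq) quality scores.
# Adapted from dada2's loessErrfun, with the loess() call changed (weights,
# degree, span) along the lines discussed in issue #1307; that thread's version
# additionally enforces monotonically decreasing error rates, omitted here.
loessErrfun_mod <- function(trans) {
  qq <- as.numeric(colnames(trans))
  est <- matrix(0, nrow = 0, ncol = length(qq))
  for (nti in c("A", "C", "G", "T")) {
    for (ntj in c("A", "C", "G", "T")) {
      if (nti != ntj) {
        errs <- trans[paste0(nti, "2", ntj), ]
        tot <- colSums(trans[paste0(nti, "2", c("A", "C", "G", "T")), ])
        rlogp <- log10((errs + 1) / tot)   # pseudocount avoids log10(0)
        rlogp[is.infinite(rlogp)] <- NA
        df <- data.frame(q = qq, errs = errs, tot = tot, rlogp = rlogp)
        # Modified fit for binned quality scores
        mod.lo <- loess(rlogp ~ q, df, weights = log10(tot), degree = 1, span = 0.95)
        pred <- predict(mod.lo, qq)
        maxrli <- max(which(!is.na(pred)))
        minrli <- min(which(!is.na(pred)))
        pred[seq_along(pred) > maxrli] <- pred[[maxrli]]
        pred[seq_along(pred) < minrli] <- pred[[minrli]]
        est <- rbind(est, 10^pred)
      }
    }
  }
  # Cap the rates and add the self-transition rows, as in dada2's loessErrfun
  est[est > 0.25] <- 0.25
  est[est < 1e-7] <- 1e-7
  err <- rbind(1 - colSums(est[1:3, ]), est[1:3, ],
               est[4, ], 1 - colSums(est[4:6, ]), est[5:6, ],
               est[7:8, ], 1 - colSums(est[7:9, ]), est[9, ],
               est[10:12, ], 1 - colSums(est[10:12, ]))
  rownames(err) <- paste0(rep(c("A", "C", "G", "T"), each = 4), "2", c("A", "C", "G", "T"))
  colnames(err) <- colnames(trans)
  err
}

# Pass the custom function to the error-learning step
errF <- learnErrors(filtFs, errorEstimationFunction = loessErrfun_mod, multithread = TRUE)
errR <- learnErrors(filtRs, errorEstimationFunction = loessErrfun_mod, multithread = TRUE)
```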

The quality scores post-trimming and filtering looked good, as did the error profiles and the summary table of the % of reads passing each step. Here is an example for 12S: BCV001_12SMimammal_trimmedDADA2.pdf plot_errors_DADA2_eDNA_12SMimammal_F.pdf plot_errors_DADA2_eDNA_12SMimammal_R.pdf

Read-tracking summary (first 20 libraries):

library input filtered %filtered denoisedF %denoisedF denoisedR %denoisedR merged %merged nonchim %nonchim
BCV001 2823928 2667809 94.5% 2666187 99.9% 2666947 100.0% 2428653 91.1% 2384802 98.2%
BCV003 2092201 1863761 89.1% 1862701 99.9% 1862117 99.9% 1715028 92.1% 1674729 97.7%
BCV005 2493695 2027565 81.3% 2022514 99.8% 2023148 99.8% 1937985 95.8% 1848512 95.4%
BCV007 1963656 1682390 85.7% 1680199 99.9% 1680751 99.9% 1597896 95.1% 1535223 96.1%
BCV013 2148690 1897033 88.3% 1888743 99.6% 1888180 99.5% 1823632 96.6% 1703895 93.4%
BCV017 5607083 5017738 89.5% 5002841 99.7% 5000305 99.7% 4796093 95.9% 4444394 92.7%
BRB001 2586684 2335123 90.3% 2331179 99.8% 2332045 99.9% 2183002 93.6% 2094832 96.0%
BRB003 1934973 1680419 86.8% 1677359 99.8% 1678678 99.9% 1560261 93.0% 1457446 93.4%
BRB005 1901444 1704402 89.6% 1700311 99.8% 1701399 99.8% 1614046 94.9% 1544056 95.7%
BRB007 1662543 1492800 89.8% 1488955 99.7% 1490314 99.8% 1411359 94.8% 1322694 93.7%
GAU012 2850817 2566882 90.0% 2565167 99.9% 2565711 100.0% 2314591 90.2% 2263824 97.8%
GAU014 2862547 2441197 85.3% 2437344 99.8% 2440063 100.0% 2214815 90.9% 2179071 98.4%
GAU015 1768626 1669070 94.4% 1666719 99.9% 1667463 99.9% 1561142 93.7% 1497232 95.9%
GAU017 1307686 1173397 89.7% 1171754 99.9% 1172289 99.9% 1086427 92.7% 1058674 97.4%
GAU019 1726282 1546136 89.6% 1545384 100.0% 1545570 100.0% 1410556 91.3% 1394794 98.9%
GAU022 1969344 1856776 94.3% 1855450 99.9% 1854795 99.9% 1693372 91.3% 1659528 98.0%
GAU024 2128891 1857205 87.2% 1856121 99.9% 1851138 99.7% 1680081 90.5% 1656866 98.6%
GAU029 1945713 1644488 84.5% 1638064 99.6% 1637888 99.6% 1580704 96.5% 1469827 93.0%
GAU033 1656478 1466143 88.5% 1460380 99.6% 1458981 99.5% 1411894 96.7% 1324986 93.8%
GAU034 1987086 1688097 85.0% 1687322 100.0% 1687627 100.0% 1546319 91.6% 1531304 99.0%

Given this, I moved forward with taxonomic assignment using sintax in vsearch with a cutoff of 0.8, but got a relatively low percentage of assignments at the species/genus/family level (12-18%), even though I was using recent, broad (all-Eukaryote) reference datasets; the exception was the plants with ITS, which gave 84% of assignments at these taxonomic levels.

Due to the low assignment percentage and the high number of ASVs, we are wondering if something went wrong with the trimming in cutadapt or with DADA2. After reading issue #1609, it seems that one problem might be that using cutadapt resulted in length variation among the reads, but I am not sure how I would trim the primers + Nextera transposase sequence + polyG tails without it. Indeed, there is substantial length variation in the ASVs for all my primers (the example below is for 12S after chimera removal):

185 186 187 188 189 190 191 192 193 194 195 196 197
19 87 31 142 50 100 34 10 18 78 68 28 19
198 199 200 201 202 203 204 205 206 207 208 209 210
87 647 212 289 109 820 404 20 133 10 153 5 4
211 212 213 214 215 216 217 218 219 221 222 223 224
1 4 3 1 3 2 1 5 1 6 2 9 18
225 226 227 228 229 230 231 232 233 234 235 236 237
23 244 19 12 1 5 7 4 6 9 1 1 5
238 239 240 241 242 244 245 246 247 248 249 250 251
3 9 1 7 8 8 4 3 7 42 11 26 47
252 253 254 255 256 257 258 259 260 261 262 263 264
32 108 176 450 24824 2776 246 159 132 27 50 42 16
265 266 267 268 269 270 271 272 273 274 275 276 277
6 5 10 7 3 1 1 5 3 11 6 5 3
278 279 280 281 282 283 284 285 286 287 288 289 290
6 2 1 4 6 4 1 1 4 5 2 2 1
292 293 294 295 296 297 298 299 300 301 302 303 304
1 1 3 1 2 4 2 1 3 2 3 1 2
306 307 308 309 310 311 312 313 315 316 317 318 319
4 4 3 3 4 1 7 8 7 3 2 1 2
321 324 325 326 327 328 330 331 333 335 336 337 338
2 1 1 2 3 1 1 3 2 4 4 3 1
341 342 345 346 347 348 350 354 356 357 358    
1 1 3 1 2 3 2 4 1 1 1    

I am still unsure whether I should interpret these results as an indication of spurious ASVs (thus resulting in the low % of assignment) due to noise introduced by the trimming with cutadapt, or whether they could also be explained by the deep sequencing and the expected complexity of the communities sampled.

Your input would be very helpful, thank you. Mathilde Salamon

benjjneb commented 1 month ago

First, "quality trimming" via cutadapt (which is the per-read trimming of tails of reads based on the read-specific quality scores) is NOT recommended for working with DADA2. It is better to handle that with a consistent across-reads truncation length, as per the truncLen approach in filterAndTrim.

That said, given the high level at which reads are passing the denoising step, this is probably not a major problem in your data.

With deep sequencing can come many types of spurious ASVs from different sources, including rare off-target amplification, rare amplification issues, and rare library artefacts, among others. But if these things are rare, they are not necessarily much of a concern. The first thing to check is whether that is the case: Are these unexpected ASVs (the ones outside the expected length range or taxonomic ID) typically rare? You've shown a table of all ASVs, but what if you weight this by their abundance? Is it almost all as expected? A simple follow-on solution if that is the case is to filter the data based on the known length distribution of the targeted amplicon to remove much of this.
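
For example, a minimal sketch of that abundance-weighted check (assuming the chimera-removed sequence table is named seqtab.nochim):

```r
library(dada2)

# Length distribution counted by number of ASVs (what was posted above)
table(nchar(getSequences(seqtab.nochim)))

# The same distribution weighted by read abundance: sum the reads assigned
# to ASVs of each length, then express each length as a fraction of all reads
reads.by.len <- tapply(colSums(seqtab.nochim), nchar(getSequences(seqtab.nochim)), sum)
round(reads.by.len / sum(reads.by.len), 4)
```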

In short, I don't see any obvious issue with your data from what you've posted. You probably have some spurious off-target stuff in your data at very low abundances. When looking at unique ASV counts, that can seem significant, but when accounting for abundance it usually isn't. But certainly check whether my guess is the case.

msalamon2 commented 1 month ago

Hello Benjamin,

Thank you for your fast answer and for having a look at my data.

I knew that quality trimming with cutadapt was not recommended with DADA2, but I had not realized that the --nextseq-trim=20 flag would do that. I am rerunning everything without this parameter and will also use the option method="pooled" in the chimera removal step as I used the option pool="pseudo" for the denoising.

I will do the diagnostic that you proposed and use the ASV length filtering if needed, and report back the results.

Thanks, Mathilde

benjjneb commented 1 month ago

and will also use the option method="pooled" in the chimera removal step as I used the option pool="pseudo" for the denoising.

Don't do this actually! We should improve our documentation on this. method="pooled" for chimera removal should only be used when dada(..., pool=TRUE), not when dada(..., pool="pseudo").
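
In code terms, a minimal sketch of the pairing described here (object names are placeholders carried over from the standard tutorial):

```r
library(dada2)

# Denoising with pseudo-pooling (or the default per-sample pool = FALSE)...
dadaFs <- dada(filtFs, err = errF, pool = "pseudo", multithread = TRUE)
dadaRs <- dada(filtRs, err = errR, pool = "pseudo", multithread = TRUE)
mergers <- mergePairs(dadaFs, filtFs, dadaRs, filtRs)
seqtab <- makeSequenceTable(mergers)

# ...pairs with the default consensus chimera removal;
# method = "pooled" is only appropriate when dada() was run with pool = TRUE.
seqtab.nochim <- removeBimeraDenovo(seqtab, method = "consensus",
                                    multithread = TRUE, verbose = TRUE)
```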

msalamon2 commented 1 month ago

Hello Benjamin,

Thank you for pointing that out for the chimera removal step; I removed it from my script. I reran DADA2 after cutting only the primers in cutadapt (no quality trimming), keeping the --discard-untrimmed and -m 10 options. This did not really change my results, so I ran a test for 12S MiMammal (targeting vertebrates, insert size ~171 bp), for which I obtained 35K ASVs from 148M reads in total. I checked the ASVs against two criteria: uniqueness to each sample and length distribution.

I looked at ASVs unique to one sample, since that seemed to be the source of the problem in issue #1609, which was fixed by removing these unique ASVs. In my case, ASVs unique to one sample represent a relatively high share of the total number of reads (11%) and of the reads per sample (10% on average), depending on the sample considered. In addition, 14% of the unique ASVs are assigned at the species/genus/family level. I concluded that filtering out unique ASVs would lead to a loss of valuable information and is not a good filter in my case.

Checking plots of ASV length as a function of the number of reads per ASV (1 dot/ASV), there seem to be two main peaks (i.e. with numerous ASVs and reads), around 200 bp and 256 bp, with an intermediate peak at 226 bp. For the Chordates (given that my targets are vertebrates), most of the ASVs with the highest number of reads/ASV cluster between 185 and 215 bp. [plots of ASV length vs. reads per ASV] I investigated each peak in terms of number of ASVs, % of total reads and taxonomic assignments. Most ASVs > 215 bp appear to be off-target (non-vertebrates). Applying an ASV length filter of 185-215 bp (200 bp +/- 15 bp) would retain a large proportion of the reads (> 80%) and result in a larger % of ASVs assigned to species/genus/family (45%), most of them assigned as Chordates. It would also retain a high % of ASVs unique to each sample, which is important for my study.
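
For anyone reproducing this diagnostic, a hedged sketch of that kind of plot (assuming the chimera-removed sequence table is named seqtab.nochim):

```r
library(dada2)

lens <- nchar(getSequences(seqtab.nochim))   # ASV lengths
abund <- colSums(seqtab.nochim)              # total reads per ASV

# One point per ASV: length vs. read count, log scale on the y axis
plot(lens, abund, log = "y",
     xlab = "ASV length (bp)", ylab = "Reads per ASV")
```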

However, I am unsure about the length filtering: should I apply it after ASV inference and merging, as suggested in the tutorial with seqtab2 <- seqtab[,nchar(colnames(seqtab)) %in% 180:220], or during the filterAndTrim step before ASV inference? My issue is that the median ASV length seems to be ~30 bp larger than the expected insert size (200 bp vs 171 bp), so I am not sure how I would justify that. Is there a reason why the filtering is done after ASV inference and merging in the tutorial and not during the filterAndTrim step?

Thank you for your help, Mathilde Salamon

benjjneb commented 1 month ago

My issue is that the median ASV length seems to be ~30 bp larger than the expected insert size (200 bp vs 171 bp), so I am not sure how I would justify that.

If the expected insert size is 171 bp and you truncated forward and reverse reads to 185 bp, then you will have ~14 x 2 = 28 bp of "read-through", because you are keeping bases past the point where the sequenced amplicon ends. This will make the size of merged ASVs ~171 + 28 = 199 bp, with the read-through accounting for the difference.

Maybe this is a numeric coincidence, but if the expected insert size is 171 bp, then your truncLen should be < 171, to avoid keeping bases that are past the targeted amplicon and instead read into the opposite primer/adapter etc.
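
As a sketch of that adjustment (170/170 is only an illustration of "below 171 bp", not a specific recommendation; the other parameters are carried over from the filterAndTrim call earlier in this thread):

```r
# Truncate both reads below the ~171 bp expected amplicon length so that
# neither read extends past the end of the insert into the opposite
# primer/adapter. Any truncLen < 171 that still leaves enough overlap
# for merging would work.
out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs,
                     truncLen = c(170, 170),
                     maxN = 0, maxEE = c(2, 2), truncQ = 2,
                     rm.phix = TRUE, compress = TRUE, multithread = TRUE)
```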

msalamon2 commented 1 month ago

Hello Benjamin,

Thank you for your explanation, I understand what is happening now. Would it make sense to filter reads between 140-200 bp (171 bp +/- 30 bp) with cutadapt before proceeding with the truncation at 171 bp in DADA2? This seems like the only solution to get rid of the large number of non-target ASVs > 200 bp.

Thank you, Mathilde

benjjneb commented 1 week ago

Sorry for the late response.

Would it make sense to filter reads between 140-200 bp (171 bp +/- 30 bp) with cutadapt before proceeding with the truncation at 171 bp in DADA2?

That is reasonable.

This seems like the only solution to get rid of the large number of non-target ASVs > 200 bp.

There are other solutions as well, though, such as removing off-target lengths from the final ASV table (see "cutting a band" in-silico in the DADA2 tutorial: https://benjjneb.github.io/dada2/tutorial.html).
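
For completeness, a sketch of that in-silico band cut, extended to report how much it retains (the 185-215 bp window is the one proposed earlier in this thread; seqtab.nochim is assumed to be the chimera-removed sequence table):

```r
library(dada2)

# Keep only ASVs whose length falls in the expected window
keep <- nchar(getSequences(seqtab.nochim)) %in% 185:215
seqtab.len <- seqtab.nochim[, keep]

# Fraction of ASVs and of total reads retained by the length filter
mean(keep)
sum(seqtab.len) / sum(seqtab.nochim)
```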