benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0

intuition for appropriate filtering parameters #236

Open wdwvt1 opened 7 years ago

wdwvt1 commented 7 years ago

Working with several different sequencing runs of various qualities I am trying to develop intuition for what reasonable quality filtration parameters are in the function filterAndTrim. The relevant parameters are truncLen, maxEE, and maxN (based on my reading).

Looking through closed issues, I would summarize the guidance as follows:

- Thread: Maintain at least 10% of raw reads after all filtration steps (filterAndTrim as well as downstream merging). Use maxEE as the primary filtering parameter; values from 2 to 6 are shown in various issues. Run a subset of samples through the entire pipeline to determine if the full pipeline produces too few features (a read-tracking sketch follows this list).
- Thread: 60% of reads passing filterAndTrim is good.
- Thread: Finding the best trimming location may just mean doing a grid search.
- Thread: For merging reads, don't go below 8 nucleotides of overlap.
- Thread: Chimeras should generally account for less than 30% of sequences.
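
As a quick way to act on the "run a subset through the entire pipeline" advice, here is a read-tracking sketch in the style of the dada2 tutorial (out, dadaFs, mergers, and seqtab.nochim are the tutorial's object names, assumed to already exist):

# Count reads surviving each pipeline stage; flag any step with a sharp drop
getN <- function(x) sum(getUniques(x))
track <- cbind(out,
               denoisedF = sapply(dadaFs, getN),
               merged    = sapply(mergers, getN),
               nonchim   = rowSums(seqtab.nochim))
head(track)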

My situation is the following: 300 bp paired-end reads (full overlap of forward and reverse reads), with forward reads generally much higher quality than reverse reads. (Forward and reverse read quality-profile images attached.)

Forward and reverse reads seem like they ought to successfully span the full amplicon at fairly high quality using truncLen=c(225, 125). However, using truncLen=c(225, 125), maxEE=c(2,2), truncQ=11, I am losing 55-65% of sequences on average.

Two questions:

  1. Is this a reasonable number of sequences to lose at the filter step? Do I need to do a grid search through truncLen, maxEE, and truncQ with some objective function that maximizes retained sequences without letting quality drop below a certain value? What maxEE/minimum truncQ would you use during this search? (A minimal sketch of such a search follows these questions.)

  2. In your opinion, when you have fully overlapping amplicons, is paired end merging worthwhile? On one hand, given that forward reads will have low quality towards the end and reverse reads high quality at the beginning, it seems that merging ought to give me full amplicons at reasonable confidence. On the other hand, if I just used the entire forward read (or even the first 250 nt), would I capture about 90% of what I'd get with a 300bp amplicon, and significantly reduce my computational headaches?
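
To make the grid-search idea in question 1 concrete, here is a minimal sketch; the file names and parameter grid are hypothetical, and the objective here is simply the fraction of read pairs retained:

library(dada2)

fnF <- "sample_R1.fastq.gz"   # hypothetical forward-read file
fnR <- "sample_R2.fastq.gz"   # hypothetical reverse-read file

grid <- expand.grid(truncF = c(200, 225, 250),
                    truncR = c(100, 125, 150),
                    maxEE  = c(2, 4, 6))

grid$retained <- apply(grid, 1, function(p) {
  out <- filterAndTrim(fnF, tempfile(fileext = ".fastq.gz"),
                       fnR, tempfile(fileext = ".fastq.gz"),
                       truncLen = c(p[["truncF"]], p[["truncR"]]),
                       maxEE = c(p[["maxEE"]], p[["maxEE"]]),
                       truncQ = 2, maxN = 0, rm.phix = TRUE)
  out[1, "reads.out"] / out[1, "reads.in"]  # fraction of read pairs kept
})

grid[order(-grid$retained), ]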

As usual, thanks for the great tool and the helpful feedback. The issues really are nice to have around to get a handle for the thinking behind the parameters etc.

fanli-gcb commented 7 years ago

It looks like you have several things going on here. First off, the 600 cycle kits are not recommended due to quality issues: http://seqanswers.com/forums/showthread.php?t=73926

Second, how long is your amplicon? Your quality profiles suggest that you may be sequencing off the end of the template entirely.

To answer your questions directly, I would be okay with losing 55-65% of the sequences given these quality profiles. Depending on your environment, shallow-ish sequencing depth is usually enough to reach saturation anyways (we typically target 100k/sample for gut microbiome). In fact, my first pass would have been to use truncLen=c(175, 100), but that depends on the amplicon length.

I think in general it is a good idea to do paired end merging as this gives you another opportunity to error correct.

Interesting idea with the grid search, but I'm not sure what sort of objective function would be helpful. Is maximum sequence count actually what you want? Or concordance with a known mock community, for example?

wdwvt1 commented 7 years ago

Thanks for the response @fanli-gcb.

Second, how long is your amplicon? Your quality profiles suggest that you may be sequencing off the end of the template entirely.

The amplicon is 300 bp, so not sequencing off the end of the template.

To answer your questions directly, I would be okay with losing 55-65% of the sequences given these quality profiles. Depending on your environment, shallow-ish sequencing depth is usually enough to reach saturation anyways (we typically target 100k/sample for gut microbiome).

This is from a gut environment, and I'd be happy with 20,000 seqs/sample, but given the high level of multiplexing on this run I don't have a lot of wiggle room if I lose 60% of my sequences off the bat.

In fact, my first pass would have been to use truncLen=c(175, 100), but that depends on the amplicon length.

Using these parameters I lose 40-50% of sequences. Certainly better, but still seems high. However, given that you are comfortable with this level, this might be the way I go.

It seems like the truncQ parameter is mostly what's controlling my read retention. If I turn truncQ down to 2, I retain 90% of my sequences. What is the maximum truncQ you'd be comfortable with?

fanli-gcb commented 7 years ago

It seems like the truncQ parameter is mostly what's controlling my read retention. If I turn truncQ down to 2, I retain 90% of my sequences. What is the maximum truncQ you'd be comfortable with?

truncQ and truncLen shouldn't affect the fraction of sequences retained unless the truncated reads are no longer at least minLen in length. Since the default is minLen=20, I don't see how truncQ=11 would make much of a difference in your data.

Personally, I don't place much stock in truncQ, but rely on maxEE as the primary filter. Sometimes you have positions with lower quality scores due to lower base diversity.

Think about it this way: if, for example, I see Q=9 at base 19 for a given read, I'm not necessarily going to think that all the subsequent bases are bad. I'd rather allow the read through and let the error modeling and read merging correct that base for me.
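
For intuition on why maxEE tolerates an isolated low-quality base: maxEE sums the per-base error probabilities implied by the quality scores, so one bad base barely moves the total. A toy calculation:

# Expected errors: EE = sum(10^(-Q/10)) over the read
q <- c(rep(35, 18), 9, rep(35, 81))  # 100 nt read with a single Q=9 base at position 19
sum(10^(-q / 10))                    # ~0.16 expected errors; easily passes maxEE=2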

This is from a gut environment, and I'd be happy with 20,000 seqs/sample, but given the high level of multiplexing on this run I don't have a lot of wiggle room if I lose 60% of my sequences off the bat.

Any chance you can resequence? That's preferable to, and cheaper than, drowning out your biological signal with bad data. :P

wdwvt1 commented 7 years ago

truncQ and truncLen shouldn't affect the fraction of sequences retained unless the truncated reads are no longer at least minLen in length. Since the default is minLen=20, I don't see how truncQ=11 would make much of a difference in your data.

I am not sure why it would either, but it certainly makes a much larger difference than maxEE. I don't think I understand what is happening with this parameter. For example, using the commands listed below on my dataset, I get the following:

| truncQ | minLen | output sequences |
| --- | --- | --- |
| 2 | 20 | 42350 |
| 11 | 20 | 18237 |
| 25 | 20 | 5371 |
| 2 | 100 | 42350 |
| 11 | 100 | 18237 |
| 25 | 100 | 5371 |
| 2 | 126 | 0 |
| 11 | 126 | 0 |
| 25 | 126 | 0 |

When minLen is 126 I get no sequences. That makes sense: I am truncating the reverse read at 125. However, the fact that the number of output sequences is identical at every other minLen is very strange. For instance, with truncQ=2, whether I set minLen to 20 or 100 I get the same number of sequences, which would mean that no read hits its first quality-2 base anywhere between positions 20 and 100. The same phenomenon occurs at truncQ=25 and truncQ=30 (not shown). This seems unlikely.

I am using R 3.3.1, dada2 1.3.5, and Rcpp 0.12.10.

sessionInfo()
R version 3.3.1 (2016-06-21)
...
other attached packages:
[1] dada2_1.3.5  Rcpp_0.12.10

# minLen=20
# truncQ=2
filterAndTrim(fwd=file.path(fwd_path, fwd_fastq), filt=file.path(fwd_filt_path, fwd_fastq), rev=file.path(rev_path, rev_fastq), filt.rev=file.path(rev_filt_path, rev_fastq), truncLen=c(225, 125), maxEE=c(2,2), truncQ=2, maxN=0, rm.phix=TRUE, compress=TRUE, verbose=TRUE, multithread=FALSE, minLen=20)
Read in 46994 paired-sequences, output 42350 (90.1%) filtered paired-sequences.

# minLen=100
# truncQ=2
filterAndTrim(fwd=file.path(fwd_path, fwd_fastq), filt=file.path(fwd_filt_path, fwd_fastq), rev=file.path(rev_path, rev_fastq), filt.rev=file.path(rev_filt_path, rev_fastq), truncLen=c(225, 125), maxEE=c(2,2), truncQ=2, maxN=0, rm.phix=TRUE, compress=TRUE, verbose=TRUE, multithread=FALSE, minLen=100)
Read in 46994 paired-sequences, output 42350 (90.1%) filtered paired-sequences.

# minLen=126
# truncQ=2
filterAndTrim(fwd=file.path(fwd_path, fwd_fastq), filt=file.path(fwd_filt_path, fwd_fastq), rev=file.path(rev_path, rev_fastq), filt.rev=file.path(rev_filt_path, rev_fastq), truncLen=c(225, 125), maxEE=c(2,2), truncQ=2, maxN=0, rm.phix=TRUE, compress=TRUE, verbose=TRUE, multithread=FALSE, minLen=126)
Read in 46994 paired-sequences, output 0 (0%) filtered paired-sequences.

# minLen=20
# truncQ=11
filterAndTrim(fwd=file.path(fwd_path, fwd_fastq), filt=file.path(fwd_filt_path, fwd_fastq), rev=file.path(rev_path, rev_fastq), filt.rev=file.path(rev_filt_path, rev_fastq), truncLen=c(225, 125), maxEE=c(2,2), truncQ=11, maxN=0, rm.phix=TRUE, compress=TRUE, verbose=TRUE, multithread=FALSE, minLen=20)
Read in 46994 paired-sequences, output 18237 (38.8%) filtered paired-sequences.

# minLen=100
# truncQ=11
filterAndTrim(fwd=file.path(fwd_path, fwd_fastq), filt=file.path(fwd_filt_path, fwd_fastq), rev=file.path(rev_path, rev_fastq), filt.rev=file.path(rev_filt_path, rev_fastq), truncLen=c(225, 125), maxEE=c(2,2), truncQ=11, maxN=0, rm.phix=TRUE, compress=TRUE, verbose=TRUE, multithread=FALSE, minLen=100)
Read in 46994 paired-sequences, output 18237 (38.8%) filtered paired-sequences.

# minLen=126
# truncQ=11
filterAndTrim(fwd=file.path(fwd_path, fwd_fastq), filt=file.path(fwd_filt_path, fwd_fastq), rev=file.path(rev_path, rev_fastq), filt.rev=file.path(rev_filt_path, rev_fastq), truncLen=c(225, 125), maxEE=c(2,2), truncQ=11, maxN=0, rm.phix=TRUE, compress=TRUE, verbose=TRUE, multithread=FALSE, minLen=126)
Read in 46994 paired-sequences, output 0 (0%) filtered paired-sequences.

# minLen=20
# truncQ=25
filterAndTrim(fwd=file.path(fwd_path, fwd_fastq), filt=file.path(fwd_filt_path, fwd_fastq), rev=file.path(rev_path, rev_fastq), filt.rev=file.path(rev_filt_path, rev_fastq), truncLen=c(225, 125), maxEE=c(2,2), truncQ=25, maxN=0, rm.phix=TRUE, compress=TRUE, verbose=TRUE, multithread=FALSE, minLen=20)
Read in 46994 paired-sequences, output 5371 (11.4%) filtered paired-sequences.

# minLen=100
# truncQ=25
filterAndTrim(fwd=file.path(fwd_path, fwd_fastq), filt=file.path(fwd_filt_path, fwd_fastq), rev=file.path(rev_path, rev_fastq), filt.rev=file.path(rev_filt_path, rev_fastq), truncLen=c(225, 125), maxEE=c(2,2), truncQ=25, maxN=0, rm.phix=TRUE, compress=TRUE, verbose=TRUE, multithread=FALSE, minLen=100)
Read in 46994 paired-sequences, output 5371 (11.4%) filtered paired-sequences.

# minLen=126
# truncQ=25
filterAndTrim(fwd=file.path(fwd_path, fwd_fastq), filt=file.path(fwd_filt_path, fwd_fastq), rev=file.path(rev_path, rev_fastq), filt.rev=file.path(rev_filt_path, rev_fastq), truncLen=c(225, 125), maxEE=c(2,2), truncQ=25, maxN=0, rm.phix=TRUE, compress=TRUE, verbose=TRUE, multithread=FALSE, minLen=126)
Read in 62583 paired-sequences, output 0 (0%) filtered paired-sequences.

fanli-gcb commented 7 years ago

Agree, definitely seems unlikely. Can you share a fastq file that produces this behavior? I'm looking through the fastqFilter code, and it seems to me that minLen is being correctly enforced.

wdwvt1 commented 7 years ago

@fanli-gcb - thanks very much for taking a look. Where should I send the files?

fanli-gcb commented 7 years ago

You can directly attach a small gzipped fastq file here. Perhaps a small reproducible example would be best.

benjjneb commented 7 years ago

I could probably write several paragraphs on my opinions on the best way to filter, but I think the most important thing to understand is that there are multiple right ways to filter, and the number one goal is to avoid doing it wrong. Optimizing is nice, but as @fanli-gcb mentioned, it is not always obvious what the right objective function to optimize is.

On the minLen parameter: minLen is redundant with truncLen. truncLen both truncates to the specified length and throws away any sequences shorter than truncLen; minLen does just the second part of that. For Illumina 16S sequencing, I don't think there is any reason to use minLen.
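
In other words, once truncLen is enforced, any minLen at or below the truncation lengths is a no-op. A sketch with hypothetical file arguments:

# truncLen=c(225,125) already discards reads shorter than 225 (fwd) or 125 (rev),
# so these two calls behave identically (minLen=20 is also the default):
filterAndTrim(fnF, filtF, fnR, filtR, truncLen = c(225, 125), minLen = 20)
filterAndTrim(fnF, filtF, fnR, filtR, truncLen = c(225, 125))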

On setting quality parameters: In general we recommend using maxEE as the primary quality filter. truncQ is mostly there to remove very low-quality sequences, and is included as a default in part because in older Illumina software (Casava <1.8) Q=2 was a special value that indicated you should stop using the sequence.

In general I (1) pick truncLen parameters that avoid the worst parts of the quality profiles but ensure that enough sequence is kept to healthily overlap (truncLen[[1]] + truncLen[[2]] > amplicon_length+25), and err a bit on the more sequence side, (2) leave truncQ=2, and (3) try a couple maxEE values until I get a satisfactory number of reads through the filter. Eyeballing the posted profile, I might start with (truncLen=c(240,160), truncQ=2, maxEE=2) and then potentially relax maxEE a bit.
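
As a concrete starting point along those lines (using the tutorial's usual file-name objects, assumed to exist):

out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs,
                     truncLen = c(240, 160),  # cut before the worst of the quality crash, keep overlap
                     truncQ = 2,              # left at the default
                     maxEE = c(2, 2),         # primary filter; relax a bit if too few reads pass
                     maxN = 0, rm.phix = TRUE,
                     compress = TRUE, multithread = TRUE)
head(out)  # compare reads.in vs reads.out per sample before committing to the full run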

On using just the forward reads: That is totally reasonable; in our tests dada2 has done really well with just forward reads, and when reverse reads are bad enough they will cost more in sensitivity to low-frequency variants than they add in a lower false-positive rate. Here you have enough high-quality reverse sequence that I think the reverse reads are worth using, but truncating the forward reads at, say, 275 would also work well.

A final consideration: It is easier to combine data from different studies using the same primer set if you keep the whole amplicon. So merging the paired reads makes the processed data a bit more reusable in the future.

wdwvt1 commented 7 years ago

@fanli-gcb @benjjneb Thanks very much for the suggestions; this was very helpful.

I could probably write several paragraphs on my opinions on the best way to filter, but I think the most important thing to understand is that there are multiple right ways to filter, and the number one goal is to avoid doing it wrong. Optimizing is nice, but as @fanli-gcb mentioned, it is not always obvious what the right objective function to optimize is.

Is this something you are planning on doing at some point? I think it would be a tremendous help to the community. One of my frustrations in writing code for QIIME 1 was always that we had insufficient intuition-building documentation/examples. Knowing what a parameter mechanically does is very distinct from knowing where varying that parameter falls in the decision tree.

benjjneb commented 7 years ago

Is this something you are planning on doing at some point? I think it would be a tremendous help to the community. One of my frustrations in writing code for QIIME 1 was always that we had insufficient intuition-building documentation/examples. Knowing what a parameter mechanically does is very distinct from knowing where varying that parameter falls in the decision tree.

It's a good idea.

fanli-gcb commented 7 years ago

A final consideration: It is easier to combine data from different studies using the same primer set if you keep the whole amplicon. So merging the paired reads makes the processed data a bit more reusable in the future.

I was just thinking about this today while merging a few runs together. Also from the big data tutorial:

Note: The trimming parameters of different runs should be identical if you want to simply merge them together later (otherwise the sequences aren’t directly comparable).

Do you think the exact trimming parameters need to be identical to merge runs, or is it sufficient that the merged amplicons represent the same region of the 16S gene? In other words, I think you'd need trimLeft to be consistent, but truncLen could vary as long as you have enough remaining to form merged contigs.

benjjneb commented 7 years ago

Do you think the exact trimming parameters need to be identical to merge runs, or is it sufficient that the merged amplicons represent the same region of the 16S gene? In other words, I think you'd need trimLeft to be consistent, but truncLen could vary as long as you have enough remaining to form merged contigs.

Yep, you are right. Updated the Big Data tutorial as suggested: 8e051c44e44e5beaf221632f5dc5d435831211e0
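
For later readers, the reason this works: the merged sequence tables index full amplicon sequences, so runs trimmed with different truncLen can be combined after merging. A sketch (the per-run merger objects are hypothetical names):

# Each run is filtered/denoised/merged separately; truncLen may differ by run,
# but primer removal (trimLeft) must be consistent so the amplicons line up
st1 <- makeSequenceTable(mergers_run1)
st2 <- makeSequenceTable(mergers_run2)
st.all <- mergeSequenceTables(st1, st2)
seqtab.nochim <- removeBimeraDenovo(st.all, method = "consensus", multithread = TRUE)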

joey711 commented 7 years ago

If you're merging paired reads from both ends, then the requirement for merging abundance data across several runs is only that they are amplified from the same locus, since the read-pair merge implies that the whole amplicon is included, regardless of the specific trimming parameters.

Different filtering parameters applied to different runs could introduce run-specific bias, so you probably should not vary those.

There's always a consideration of merging "shiftmeras", sequences that are actually the same except for a small shift on either end. This is true even within a run, though.

SimonMorvan commented 6 years ago

I started my MSc in September, and I have to use a pipeline for 2x300 MiSeq data (V3-V4 & ITS) from the soil microbiome. I've been using the DADA2 pipeline; as I was beginning to use R, I found it very convenient (I have no experience with other pipelines).

I've been searching for the appropriate filter & trim parameters, and the multiple answers I found, thanks to this GitHub, helped me. I am still wondering if there is some sort of Q-score threshold that is usually used as an indicator of a good filter & trim procedure.

What I mean is, what should a filtFs / filtRs quality profile look like?

benjjneb commented 6 years ago

@SimonMorvan Sorry missed this one a little bit.

There is no one threshold, and the trimming params are something that are best chosen by looking at the quality profile of your data and picking what's appropriate.

The goal is just to pick something that works well -- usually there is a range of right choices. In order, the things that you need to meet are: (1) maintain overlap after truncation: if you don't have 20 nts of overlap after truncLen is enforced, the merging will fail; (2) truncate off the low-quality tails: quality crashes are common in Illumina sequencing, especially in reverse reads, and it is best to trim those off as much as you can.
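
Rule (1) is simple arithmetic; a back-of-the-envelope check (the amplicon length below is a made-up example):

amplicon_len <- 500           # hypothetical sequenced amplicon length (primers included if still on reads)
truncLen <- c(280, 255)
sum(truncLen) - amplicon_len  # ~35 nt of overlap; want >= 20, with margin for length variation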

SimonMorvan commented 6 years ago

Thanks for your answer @benjjneb. My forward reads after filt&trim have a minimum Q score of around 30, while my reverse reads have a minimum Q score of around 25. In the end, I have 50,000-60,000 non-chimeric reads. Should I be more stringent with the filt&trim parameters?

benjjneb commented 6 years ago

Can you state the basic information on your amplicon sequencing setup?

What is the size of your amplicon (or what primers are you using)? Are the primers on the reads? What are the plotQualityProfile plots from the F/R reads?

SimonMorvan commented 6 years ago

@benjjneb Illumina MiSeq 2x300; amplified region: 16S V3-V4.

Primers: forward 305F (CCTACGGGNGGCWGCAG, 17 nt); reverse 801R (GACTACHVGGGTATCTAATCC, 21 nt). I removed them as advised in the chimera-removal section.

The plotQualityProfile plots from the raw reads: (forward and reverse quality-profile images attached)

Filter the forward and reverse reads

out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs, 
                     truncLen=c(275,250), ## Cuts the forward and reverse reads at desired lengths
                     trimLeft=c(17,21), ## Removes the primers 
                     maxN=0,
                     maxEE=c(3,5), 
                     truncQ=2, 
                     rm.phix=TRUE, 
                     compress=TRUE, 
                     multithread=TRUE) ## On Windows set multithread=FALSE

The plotQualityProfile plots from the filtered and trimmed reads: (forward and reverse quality-profile images attached)

mergers20 <- mergePairs(dadaFs, derepFs, dadaRs, derepRs, minOverlap=20, verbose=TRUE)

seqtab.nochim <- removeBimeraDenovo(seqtab, method="consensus", multithread=TRUE, verbose=TRUE)

42.3% of sequences were chimeric, which accounts for about 16.4% of the total sequence reads.

> head(track)
       input filtered denoised merged tabled nonchim
SB_8  168857   107518   107518  68189  68189   54476
SB_27 168075    95209    95209  66930  66930   60890
SB_33 181031   115789   115789  74811  74811   60120

Thank you for your time!

fanli-gcb commented 6 years ago

That looks pretty reasonable to me. There isn't much more trimming you can do, as it appears you are at the limit for 20 bp of overlap. The maxEE setting for R2 is a bit high IMO - I'd try setting it to 3 as well and see how many more sequences you lose. I generally prefer to be conservative and discard questionable sequences, and it appears you have plenty of data left over anyway.

benjjneb commented 6 years ago

Agree with @fanli-gcb that looks pretty reasonable.

The one thing I might be slightly concerned about is the fraction of reads successfully merging (only ~65%), and whether some are being lost to insufficient overlap. Your sequenced amplicon is about 500 nts and you've truncated to 275+250=525, which leaves only ~25 nts beyond the amplicon length, close to the 20 nt minimum overlap, so shorter-than-average sequences might be getting lost in merging.

I'd try the following on a subset of samples, and see if a larger fraction of reads get merged: truncLen=c(280,255) and minOverlap=12. If not, I'd just go with what you've got already.
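
That is, something along these lines on a few samples (the intermediate error-learning and denoising steps are elided; object names follow the tutorial):

out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs,
                     truncLen = c(280, 255), trimLeft = c(17, 21),
                     maxEE = c(3, 5), truncQ = 2, maxN = 0,
                     rm.phix = TRUE, compress = TRUE, multithread = TRUE)
# ... learnErrors / derepFastq / dada as before ...
mergers <- mergePairs(dadaFs, derepFs, dadaRs, derepRs, minOverlap = 12, verbose = TRUE)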

SimonMorvan commented 6 years ago

@fanli-gcb @benjjneb
This is what I obtain using a 20 nt minimum overlap. But as you can see, I obtained the exact same results using a 50 nt minimum overlap.

summary(mergers20$SB_27$nmatch)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   57.0    63.0    84.0    76.3    85.0   103.0

summary(mergers20$SB_33$nmatch)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  57.00   71.00   85.00   78.19   85.00  153.00

summary(mergers20$SB_8$nmatch)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  57.00   73.00   85.00   79.21   85.00  149.00

SimonMorvan commented 6 years ago

Hello @fanli-gcb @benjjneb, I applied the modifications that you advised. Here is what I obtained.

Method 1: In this method, I used truncLen=c(280,255) and minOverlap=12 (versus truncLen=c(275,250) and minOverlap=20 previously). I left maxEE unchanged at c(3,5).

As you predicted @benjjneb, the proportion of successfully merged reads increased, though it seems it wasn't due to the reduced minimum overlap, as the smallest observed overlap in the 3 samples is 67 nt.

(Screenshots attached, including merging output.)

Final:

|  | input | Filter and denoised | %/input | Merged | %/input | %/filtered | Non-chimeric | %/input | %/merged |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SB8 | 168857 | 105765 | 57% | 66736 | 40% | 70% | 54309 | 32% | 80% |
| SB27 | 168075 | 93654 | 51% | 65707 | 40% | 78% | 59993 | 36% | 91% |
| SB33 | 181031 | 113858 | 57% | 72955 | 41% | 72% | 59878 | 33% | 80% |

vs. what I had before:

|  | input | Filter and denoised | %/input | Merged | %/input | %/filtered | Non-chimeric | %/input | %/merged |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SB8 | 168857 | 107518 | 64% | 68189 | 40% | 63% | 54476 | 32% | 80% |
| SB27 | 168075 | 95209 | 57% | 66930 | 40% | 70% | 60890 | 36% | 91% |
| SB33 | 181031 | 115789 | 64% | 74811 | 41% | 65% | 60120 | 33% | 80% |

Method 2:

In this method, I reduced the maxEE for the reverse reads to maxEE=c(3,3), as @fanli-gcb suggested, and otherwise used the same parameters as Method 1 (truncLen=c(280,255) and minOverlap=12).

With this method I lose some reads, but the proportions kept at each step are roughly the same as in Method 1. The quality must have improved, although it is not really visible on the quality profiles.

(Screenshots attached: forward filtered reads, reverse filtered reads, and merging output.)

Final:

|  | input | Filter and denoised | %/input | Merged | %/input | %/filtered | Non-chimeric | %/input | %/merged |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SB8 | 168857 | 96832 | 57% | 61152 | 36% | 63% | 50005 | 30% | 82% |
| SB27 | 168075 | 85938 | 51% | 60393 | 36% | 70% | 55300 | 33% | 92% |
| SB33 | 181031 | 103932 | 57% | 66978 | 37% | 64% | 55832 | 31% | 83% |

vs. what I had before:

|  | input | Filter and denoised | %/input | Merged | %/input | %/filtered | Non-chimeric | %/input | %/merged |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SB8 | 168857 | 107518 | 64% | 68189 | 40% | 63% | 54476 | 32% | 80% |
| SB27 | 168075 | 95209 | 57% | 66930 | 40% | 70% | 60890 | 36% | 91% |
| SB33 | 181031 | 115789 | 64% | 74811 | 41% | 65% | 60120 | 33% | 80% |

In your opinion, is there one "better" than the other?

jeremycfd commented 6 years ago

Hi @benjjneb, I came to this thread searching for info on how truncLen treats sequences shorter than truncLen, and found your explanation:

truncLen both truncates to the specified length and throws away any sequences shorter than truncLen.

Is there a particular reason that truncLen throws out such sequences? I've noticed that some samples contain amplicons much shorter than the majority; they turn out to be host contamination, but I feel like I should exclude such contamination after taxonomy assignment rather than this early in the analysis. Do you have any thoughts about that?

Thanks!

benjjneb commented 6 years ago

Is there a particular reason that truncLen throws out such sequences?

Good reason: That is usually the desirable behavior and in general we would recommend it.

Historical reason: The original version of dada2 required same-length sequences, but that's no longer the case.

I've noticed that some samples contain amplicons much shorter than the majority; they turn out to be host contamination, but I feel like I should exclude such contamination after taxonomy assignment rather than this early in the analysis. Do you have any thoughts about that?

It's fair to want to do it that way, and I agree it can be useful in some cases. We are planning on adding a trimRight parameter to filtering, and I think this is another reason to do so: https://github.com/benjjneb/dada2/issues/471#issuecomment-389569776

rrohwer commented 6 years ago

Is there a way to truncate based on expected errors instead of minimum quality score?

I find (as above) that setting a minimum quality with truncQ is much more stringent and intuitively less relevant than filtering based on expected errors. However, maxEE filtering is dependent on the sequence length, because if you set your truncLen too long you will have a lot of expected errors due to poor-quality bases at the end. It would be nice to go along each read until you hit >=2 expected errors, and truncate the sequence there. Then you could choose truncLen based on the length distribution of only the not-terrible sequences, which would also make the median quality-score metric more relevant (not dragged down by seqs you will toss no matter what).

As is, I usually approximate this in multiple steps: choosing several truncLen options and seeing how many reads are retained for each at my maxEE cutoff. However, a "truncE" option would streamline this decision process and provide a view of sequence quality based only on reasonable reads.

benjjneb commented 6 years ago

I find (as above) that setting a minimum quality with truncQ is much more stringent and intuitively less relevant than filtering based on expected errors. However, maxEE filtering is dependent on the sequence length, because if you set your truncLen too long you will have a lot of expected errors due to poor-quality bases at the end. It would be nice to go along each read until you hit >=2 expected errors, and truncate the sequence there.

There is a reason that neither truncQ nor truncEE would be advisable as the main filtering strategy in most circumstances: the core algorithm relies on repeated observations of the exact same sequence to identify biological variants, and "ragged" filtering splits identical variants into multiple "truncation alleles" based on where exactly each read fell below, e.g., a truncEE cutoff. While the algorithm will recombine those alleles, it still loses sensitivity to rarer variants. For example, a sequence variant supported by 10 reads could be split into 10 different truncation alleles, and then it would not be detected, as each allele alone is a singleton.
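
A toy illustration of the truncation-allele effect (synthetic sequence, not real data):

variant <- paste(rep("ACGT", 60), collapse = "")  # one true 240 nt biological sequence
# Ragged EE-based truncation would cut identical reads at different positions:
reads <- sapply(c(240, 238, 231, 224), function(len) substr(variant, 1, len))
length(unique(reads))  # 4 distinct "truncation alleles" instead of 4 observations of 1 variant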

Then you could choose truncLen based on the length distribution of only the not-terrible sequences, which would also make the median quality-score metric more relevant (not dragged down by seqs you will toss no matter what).

As is, I usually approximate this in multiple steps: choosing several truncLen options and seeing how many reads are retained for each at my maxEE cutoff. However, a "truncE" option would streamline this decision process and provide a view of sequence quality based only on reasonable reads.

This is an interesting idea though. I think the way we would implement it would be as a maxEEstats-style informational function, i.e. it wouldn't be rolled into the filtering function (to avoid ragged filtering), but it would inform the choice of filtering parameters.

I like it, we should do something like this.
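
Until something like that exists in the package, a rough stand-in using ShortRead is sketched below; the fastq path is hypothetical, and it assumes raw reads of uniform length (subsample first for speed):

library(ShortRead)

fq <- readFastq("subsampled_R1.fastq.gz")    # hypothetical input file
qmat <- as(quality(fq), "matrix")            # per-cycle Phred scores, one row per read
ee <- t(apply(10^(-qmat / 10), 1, cumsum))   # cumulative expected errors along each read

# Fraction of reads that would pass maxEE=2 at each candidate truncation length
sapply(c(150, 200, 240, 275), function(len) mean(ee[, len] <= 2, na.rm = TRUE))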

adityabandla commented 6 years ago

USEARCH has had the maxEEstats function for quite some time now. I always use it to decide on truncLen and maxEE; of course, it would be great to have something similar as part of DADA2 as well.

RemiMaglione commented 6 years ago

Hi everyone, based on everything I read here, I wrote an R script for my lab mates several months ago: https://github.com/RemiMaglione/r-scripts/blob/master/Qual_vs_MaxEE_plot.R. It's a kind of decision tool that helps you visually decide what values to choose for both maxEE and truncLen. It's based on read quality extracted with FastQC, and it produces this kind of output: (plot image attached)

Since I never stopped using it, I thought I should share it with you. Enjoy!

@benjjneb: I didn't know if I should create a new thread; I thought it was a good idea to drop this into this one, since everything started here. Feel free to move it elsewhere ;) And btw, in the same spirit as this script, it could be interesting to have a dada2 function that draws a maxEE plot like plotQualityProfile(): plotMaxEEProfile(), or plotQualityProfile(..., maxEE=TRUE)?

fanli-gcb commented 6 years ago

Very cool plots @RemiMaglione

At least in the case of this particular data, my interpretation is that increasing maxEE doesn't buy you a whole lot of additional reads. Trimming would be much more effective.

Sad to see that the 2x300 bp reads are still so dicey.

benjjneb commented 6 years ago

@RemiMaglione That is pretty awesome.

it could be interesting to have a dada2 function that draws a maxEE plot like plotQualityProfile(): plotMaxEEProfile(), or plotQualityProfile(..., maxEE=TRUE)?

Agreed, your plots make a pretty compelling case for adding something like that.

my interpretation is that increasing maxEE doesn't buy you a whole lot of additional reads. Trimming would be much more effective.

This is usually the case, just due to the nature of the Illumina quality "crash" that often occurs at the end of reads (especially with the 2x300 kit).

LucianaPMora commented 3 years ago

Hello, your information has been very useful. In my case, I had to choose between keeping around 75% of the raw sequences through filterAndTrim with the reverse reads truncated at 120 b (bad quality), or keeping around 85% with the reverse reads truncated at 90 b. In the end, I chose the first option, so 75% of the raw sequences passed filterAndTrim. After merging, around 40-50% of the raw sequences remain. Do you think that is OK, or should I be less strict in the filtering steps? Could you point me to a good paper where I can read about this? My samples are bacterial 16S rRNA (Illumina MiSeq) and the amplicon length is 250 bp. Thank you all.

benjjneb commented 3 years ago

@LucianaPMora There's nothing wrong with stricter filtering when you are only giving up a marginal fraction (e.g. 10%) of reads in the process. Keeping more lower-quality reads doesn't really help anything in the end (remember, this is all relative-abundance data anyway).

LucianaPMora commented 3 years ago

Thank you for your answer. I think the same; indeed, I ran the same analysis without expected-errors filtering and the main conclusions are the same. I just feel like I am losing too much information if I keep only 40-50% of the raw sequences after merging, but I understand what you said about quality.

In my case, the F sequences have really good quality, and their length is enough to cover the amplicon. I wonder if I should truncate the R sequences further so that more of them pass the filter at good quality (even if they are only 90 bases long), and perhaps get more than 50% of sequences through merging. I just don't know how many reads usually pass these filters, so I am not sure whether this is normal.

Thank you very much again, this thread has been very useful.