wdwvt1 opened this issue 7 years ago
It looks like you have several things going on here. First off, the 600 cycle kits are not recommended due to quality issues: http://seqanswers.com/forums/showthread.php?t=73926
Second, how long is your amplicon? Your quality profiles suggest that you may be sequencing off the end of the template entirely.
To answer your questions directly, I would be okay with losing 55-65% of the sequences given these quality profiles. Depending on your environment, shallow-ish sequencing depth is usually enough to reach saturation anyways (we typically target 100k/sample for gut microbiome). In fact, my first pass would have been to use truncLen=c(175, 100), but that depends on the amplicon length.
I think in general it is a good idea to do paired end merging as this gives you another opportunity to error correct.
Interesting idea with the grid search, but I'm not sure what sort of objective function would be helpful. Is maximum sequence count actually what you want? Or concordance with a known mock community, for example?
Thanks for the response @fanli-gcb.
Second, how long is your amplicon? Your quality profiles suggest that you may be sequencing off the end of the template entirely.
The amplicon is 300 bp, so not sequencing off the end of the template.
To answer your questions directly, I would be okay with losing 55-65% of the sequences given these quality profiles. Depending on your environment, shallow-ish sequencing depth is usually enough to reach saturation anyways (we typically target 100k/sample for gut microbiome).
This is from a gut environment, and I'd be happy with 20,000 seqs/sample, but given the high level of multiplexing on this run I don't have a lot of wiggle room if I lose 60% of my sequences off the bat.
In fact, my first pass would have been to use truncLen=c(175, 100), but that depends on the amplicon length.
Using these parameters I lose 40-50% of sequences. Certainly better, but still seems high. However, given that you are comfortable with this level, this might be the way I go.
It seems like the truncQ parameter is controlling my data mostly. If I turn truncQ up to 11, I retain 90% of my sequences. What is the maximum truncQ you'd be comfortable with?
It seems like the truncQ parameter is controlling my data mostly. If I turn truncQ up to 11, I retain 90% of my sequences. What is the maximum truncQ you'd be comfortable with?
truncQ and truncLen shouldn't affect the fraction of sequences retained unless the truncated reads are no longer at least minLen in length. Since the default is minLen=20, I don't see how truncQ=11 would make much of a difference in your data.
Personally, I don't place much stock in truncQ, but rely on maxEE as the primary filter. Sometimes you have positions with lower quality scores due to lower base diversity.
Think about it this way - if, for example, I see Q=9 at base 19 for a given read, I'm not necessarily going to think that all the subsequent bases are bad. I'd rather allow the read to be included and let the error modeling and read merging correct that base for me.
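To make the maxEE side of that concrete, here is a minimal sketch of the expected-error arithmetic (the quality scores below are hypothetical): each base with quality Q contributes 10^(-Q/10) expected errors, so a single Q=9 base near the start of a read does not by itself fail a maxEE=2 filter.

```r
# hypothetical read: one Q=9 base at position 19, Q=35 everywhere else
q <- c(rep(35, 18), 9, rep(35, 100))
10^(-9/10)       # ~0.13 expected errors contributed by the lone Q=9 base
sum(10^(-q/10))  # ~0.16 expected errors for the whole read; passes maxEE=2 easily
```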
This is from a gut environment, and I'd be happy with 20,000 seqs/sample, but given the high level of multiplexing on this run I don't have a lot of wiggle room if I lose 60% of my sequences off the bat.
Any chance you can resequence? That's preferable to (and cheaper than) drowning out your biological signal with bad data. :P
truncQ and truncLen shouldn't affect the fraction of sequences retained unless the truncated reads are no longer at least minLen in length. Since the default is minLen=20, I don't see how truncQ=11 would make much of a difference in your data.
I am not sure why it would either - but it certainly makes a much larger difference than maxEE. I don't think I understand what is happening with this parameter. For example, using the commands listed below with my dataset I get the following:
truncQ | minLen | num output sequences
---|---|---
2 | 20 | 42350
11 | 20 | 18237
25 | 20 | 5371
2 | 100 | 42350
11 | 100 | 18237
25 | 100 | 5371
2 | 126 | 0
11 | 126 | 0
25 | 126 | 0
When the minLen is 126 I get no sequences. That makes sense, since I am truncating the reverse read at 125. However, the fact that the number of sequences is the same at any other minLen is very strange. For instance, with q=2, whether I set minLen at 20 or 100 I get the same number of sequences. This would mean that no read has its first quality-2 base anywhere between positions 20 and 100. The same phenomenon holds at q=25 and q=30 (not shown). This seems unlikely.
I am using R 3.3.1, dada2 1.3.5, Rcpp 0.12.10.
sessionInfo()
R version 3.3.1 (2016-06-21)
...
other attached packages:
[1] dada2_1.3.5 Rcpp_0.12.10
# minLen=20
# truncQ=2
filterAndTrim(fwd=file.path(fwd_path, fwd_fastq), filt=file.path(fwd_filt_path, fwd_fastq), rev=file.path(rev_path, rev_fastq), filt.rev=file.path(rev_filt_path, rev_fastq), truncLen=c(225, 125), maxEE=c(2,2), truncQ=2, maxN=0, rm.phix=TRUE, compress=TRUE, verbose=TRUE, multithread=FALSE, minLen=20)
Read in 46994 paired-sequences, output 42350 (90.1%) filtered paired-sequences.
# minLen=100
# truncQ=2
filterAndTrim(fwd=file.path(fwd_path, fwd_fastq), filt=file.path(fwd_filt_path, fwd_fastq), rev=file.path(rev_path, rev_fastq), filt.rev=file.path(rev_filt_path, rev_fastq), truncLen=c(225, 125), maxEE=c(2,2), truncQ=2, maxN=0, rm.phix=TRUE, compress=TRUE, verbose=TRUE, multithread=FALSE, minLen=100)
Read in 46994 paired-sequences, output 42350 (90.1%) filtered paired-sequences.
# minLen=126
# truncQ=2
filterAndTrim(fwd=file.path(fwd_path, fwd_fastq), filt=file.path(fwd_filt_path, fwd_fastq), rev=file.path(rev_path, rev_fastq), filt.rev=file.path(rev_filt_path, rev_fastq), truncLen=c(225, 125), maxEE=c(2,2), truncQ=2, maxN=0, rm.phix=TRUE, compress=TRUE, verbose=TRUE, multithread=FALSE, minLen=126)
Read in 46994 paired-sequences, output 0 (0%) filtered paired-sequences.
# minLen=20
# truncQ=11
filterAndTrim(fwd=file.path(fwd_path, fwd_fastq), filt=file.path(fwd_filt_path, fwd_fastq), rev=file.path(rev_path, rev_fastq), filt.rev=file.path(rev_filt_path, rev_fastq), truncLen=c(225, 125), maxEE=c(2,2), truncQ=11, maxN=0, rm.phix=TRUE, compress=TRUE, verbose=TRUE, multithread=FALSE, minLen=20)
Read in 46994 paired-sequences, output 18237 (38.8%) filtered paired-sequences.
# minLen=100
# truncQ=11
filterAndTrim(fwd=file.path(fwd_path, fwd_fastq), filt=file.path(fwd_filt_path, fwd_fastq), rev=file.path(rev_path, rev_fastq), filt.rev=file.path(rev_filt_path, rev_fastq), truncLen=c(225, 125), maxEE=c(2,2), truncQ=11, maxN=0, rm.phix=TRUE, compress=TRUE, verbose=TRUE, multithread=FALSE, minLen=100)
Read in 46994 paired-sequences, output 18237 (38.8%) filtered paired-sequences.
# minLen=126
# truncQ=11
filterAndTrim(fwd=file.path(fwd_path, fwd_fastq), filt=file.path(fwd_filt_path, fwd_fastq), rev=file.path(rev_path, rev_fastq), filt.rev=file.path(rev_filt_path, rev_fastq), truncLen=c(225, 125), maxEE=c(2,2), truncQ=11, maxN=0, rm.phix=TRUE, compress=TRUE, verbose=TRUE, multithread=FALSE, minLen=126)
Read in 46994 paired-sequences, output 0 (0%) filtered paired-sequences.
# minLen=20
# truncQ=25
filterAndTrim(fwd=file.path(fwd_path, fwd_fastq), filt=file.path(fwd_filt_path, fwd_fastq), rev=file.path(rev_path, rev_fastq), filt.rev=file.path(rev_filt_path, rev_fastq), truncLen=c(225, 125), maxEE=c(2,2), truncQ=25, maxN=0, rm.phix=TRUE, compress=TRUE, verbose=TRUE, multithread=FALSE, minLen=20)
Read in 46994 paired-sequences, output 5371 (11.4%) filtered paired-sequences.
# minLen=100
# truncQ=25
filterAndTrim(fwd=file.path(fwd_path, fwd_fastq), filt=file.path(fwd_filt_path, fwd_fastq), rev=file.path(rev_path, rev_fastq), filt.rev=file.path(rev_filt_path, rev_fastq), truncLen=c(225, 125), maxEE=c(2,2), truncQ=25, maxN=0, rm.phix=TRUE, compress=TRUE, verbose=TRUE, multithread=FALSE, minLen=100)
Read in 46994 paired-sequences, output 5371 (11.4%) filtered paired-sequences.
# minLen=126
# truncQ=25
filterAndTrim(fwd=file.path(fwd_path, fwd_fastq), filt=file.path(fwd_filt_path, fwd_fastq), rev=file.path(rev_path, rev_fastq), filt.rev=file.path(rev_filt_path, rev_fastq), truncLen=c(225, 125), maxEE=c(2,2), truncQ=25, maxN=0, rm.phix=TRUE, compress=TRUE, verbose=TRUE, multithread=FALSE, minLen=126)
Read in 62583 paired-sequences, output 0 (0%) filtered paired-sequences.
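A quick way to check where a given truncQ would cut these reads (a sketch, assuming the ShortRead Bioconductor package and a single hypothetical reverse-read file "R2.fastq.gz"): find the position of the first base at or below the truncQ threshold in each read.

```r
library(ShortRead)
fq   <- readFastq("R2.fastq.gz")          # hypothetical file name
qmat <- as(quality(fq), "matrix")         # reads x cycles, Phred scores
first_low <- apply(qmat, 1, function(q) {
  hit <- which(q <= 11)                   # truncQ=11 in this example
  if (length(hit) > 0) hit[1] else NA_integer_
})
summary(first_low)                        # where reads would be truncated at truncQ=11
mean(first_low <= 125, na.rm = TRUE)      # fraction that would end up shorter than truncLen=125
```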
Agree, definitely seems unlikely. Can you share a fastq file that results in this behavior? I'm looking through the fastqFilter code and it seems to me that minLen is being correctly enforced.
@fanli-gcb - thanks very much for taking a look. Where should I send the files?
You can directly attach a small gzipped fastq file here. Perhaps a small reproducible example would be best.
I could probably write several paragraphs on my opinions on the best way to filter, but I think the most important thing to understand is that there are multiple right ways to filter, and the number one goal is to avoid doing it wrong. Optimizing is nice, but as @fanli-gcb mentioned, it is not always obvious what the right objective function to optimize is.
On the minLen parameter: minLen is redundant with truncLen. truncLen both truncates to the specified length and throws away any sequences shorter than truncLen; minLen does just the second part of that. For Illumina 16S sequencing, I don't think there is any reason to use minLen.
On setting quality parameters: In general we recommend using maxEE as the primary quality filter. truncQ is mostly there to remove very low quality sequences, and is included as a default in part because in older Illumina pipelines (Casava <1.8) Q=2 was a special value that indicated you should stop using the sequence.
In general I (1) pick truncLen parameters that avoid the worst parts of the quality profiles but ensure that enough sequence is kept to overlap healthily (truncLen[[1]] + truncLen[[2]] > amplicon_length + 25), erring a bit on the more-sequence side, (2) leave truncQ=2, and (3) try a couple of maxEE values until I get a satisfactory number of reads through the filter. Eyeballing the posted profile, I might start with truncLen=c(240,160), truncQ=2, maxEE=2, and then potentially relax maxEE a bit.
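As a concrete starting point, that suggestion would translate to something like the following (a sketch; fnFs/fnRs and filtFs/filtRs are placeholder vectors of input and output file paths):

```r
out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs,
                     truncLen = c(240, 160),  # avoid the worst of the quality profiles, keep enough overlap
                     truncQ = 2,              # left at the default
                     maxEE = c(2, 2),         # primary quality filter; relax if too few reads pass
                     maxN = 0, rm.phix = TRUE,
                     compress = TRUE, verbose = TRUE)
```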
On using just the forward reads: That is totally reasonable. In our tests dada2 has done really well with just forward reads, and when reverse reads are bad enough they cost more in sensitivity to low-frequency variants than they add via a lower false-positive rate. Here you have enough high-quality reverse sequence that I think the reverse reads are worth using, but truncating the forward reads at, say, 275 would also work well.
A final consideration: It is easier to combine data from different studies that used the same primer set if you keep the whole amplicon. So merging the paired reads makes the processed data a bit more reusable in the future.
@fanli-gcb @benjjneb Thanks very much for the suggestions; this was very helpful.
I could probably write several paragraphs on my opinions on the best way to filter, but I think the most important thing to understand is that there are multiple right ways to filter, and the number one goal is to avoid doing it wrong. Optimizing is nice, but as @fanli-gcb mentioned, it is not always obvious what the right objective function to optimize is.
Is this something you are planning on doing at some point? I think it would be a tremendous help to the community. One of my frustrations in writing code for QIIME 1 was always that we had insufficient intuition-building documentation/examples. Knowing what a parameter mechanically does is very distinct from knowing where varying that parameter falls in the decision tree.
Is this something you are planning on doing at some point? I think it would be a tremendous help to the community. One of my frustrations in writing code for QIIME 1 was always that we had insufficient intuition-building documentation/examples. Knowing what a parameter mechanically does is very distinct from knowing where varying that parameter falls in the decision tree.
It's a good idea.
A final consideration: It is easier to combine data from different studies that used the same primer set if you keep the whole amplicon. So merging the paired reads makes the processed data a bit more reusable in the future.
I was just thinking about this today while merging a few runs together. Also from the big data tutorial:
Note: The trimming parameters of different runs should be identical if you want to simply merge them together later (otherwise the sequences aren’t directly comparable).
Do you think the exact trimming parameters need to be identical to merge runs, or is it sufficient for the merged amplicons to represent the same region of the 16S gene? In other words, I think you'd need trimLeft to be consistent, but truncLen could vary as long as you have enough remaining to form merged contigs.
Do you think the exact trimming parameters need to be identical to merge runs, or is it sufficient for the merged amplicons to represent the same region of the 16S gene? In other words, I think you'd need trimLeft to be consistent, but truncLen could vary as long as you have enough remaining to form merged contigs.
Yep, you are right. Updated the Big Data tutorial as suggested: 8e051c44e44e5beaf221632f5dc5d435831211e0
If you're merging paired reads from both ends, then the requirement for merging abundance data across several runs is only that they are amplified from the same locus, since the read-pair merge implies that the whole amplicon is included, regardless of the specific trimming parameters.
Different filtering parameters applied to different runs could introduce run-specific bias, so you probably should not change those between runs.
There's always a consideration of merging "shiftmeras", sequences that are actually the same except for a small shift on either end. This is true even within a run, though.
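As an illustration of combining runs that were processed separately (a sketch; st_run1 and st_run2 are hypothetical per-run sequence tables from makeSequenceTable()):

```r
library(dada2)
st_all <- mergeSequenceTables(st_run1, st_run2)
# collapseNoMismatch() combines sequences that are identical up to shifts and/or
# length, which addresses the "shiftmera" concern mentioned above
st_all <- collapseNoMismatch(st_all)
seqtab_nochim <- removeBimeraDenovo(st_all, method = "consensus",
                                    multithread = TRUE, verbose = TRUE)
```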
I started my MSc in September and I have to use a pipeline for 2x300 MiSeq data (V3-V4 & ITS) from a soil microbiome. I've been using the DADA2 pipeline; as I was beginning to use R, I found it very convenient (I have no experience with other pipelines).
I've been searching for appropriate filter & trim parameters, and the multiple answers I found on this GitHub helped me. I am still wondering if there is some sort of Q-score threshold that is usually used as an indicator of a good filter & trim procedure.
What I mean is, what should a filtFs / filtRs quality profile look like?
@SimonMorvan Sorry missed this one a little bit.
There is no one threshold, and the trimming params are something that are best chosen by looking at the quality profile of your data and picking what's appropriate.
The goal is just to pick something that works well -- usually there is a range of right choices. In order, the things that you need to meet are: (1) maintain overlap after truncation - if you don't have 20 nts of overlap after the truncLen is enforced, the merging will fail; (2) truncate off the low-quality tails - it is common for quality crashes to occur in Illumina sequencing, especially in reverse reads, and it is best to trim those off as much as you can.
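For point (1), the arithmetic is just the following (a sketch with hypothetical values; plug in your own amplicon length and candidate truncLen):

```r
amplicon_length <- 500           # hypothetical sequenced amplicon length (nt)
truncLen <- c(275, 250)          # candidate forward/reverse truncation lengths
sum(truncLen) - amplicon_length  # overlap left for merging; want at least ~20 nt
```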
Thanks for your answer @benjjneb. My forward reads after filt&trim have a minimum Q score of around 30, while my reverse reads have a minimum Q score of around 25. In the end, I have 50,000-60,000 non-chimeric reads. Should I be more stringent in the filt&trim parameters?
Can you state the basic information on your amplicon sequencing setup?
What is the size of your amplicon (or what primers are you using)?
Are the primers on the reads?
What are the plotQualityProfile plots from the F/R reads?
@benjjneb Illumina MiSeq 2x300. Amplified region: 16S V3-V4.
Primers - Forward 305F: CCTACGGGNGGCWGCAG (17 nt); Reverse 801R: GACTACHVGGGTATCTAATCC (21 nt). I removed them as advised in the chimera-removal section.
The plotQualityProfile plots from the raw reads (forward and reverse) are attached.
out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs,
truncLen=c(275,250), ## Cuts the forward and reverse reads at desired lengths
trimLeft=c(17,21), ## Removes the primers
maxN=0,
maxEE=c(3,5),
truncQ=2,
rm.phix=TRUE,
compress=TRUE,
multithread=TRUE) ## On Windows set multithread=FALSE
The plotQualityProfile plots from the filtered and trimmed reads (forward and reverse) are attached.
mergers20 <- mergePairs(dadaFs, derepFs, dadaRs, derepRs, minOverlap=20, verbose=TRUE)
seqtab.nochim <- removeBimeraDenovo(seqtab, method="consensus", multithread=TRUE, verbose=TRUE)
42.3% of the sequences were chimeric, accounting for about 16.4% of the total sequence reads.
> head(track)
input filtered denoised merged tabled nonchim
SB_8 168857 107518 107518 68189 68189 54476
SB_27 168075 95209 95209 66930 66930 60890
SB_33 181031 115789 115789 74811 74811 60120
Thank you for your time!
That looks pretty reasonable to me. There isn't much more trimming you can do as it appears you are at the limit for 20bp overlap. The maxEE setting for R2 is a bit high IMO - I'd try leaving it at 3 as well and see how many more sequences you lose. Generally prefer to be conservative with discarding sequences, as it appears you have plenty of data left over anyways.
Agree with @fanli-gcb that looks pretty reasonable.
The one thing I might be slightly concerned about is the fraction of reads successfully merging (only ~65%), and whether some are being lost due to insufficient overlap. That's also because your sequenced amplicon is about 500nts, and you've trimmed to 275+250=525, which is close to the 20nt minimum overlap, and hence shorter than average sequences might be getting lost in merging.
I'd try the following on a subset of samples, and see if a larger fraction of reads get merged: truncLen=c(280,255) and minOverlap=12. If not, I'd just go with what you've got already.
@fanli-gcb @benjjneb
This is what I obtain using a 20 nt minimum overlap. But as you can see, I obtained the exact same results using a 50 nt minimum overlap.
summary(mergers20$SB_27$nmatch)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   57.0    63.0    84.0    76.3    85.0   103.0
summary(mergers20$SB_33$nmatch)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  57.00   71.00   85.00   78.19   85.00  153.00
summary(mergers20$SB_8$nmatch)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  57.00   73.00   85.00   79.21   85.00  149.00
Hello @fanli-gcb @benjjneb, I applied the modifications that you advised. Here is what I obtained.
Method 1: In this method, I used truncLen=c(280,255) and minOverlap=12 (I used truncLen=c(275,250) and minOverlap=20 in the previous method). I left maxEE unchanged at c(3,5).
As you predicted @benjjneb, the proportion of successfully merged reads increased but it seems that it wasn't due to the reduced overlap as the minimum overlap in the 3 samples is 67 nt.
Merging
Final:
Sample | input | Filtered and denoised | %/input | Merged | %/input | %/filtered | Non-chimeric | %/input | %/merged
---|---|---|---|---|---|---|---|---|---
SB8 | 168857 | 105765 | 57% | 66736 | 40% | 70% | 54309 | 32% | 80%
SB27 | 168075 | 93654 | 51% | 65707 | 40% | 78% | 59993 | 36% | 91%
SB33 | 181031 | 113858 | 57% | 72955 | 41% | 72% | 59878 | 33% | 80%
vs. what I had before:
Sample | input | Filtered and denoised | %/input | Merged | %/input | %/filtered | Non-chimeric | %/input | %/merged
---|---|---|---|---|---|---|---|---|---
SB8 | 168857 | 107518 | 64% | 68189 | 40% | 63% | 54476 | 32% | 80%
SB27 | 168075 | 95209 | 57% | 66930 | 40% | 70% | 60890 | 36% | 91%
SB33 | 181031 | 115789 | 64% | 74811 | 41% | 65% | 60120 | 33% | 80%
Method 2:
In this method, I reduced the maxEE for the reverse reads (maxEE=c(3,3)) as @fanli-gcb suggested, and used the same parameters as the first method (truncLen=c(280,255) and minOverlap=12).
With this method, I lose some reads but the proportions kept in each step are roughly the same as the first method I used. The quality must have improved although it is not quite visible on the quality profiles.
Forward filtered reads
Reverse filtered reads
Merging
Final :
Sample | input | Filtered and denoised | %/input | Merged | %/input | %/filtered | Non-chimeric | %/input | %/merged
---|---|---|---|---|---|---|---|---|---
SB8 | 168857 | 96832 | 57% | 61152 | 36% | 63% | 50005 | 30% | 82%
SB27 | 168075 | 85938 | 51% | 60393 | 36% | 70% | 55300 | 33% | 92%
SB33 | 181031 | 103932 | 57% | 66978 | 37% | 64% | 55832 | 31% | 83%
vs what I had before:
Sample | input | Filtered and denoised | %/input | Merged | %/input | %/filtered | Non-chimeric | %/input | %/merged
---|---|---|---|---|---|---|---|---|---
SB8 | 168857 | 107518 | 64% | 68189 | 40% | 63% | 54476 | 32% | 80%
SB27 | 168075 | 95209 | 57% | 66930 | 40% | 70% | 60890 | 36% | 91%
SB33 | 181031 | 115789 | 64% | 74811 | 41% | 65% | 60120 | 33% | 80%
In your opinion, is there one "better" than the other?
Hi @benjjneb I came to this thread searching for info on how truncLen treats sequences of lengths <truncLen and found your explanation:
truncLen both truncates to the specified length, and throws away any sequences <truncLen.
Is there a particular reason that truncLen throws out such sequences? I've noticed in some samples there are amplicons much smaller than the majority of amplicons; they turn out to be host contamination, but I feel like I should exclude such contamination after taxonomy assignment rather than this early in the analysis. Do you have any thoughts about that?
Thanks!
Is there a particular reason that truncLen throws out such sequences?
Good reason: That is usually the desirable behavior and in general we would recommend it.
Historical reason: The original version of dada2 required same-length sequences, but that's no longer the case.
I've noticed in some samples there are amplicons much smaller than the majority of amplicons; they turn out to be host contamination, but I feel like I should exclude such contamination after taxonomy assignment rather than this early in the analysis. Do you have any thoughts about that?
It's fair to want to do it that way, and I agree it can be useful in some cases. We are planning on adding a trimRight parameter to filtering, and I think this is another reason to do so: https://github.com/benjjneb/dada2/issues/471#issuecomment-389569776
Is there a way to truncate based on expected errors instead of minimum quality score?
I find (as above) that setting a minimum quality with truncQ is much more stringent and intuitively less relevant than filtering based on expected errors. However, the maxEE filtering is dependent on the sequence length, because if you set your truncLen too long you will have a lot of expected errors due to poor-quality bases at the end. It would be nice to go along the read until you hit >= 2 expected errors and truncate the sequence there. Then you could choose the truncLen based on the length distribution of only the not-terrible sequences, which would also make the median quality score metric more relevant (not dragged down by seqs you will toss no matter what).
As is, I usually approximate this in multiple steps - choosing several truncLen options and seeing how many reads are retained for each at my maxEE cutoff. However, a "truncE" would streamline this decision process, and provide a view of sequence quality based only on reasonable reads.
I find (as above) that setting a minimum quality with truncQ is much more stringent and intuitively less relevant than filtering based on expected errors. However, the maxEE filtering is dependent on the sequence length because if you set your truncLen too long you will have a lot of expected errors due to poor quality bases at the end. It would be nice to go until you hit >= 2 maxEE, and truncate the sequences there.
There is a reason that neither truncQ nor truncEE would be advisable as the main filtering strategy in most circumstances: the core algorithm relies on repeated observations of the exact same sequence to identify biological variants, and "ragged" filtering splits identical variants into multiple "truncation alleles" based on where exactly they fell below, e.g., a truncEE cutoff. While the algorithm will recombine those alleles, it still loses sensitivity to rarer variants. For example, a sequence variant supported by 10 reads could be split into 10 different truncation alleles, and then it would not be detected, as each alone is a singleton.
Then you could choose the truncLen based on the length distribution of only not-terrible sequences, which would also make the median quality score metric more relevant (not dragged down by seqs you will toss no matter what).
As is, I usually approximate this in multiple steps- choosing several truncLen options and seeing how many reads are retained for each at my maxEE cutoff. However, a "truncE" would streamline this decision process, and provide a view of sequence quality based only on reasonable reads.
This is an interesting idea though. I think the way we would implement it would be as a maxEEstats informational function, i.e. it wouldn't be rolled into the filtering function (to avoid ragged filtering), but would inform the choice of filtering parameters.
I like it, we should do something like this.
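To sketch what such an informational helper might compute (this is not an existing dada2 function; it assumes the ShortRead package and a hypothetical file "R1.fastq.gz"): cumulative expected errors along the read, summarized across reads, to see where a given maxEE budget would be exhausted.

```r
library(ShortRead)
fq   <- readFastq("R1.fastq.gz")
qmat <- as(quality(fq), "matrix")               # reads x cycles, Phred scores
cum_ee <- t(apply(10^(-qmat / 10), 1, cumsum))  # cumulative expected errors per read
ee_q <- apply(cum_ee, 2, quantile, probs = c(0.5, 0.9), na.rm = TRUE)
# last cycle at which 90% of reads are still within a maxEE=2 budget,
# i.e. a candidate upper bound for truncLen on this file
max(which(ee_q["90%", ] <= 2))
```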
USEARCH has had the maxEEstats function for quite some time now. I always use that to decide on truncLen and maxEE; of course it would be great to have it as part of DADA2 as well.
Hi everyone, based on everything I read here, I wrote, several months ago, an R script for my lab mates: https://github.com/RemiMaglione/r-scripts/blob/master/Qual_vs_MaxEE_plot.R
This script is a kind of decision tool that can help you visually decide what parameters to choose for both maxEE and truncLen. It's based on read quality extracted with FastQC, and it produces this kind of output:
Since I never stopped using it, I thought I had to share it with you. Enjoy!
@benjjneb: I didn't know if I had to create a new thread; I thought it was a good idea to drop it into this one since everything started here. Feel free to move it elsewhere ;) And btw, in the same spirit as this script, it could be interesting to have a dada2 function that draws the maxEE plot like the plotQualityProfile() function: plotMaxEEProfile() or plotQualityProfile(..., maxEE=TRUE)?
Very cool plots @RemiMaglione
At least in the case of this particular data, my interpretation is that increasing maxEE doesn't buy you a whole lot of additional reads. Trimming would be much more effective.
Sad to see that the 2x300bp reads are still so dicey
@RemiMaglione That is pretty awesome.
it could be interesting to have a dada2 function that draws the maxEE plot like the plotQualityProfile() function: plotMaxEEProfile() or plotQualityProfile(..., maxEE=TRUE)?
Agreed, your plots make a pretty compelling case for adding something like that.
my interpretation is that increasing maxEE doesn't buy you a whole lot of additional reads. Trimming would be much more effective.
This is usually the case just due to the nature of the Illumina quality "crash" that often occurs at the end of reads (especially 2x300 kit).
Hello, your information has been very useful. In my case, I had to choose between letting around 75% of the raw sequences pass the filterAndTrim function with a 120 bp length for the reverse reads (bad quality), or letting 85% of the raw sequences pass with a 90 bp length for the reverse reads. In the end, I decided to use the first filter, so 75% of the raw sequences passed the filterAndTrim step. After merging, around 40-50% of the raw sequences remain. Do you think that is ok, or should I be less strict with the filtering steps? Could you point me to a good paper so I can read about it? My samples are bacterial 16S rRNA Illumina MiSeq and the amplicon length is 250 bp. Thank you all.
@LucianaPMora There's nothing wrong with stricter filtering when you are just giving up a marginal fraction (e.g. 10%) of reads in the process. More lower-quality reads doesn't really help anything at the end (remember, this is all relative abundance data anyway).
Thank you for your answer. I think the same; indeed, I performed the same analysis with no expected-errors filtering and the main conclusions were the same. I just feel like I am losing too much information if I keep just 40-50% of the raw sequences after merging, but I understand what you said about the quality.
In my case, I have really good quality in the forward sequences and their length is enough to cover the amplicon. I wonder if I should shorten the reverse sequences so that more of them can pass the filter with good quality (even if they are only 90 bases long), and perhaps get more than 50% of sequences after merging. I just don't know how many reads usually pass these filters, so I am not sure if this is normal.
Thank you very much again, this thread has been very useful.
Working with several different sequencing runs of various qualities, I am trying to develop intuition for what reasonable quality filtration parameters are in the function filterAndTrim. The relevant parameters are truncLen, maxEE, and maxN (based on my reading). Looking through closed issues, I would summarize the guidance as follows:
- Thread: Maintain at least 10% of raw reads after all filtration steps (filterAndTrim as well as downstream merging). Use maxEE as the primary filtering parameter. Values from 2 to 6 are shown in various issues. Run a subset of samples through the entire pipeline to determine if the full pipeline produces too few features.
- Thread: 60% of reads passing filterAndTrim is good.
- Thread: Finding the best trimming location may just mean doing a grid search.
- Thread: For merging reads, don't go below an 8-nucleotide overlap.
- Thread: Chimeras as a percentage of sequences should generally be less than 30%.
My situation is the following: 300bp paired-end reads (full overlap of forward and reverse reads) with forward reads generally much higher quality than reverse reads. [Forward and reverse read quality-profile plots attached.]
Forward and reverse reads seem like they ought to successfully span the full amplicon at fairly high quality using truncLen=c(225, 125). However, using truncLen=c(225, 125), maxEE=c(2,2), truncQ=11, I am losing 55-65% of sequences on average. Two questions:
1. Is this a reasonable number of sequences to lose at the filter step? Do I need to do a grid search through length, maxEE, and truncQ with some sort of function that maximizes sequences without dropping quality below a certain value? What maxEE/min truncQ would you use during this search?
2. In your opinion, when you have fully overlapping amplicons, is paired-end merging worthwhile? On one hand, given that forward reads will have low quality towards the end and reverse reads high quality at the beginning, it seems that merging ought to give me full amplicons at reasonable confidence. On the other hand, if I just used the entire forward read (or even the first 250 nt), would I capture about 90% of what I'd get with a 300bp amplicon, and significantly reduce my computational headaches?
As usual, thanks for the great tool and the helpful feedback. The issues really are nice to have around to get a handle for the thinking behind the parameters etc.