benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0

Filtering forward reads #1216

Closed arielganz closed 3 years ago

arielganz commented 3 years ago

Hi there! Sorry if this is already answered elsewhere, but I'm having a hard time figuring it out, so I thought I would ask.

I have some reads that had paired-end sequencing (V4, 515F, 806R) on the NextSeq 500 platform. They were done by uBiome before they shut down. Unfortunately they were only sequenced 150 bp in each direction (to ~50k-150k depth). The read quality on the reverse reads was poor and there isn't overlap, so I'm proceeding with the forward reads only.

I'm using the following code to filter them, but I'm having trouble retaining high quality while keeping enough reads. I have tried playing around with raising maxEE to 3 or lowering minLen to 120, for example, but it didn't make a big difference. (This is on a subset of the 312 samples.)

out <- filterAndTrim(fwd = fnFs[1:20], filt = noprimerFs[1:20], rev = NULL, filt.rev = NULL, trimLeft = FWD_PRIMER_LEN, maxN = 0, maxEE = 2, truncQ = 20, minLen = 128, rm.phix = TRUE, compress = TRUE, multithread = TRUE)

out:

    file                                        reads.in  reads.out
    185326469_NA0021455372_5_gut_R1_001.fastq     135282      16279
    187325115_NA0021455068_5_gut_R1_001.fastq     142155      45537
    187325367_NA0021458504_5_gut_R1_001.fastq     212854      62745
    212325545_NA0021458494_5_gut_R1_001.fastq      97348       9198
    218297426_NA0021483924_5_gut_R1_001.fastq     125345      12828
    219326445_NA0021455030_5_gut_R1_001.fastq      47400      18033
    221298791_NA0021494141_5_gut_R1_001.fastq      88499       1804
    223297294_NA0021483948_5_gut_R1_001.fastq     179186      19447
    224325473_NA0021455033_5_gut_R1_001.fastq      15090       1582
    224325538_NA0021455026_5_gut_R1_001.fastq     416169     247789
    228325689_NA0021455050_5_gut_R1_001.fastq     197234      92725
    229326047_NA0021455045_5_gut_R1_001.fastq     401213     135879
    235325831_NA0021455066_5_gut_R1_001.fastq      46068       1298
    235326231_NA0021455052_5_gut_R1_001.fastq     106563      10724
    238326299_NA0021455037_5_gut_R1_001.fastq     119288      11946
    241325749_NA0021455054_5_gut_R1_001.fastq      70673       5170
    241326409_NA0021455056_5_gut_R1_001.fastq      73373       5889
    257325655_NA0021455040_5_gut_R1_001.fastq     174949       2375
    281326332_NA0021458506_5_gut_R1_001.fastq     117758      31279
    281326390_NA0021458499_5_gut_R1_001.fastq      87021       3105

I'm slightly confused about maxEE and truncQ. What I'd really like to do, though I can't figure out how, is to use a quality cutoff of, for example, 28, but discard any reads that have >2 instances of Q < 28 (rather than cutting at the first instance of Q < 28). Is that what it is doing?

Examples of Before, plotQualityProfile(fnFs[8:10]): [quality profile plots]

Same samples After, plotQualityProfile(noprimerFs[8:10]): [quality profile plots]

It's also possible I need to drop a few samples rather than lowering the overall quality, but I don't have a huge number that I can drop given the sample size. Also, these are longitudinal, so it matters which ones get dropped as well.

There is no possibility of resequencing these, and they are quite unique, so I want to do the best job possible with this and draw conclusions at the level of resolution to which they can speak.

What is your suggestion around how to proceed?

arielganz commented 3 years ago

I actually think I will go with the in-house processing that uBiome did. Even though they didn't use denoising, they were able to use both forward and reverse reads, keep the average quality score above 30, and leverage their custom pipeline, which seems better than what would be possible here. I'd still be curious to understand what the right move would be for this type of data, as well as exactly what maxEE does.

benjjneb commented 3 years ago

I'm slightly confused about maxEE and truncQ. What I'd really like to do, though I can't figure out how, is to use a quality cutoff of, for example, 28, but discard any reads that have >2 instances of Q < 28 (rather than cutting at the first instance of Q < 28). Is that what it is doing?

maxEE calculates the number of expected errors in the read based on the quality scores, then filters out reads above the threshold that is set. This is typically better than a mean quality score filter, and you can read more about it here: https://doi.org/10.1093/bioinformatics/btv401
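For intuition, here is a minimal sketch (illustrative R, not dada2 internals) of the expected-error calculation: each Phred score Q corresponds to an error probability 10^(-Q/10), and EE is the sum of those probabilities over the whole read.

    # Illustrative only: expected errors (EE) from per-base Phred scores.
    q <- c(35, 35, 30, 28, 25, 20)   # hypothetical per-base qualities
    ee <- sum(10^(-q / 10))          # expected number of errors, ~0.016 here
    ee <= 2                          # TRUE: a maxEE=2 filter would keep this read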

filterAndTrim does not have built-in functionality to do the >2 instances of Q<28 filtering you are proposing, unfortunately.
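If you really wanted that behavior, you could roll it yourself outside of dada2. A minimal sketch using the ShortRead package (file names are hypothetical):

    # Sketch: keep only reads with at most 2 positions below Q28.
    library(ShortRead)
    fq <- readFastq("sample_R1.fastq")            # hypothetical input file
    qmat <- as(quality(fq), "matrix")             # per-base Phred scores (NA-padded)
    n.low <- rowSums(qmat < 28, na.rm = TRUE)     # low-quality positions per read
    writeFastq(fq[n.low <= 2], "sample_R1.filt.fastq.gz")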

I actually think I will go with the in-house processing that uBiome did. Even though they didn't use denoising, they were able to use both forward and reverse reads, keep the average quality score above 30, and leverage their custom pipeline, which seems better than what would be possible here. I'd still be curious to understand what the right move would be for this type of data, as well as exactly what maxEE does.

That seems reasonable. I'd expect that their in-house processing does a good job of dealing with the technical wrinkles in their data, which is a bit outside the current norm (binned Q scores, low overlap between reads). As to the best way forward if you did want to run DADA2 on this: using the forward reads alone might be fine, although I didn't see how low the quality was in the reverse reads, so I'm not 100% sure. But I'd probably keep essentially the entire forward reads, as they have good quality throughout.

A note: I'd also confirm whether (or not) the primers are included in the reads. Most V4 library preparations do not sequence the primers, in which case you wouldn't want to be trimLeft-ing at all.

arielganz commented 3 years ago

Thank you, Benji. Wow you're fast!

  1. Yes, they gave us the raw reads with primers included (which I verified by looking at the reads).
  2. I realized that, unfortunately, they switched their in-house processing partway through our longitudinal study, so I can compare only the first four time points that way; the fifth is different.
  3. What do you think then in terms of proceeding with these? For the reverse reads the quality drops off a lot faster, but there is zero overlap, so my understanding was that it wouldn't be able to run as paired-end anyway? Is that correct?

What do you mean that you would keep the entire forward reads? I'm confused; for the example above, should I get rid of truncQ at 22? That seems low? Regarding maxEE, is that an alternative to truncQ, e.g. if I use maxEE and then also put in some sort of minimum quality?

arielganz commented 3 years ago

When you said that the quality seems high throughout, do you mean in the "after"?

benjjneb commented 3 years ago

Thank you, Benji. Wow you're fast!

Ha, well, I do try not to work holidays, so not so fast over Thanksgiving.

When you said that the quality seems high throughout, do you mean in the "after"?

It seems high in the before too.

What do you mean that you would keep the entire forward reads?

I think I mistook your minLen for truncLen -- you aren't truncating at all, actually. I might throw in a truncLen=150 just to get rid of the 151st base if it is there.

I'm confused; for the example above, should I get rid of truncQ at 22? That seems low? Regarding maxEE, is that an alternative to truncQ, e.g. if I use maxEE and then also put in some sort of minimum quality?

Yes, maxEE is generally preferred to truncQ filtering, as truncQ introduces technical length variation. I would just ditch truncQ/minQ and use a maxEE threshold instead.
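Putting that together with the truncLen suggestion above, a sketch of the adjusted call (same object names as in your snippet) might look like:

    # Truncate to 150 bp, drop truncQ/minLen, and rely on maxEE for quality filtering.
    out <- filterAndTrim(fwd = fnFs, filt = noprimerFs, trimLeft = FWD_PRIMER_LEN,
                         maxN = 0, maxEE = 2, truncLen = 150, rm.phix = TRUE,
                         compress = TRUE, multithread = TRUE)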

arielganz commented 3 years ago

You are still so fast. I'm really inspired that you respond so rapidly to questions and have been doing so for years. It's an incredible resource that you are providing to the community and I really cannot express how moved I am by your devotion and dedication.

Thank you for the advice! I will try it :)

arielganz commented 3 years ago

This is what I got using the suggested parameters:

out <- filterAndTrim(fwd = fnFs[1:20], filt = noprimerFs[1:20], rev = NULL, filt.rev = NULL, trimLeft = FWD_PRIMER_LEN, maxN = 0, maxEE = 2, truncLen = 150, rm.phix = TRUE, compress = TRUE, multithread = TRUE)

out:

    file                                        reads.in  reads.out
    185326469_NA0021455372_5_gut_R1_001.fastq     135282     132842
    187325115_NA0021455068_5_gut_R1_001.fastq     142155     141178
    187325367_NA0021458504_5_gut_R1_001.fastq     212854     203441
    212325545_NA0021458494_5_gut_R1_001.fastq      97348      95118
    218297426_NA0021483924_5_gut_R1_001.fastq     125345     123766
    219326445_NA0021455030_5_gut_R1_001.fastq      47400      47202
    221298791_NA0021494141_5_gut_R1_001.fastq      88499      86826
    223297294_NA0021483948_5_gut_R1_001.fastq     179186     176167
    224325473_NA0021455033_5_gut_R1_001.fastq      15090      14849
    224325538_NA0021455026_5_gut_R1_001.fastq     416169     406081
    228325689_NA0021455050_5_gut_R1_001.fastq     197234     196532
    229326047_NA0021455045_5_gut_R1_001.fastq     401213     397217
    235325831_NA0021455066_5_gut_R1_001.fastq      46068      43560
    235326231_NA0021455052_5_gut_R1_001.fastq     106563     104491
    238326299_NA0021455037_5_gut_R1_001.fastq     119288     116586
    241325749_NA0021455054_5_gut_R1_001.fastq      70673      68681
    241326409_NA0021455056_5_gut_R1_001.fastq      73373      71522
    257325655_NA0021455040_5_gut_R1_001.fastq     174949     171367
    281326332_NA0021458506_5_gut_R1_001.fastq     117758     112895
    281326390_NA0021458499_5_gut_R1_001.fastq      87021      85338

I'm a little confused looking at the quality plots, because they look similar to the before. And it seems to still dip down quite low with the red line? Is this still considered "good quality"? Also, what quality cutoff does maxEE use, in terms of only two below the cutoff?

before (5-7): [quality profile plot]

after (5-7): [quality profile plot]

before (8-10): [quality profile plot]

after (8-10): [quality profile plot]

benjjneb commented 3 years ago

And it seems to still dip down quite low with the red line?

The red line is totally separate from the quality information; it is plotted against the right y-axis and shows the cumulative distribution of read lengths. This is making me reconsider this visual design choice though...

Is this still considered "good quality"?

I would say yes. Good is always relative, though; I'd describe it as good enough to be useful to the DADA2 algorithm for denoising.

what quality cutoff does maxEE use, in terms of only two below the cutoff?

maxEE imposes a threshold based on the entire read, not a single position. I really think the best way to understand what's going on there is from the manuscript that introduced this idea for amplicon sequencing quality filtering: https://doi.org/10.1093/bioinformatics/btv401

arielganz commented 3 years ago

The red line is totally separate from the quality information; it is plotted against the right y-axis and shows the cumulative distribution of read lengths. This is making me reconsider this visual design choice though...

Oh--got it--that was not apparent to me. The y-axis on the right doesn't show up for me. Regarding the quality, they look essentially the same before and after filtering, and it seems to be dropping almost to 25.

Thank you for the suggested reading around maxEE!! Will check it out.

benjjneb commented 3 years ago

Regarding the quality, they look essentially the same before and after filtering, and it seems to be dropping almost to 25.

Yep, that's OK because the initial quality was already high, so only a relatively small fraction of low-quality reads is being removed. Q=25, for instance, still corresponds to an estimated ~99.7% likelihood that the nucleotide at that position is correct.
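That figure is just the standard Phred conversion:

    # Phred score to the probability that the base call is correct:
    q <- 25
    1 - 10^(-q / 10)   # 0.99684..., i.e. ~99.7%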

arielganz commented 3 years ago

Thanks, Benji. Yes, I understand that, though I've seen others use, for example, a "twice below Q28-30" rule. I do understand that it isn't a perfect process either way though! Thank you for your help, I really learned a lot, especially about those quality plots!