c383d893 / AMF-LSU-Database-and-Pipeline2

Updated Database and pipeline for phylogenetic determination of AMF from environmental sequences

Deviations from ITS dada2 pipeline explained #7

Open connor-morozumi opened 3 months ago

connor-morozumi commented 3 months ago

Hello. Thanks for the helpful repo.

I am trying to understand some of the decisions you all made and wanted to get some clarification. I don't see a step where you pre-filter reads containing ambiguous bases (Ns) before running cutadapt, as the dada2 ITS workflow does. Did you all determine this was not necessary for the AMF primer set you developed?

https://benjjneb.github.io/dada2/ITS_workflow.html

fnFs.filtN <- file.path(path, "filtN", basename(fnFs)) # Put N-filtered files in filtN/ subdirectory
fnRs.filtN <- file.path(path, "filtN", basename(fnRs))
filterAndTrim(fnFs, fnFs.filtN, fnRs, fnRs.filtN, maxN = 0, multithread = TRUE)

It seems to me that most of the pipeline follows the standard 16S dada2 protocol, with some deviations such as the use of cutadapt and FastQC. I was hoping to learn how these choices were made. I'll email the authors as well, since this isn't a bug or issue per se. Thanks!

RobJamesRamos commented 3 months ago

Hello Connor, thanks so much for your feedback. I've looked into this a bit, but must admit I don't fully understand all the differences between the standard dada2 tutorial and the dada2 ITS protocol. It does seem that the filtN step is meant to solve a problem with identifying and removing short primer sequences: "The presence of ambiguous bases (Ns) in the sequencing reads makes accurate mapping of short primer sequences difficult" 1. I don't fully understand why this is a bigger issue in ITS data, since this step is missing from their non-ITS tutorial.

All that said, there may be other benefits to filtering out sequences that contain Ns that I am unaware of. Anecdotally, I don't think ambiguous Ns have been common in the data I have gotten from Illumina sequencing in the past. I would still be worried about changing the default pipeline to drop additional sequences if there were no solid benefit. I will keep this open as a feature request under review. If you or anyone else has additional context about this issue, I am happy to review it.
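
One way to put a number on how common Ns are in a given run is to count the reads that contain at least one N before deciding whether a pre-filter would drop much data. The following is only a rough sketch in base R: `count_n_reads` is a hypothetical helper, not part of this repo or of dada2, and it assumes plain four-line FASTQ records.

```r
# Hypothetical helper (not from this repo or dada2): count reads that contain
# at least one ambiguous base (N) in a FASTQ file.
# Assumes standard 4-line FASTQ records, with the sequence on line 2 of each.
count_n_reads <- function(fastq_path) {
  lines <- readLines(fastq_path)
  seqs <- lines[seq(2, length(lines), by = 4)]  # sequence lines only
  has_n <- grepl("N", seqs, fixed = TRUE)       # TRUE where a read has an N
  c(total = length(seqs), with_n = sum(has_n))
}

# Toy example: three reads, one of which contains an N.
tmp <- tempfile(fileext = ".fastq")
writeLines(c("@r1", "ACGT", "+", "IIII",
             "@r2", "ACNT", "+", "IIII",
             "@r3", "GGCC", "+", "IIII"), tmp)
print(count_n_reads(tmp))
```

If the `with_n` count is a tiny fraction of `total` for the raw files, a `maxN = 0` pre-filter would remove very few reads either way.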

connor-morozumi commented 3 months ago

Hi Rob,

Thanks for the quick reply. I agree it is not quite clear why they remove reads with N bases in a preceding step and then again once the primers are cut off (dada2 requires reads with no Ns, and maxN = 0 is the default within filterAndTrim). I will reach out to Benjamin to ask. In my past experience he responds quickly on his repo.

I think whether to implement a more ITS-flavored approach might come down to whether your AMF region is of variable length, as ITS is. Do we know whether this is the case for the region amplified by the primer set you all have developed? Since you are recovering your mock communities, it might not matter and the 16S-style pipeline will work just fine.
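
A quick way to look at length variability would be to tabulate the lengths of the inferred sequences. This is a minimal base R sketch: `summarise_lengths` is a hypothetical helper, and the input is assumed to be a character vector of sequences (for example, what `getSequences(seqtab)` returns in a dada2 run). How much spread counts as "variable" is a judgment call.

```r
# Hypothetical helper: summarise the length distribution of a set of sequences.
# `seqs` is assumed to be a character vector of DNA sequences.
summarise_lengths <- function(seqs) {
  lens <- nchar(seqs)
  list(range = range(lens),   # shortest and longest sequence length
       table = table(lens))   # number of sequences at each length
}

# Toy example with made-up sequences.
toy <- c("ACGT", "ACGTA", "ACGT", "ACGTACGT")
print(summarise_lengths(toy))
```

A narrow range would suggest fixed-length truncation is safe; a wide one would point toward the ITS-style approach.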

Many thanks!