benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0

Are NA results caused by filterAndTrim settings or by minBoot? Which parameter settings are appropriate? #2008

Open WP-SMILE opened 2 months ago

WP-SMILE commented 2 months ago

Dear dada2 enthusiasts,

I'm pretty new to dada2, so I would like to hear opinions on my analysis from experts who use dada2 regularly.

I have tried using filterAndTrim with multiple parameter setups (e.g. truncQ=2, 5, or 7; minLen = 50 or 75) and found that

out.plant <- filterAndTrim(plant.raw, filtered_plant, maxN=0, minLen = 50, maxEE=2, truncQ=2, rm.phix=TRUE, compress=TRUE, multithread=TRUE)

works best so far.

I followed the dada2 tutorial steps until I reached the assignTaxonomy function.

Then I ran:

plant.taxa <- assignTaxonomy(seqtab.nochim, plant.ref, multithread = TRUE, tryRC = TRUE, minBoot=50, outputBootstraps = TRUE)

With minBoot=50, half of the assignments came back as NA (which is not a good sign?).

I'm wondering whether I missed a step or should recalibrate the parameters. If so, which parameters are actually appropriate? I look forward to your advice, and please don't hesitate to ask any further questions.

Thank you very much :)

benjjneb commented 2 months ago

> I'm wondering whether I missed a step or should recalibrate the parameters. If so, which parameters are actually appropriate?

I wouldn't suggest recalibrating the parameters, which I guess in this case would be reducing minBoot to below 50.

The most common cause of chunks of the data having NA assignments (at which taxonomic level?) is the presence of sequences in the data that are not represented in the reference database. This could be because the database itself is incomplete. Another common cause is off-target sequences in the data, e.g. amplified host DNA in host microbiomes. There could also be "nonsense" sequences of various kinds (e.g. unremoved adapters, low complexity sequences, ...). plotComplexity is an easy way to check for low complexity sequences, and other tools like fastqc can help detect other things like adapters.
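To answer the "at which taxonomic level?" question, it can help to tabulate the NA fraction per rank. The following is a minimal base-R sketch on a toy matrix, not output from this dataset. The helper `na_fraction_by_rank` and the toy taxa are hypothetical. On a real run with `outputBootstraps = TRUE`, the taxonomy matrix is the `tax` element of the returned list (e.g. `plant.taxa$tax`).

```r
# Toy stand-in for the taxonomy matrix returned by assignTaxonomy();
# in a real run the row names are the ASV sequences.
ranks <- c("Kingdom", "Phylum", "Class", "Order", "Family", "Genus")
toy_taxa <- matrix(
  c("Viridiplantae", "Streptophyta", "Magnoliopsida", "Poales",  "Poaceae",  "Oryza",
    "Viridiplantae", "Streptophyta", "Magnoliopsida", "Fabales", "Fabaceae", NA,
    "Viridiplantae", "Streptophyta", "Magnoliopsida", NA,        NA,         NA,
    NA,              NA,             NA,              NA,        NA,         NA),
  ncol = length(ranks), byrow = TRUE,
  dimnames = list(paste0("ASV", 1:4), ranks)
)

# Hypothetical helper: fraction of ASVs left unassigned (NA) at each rank.
na_fraction_by_rank <- function(tax) colMeans(is.na(tax))

na_by_rank <- na_fraction_by_rank(toy_taxa)
print(na_by_rank)
```

A fraction that is already high at the first rank points toward off-target or nonsense sequences, while a fraction that only rises at the deeper ranks points toward limited resolution or reference gaps.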

WP-SMILE commented 2 months ago

> I wouldn't suggest recalibrating the parameters, which I guess in this case would be reducing minBoot to below 50.

Thank you for taking the time to advise us. I removed adapters and low-complexity sequences. I have tried reducing minBoot below 50 (down to 0), and the results depend on the minBoot setting. As far as I can see, most researchers use minBoot = 80 and very few use 50, so I am skeptical that results obtained with minBoot below 50 would be reliable.

> The most common cause of chunks of the data having NA assignments (at which taxonomic level?) is the presence of sequences in the data that are not represented in the reference database. This could be because the database itself is incomplete. Another common cause is off-target sequences in the data, e.g. amplified host DNA in host microbiomes. There could also be "nonsense" sequences of various kinds (e.g. unremoved adapters, low complexity sequences, ...). plotComplexity is an easy way to check for low complexity sequences, and other tools like fastqc can help detect other things like adapters.

I previously forgot to mention that (1) we used single-read sequencing (the resulting sequences are 100-150 bp long) and (2) the metabarcoding primer is trnL-P6.

We tried a large reference database (299,137 accession numbers). When we restricted the reference database to the flora of the study area, the results were the same. Thus, I don't think the reference database is the problem.

benjjneb commented 2 months ago

> I have tried reducing minBoot below 50 (down to 0), and the results depend on the minBoot setting.

To clarify what I said above: I would not suggest "recalibrating the parameters, which I guess in this case would be reducing minBoot to below 50."

> We tried a large reference database (299,137 accession numbers). When we restricted the reference database to the flora of the study area, the results were the same. Thus, I don't think the reference database is the problem.

Large is not the same as comprehensive when it comes to reference databases. There may be taxa that are in your data but not well-represented in the reference database. I am not very familiar with trnL primers, but off-target amplification is another possibility. Using an approach like BLAST-ing some unclassified sequences against nt might shed some light -- do they show up as something other than the trnL gene?
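One way to prepare the BLAST check is to export the unclassified ASVs to a FASTA file that can be submitted to NCBI BLAST against nt. This is a minimal base-R sketch under stated assumptions: the helper `write_unassigned_fasta`, the `unassigned_N` record names, and the toy sequences are all hypothetical, and the taxonomy matrix is assumed to carry the ASV sequences as row names.

```r
# Toy stand-in for the taxonomy matrix; row names play the role of ASV sequences.
toy_taxa <- matrix(
  c("Poaceae",  "Oryza",
    "Fabaceae", NA,
    NA,         NA),
  ncol = 2, byrow = TRUE,
  dimnames = list(c("ACGTACGTAC", "GGGTTTAAAC", "TTTTCCCCGG"),
                  c("Family", "Genus"))
)

# Hypothetical helper: write every ASV that is NA at `rank` to a FASTA file
# and return the exported sequences.
write_unassigned_fasta <- function(tax, rank, path) {
  seqs <- rownames(tax)[is.na(tax[, rank])]
  ids  <- paste0(">unassigned_", seq_along(seqs))
  # rbind() + as.vector() interleaves header and sequence lines.
  writeLines(as.vector(rbind(ids, seqs)), path)
  invisible(seqs)
}

fasta_path  <- tempfile(fileext = ".fasta")
unassigned  <- write_unassigned_fasta(toy_taxa, "Genus", fasta_path)
fasta_lines <- readLines(fasta_path)
print(fasta_lines)
```

Checking a handful of these records is usually enough to see whether the hits fall on trnL or on something else.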

WP-SMILE commented 2 months ago

> To clarify what I said above: I would not suggest "recalibrating the parameters, which I guess in this case would be reducing minBoot to below 50."

Could you please clarify if this means that the minBoot value of 50 shouldn't be lowered further? Apologies for asking again, I just want to avoid any confusion.

> Large is not the same as comprehensive when it comes to reference databases. There may be taxa that are in your data but not well-represented in the reference database. I am not very familiar with trnL primers, but off-target amplification is another possibility. Using an approach like BLAST-ing some unclassified sequences against nt might shed some light -- do they show up as something other than the trnL gene?

Initially, we used the pre-made rCRUX database and then filtered out species and genera that do not occur in the study area. After BLAST-ing the NA sequences, we found that the plant families and genera identified matched those in GenBank. Importantly, these sequences are also present in our reference database.

All the best, WP

benjjneb commented 2 months ago

> Could you please clarify if this means that the minBoot value of 50 shouldn't be lowered further?

Don't lower minBoot to less than 50.
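Rather than lowering the threshold, the bootstrap values themselves can be inspected, since the call above was run with `outputBootstraps = TRUE` and the result is then a list with a `tax` and a `boot` element. The sketch below uses a toy bootstrap matrix, not values from this dataset. The helper `n_supported` is hypothetical, and it assumes that support at or above the threshold counts as assigned.

```r
# Toy stand-in for the `boot` matrix (bootstrap support out of 100 per rank).
ranks <- c("Family", "Genus")
toy_boot <- matrix(
  c(100, 95,
     90, 62,
     70, 41,
     30, 12),
  ncol = 2, byrow = TRUE,
  dimnames = list(paste0("ASV", 1:4), ranks)
)

# Hypothetical helper: number of ASVs whose support reaches `min_boot` at each rank.
# Assumption: support >= min_boot counts as assigned.
n_supported <- function(boot, min_boot) colSums(boot >= min_boot)

support_table <- rbind(
  minBoot50 = n_supported(toy_boot, 50),
  minBoot80 = n_supported(toy_boot, 80)
)
print(support_table)
```

A table like this shows how many assignments sit between the two commonly used thresholds, without dropping below 50.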

> Initially, we used the pre-made rCRUX database and then filtered out species and genera that do not occur in the study area. After BLAST-ing the NA sequences, we found that the plant families and genera identified matched those in GenBank. Importantly, these sequences are also present in our reference database.

Are the hits you get BLAST-ing against nt for the NA ASVs hitting the trnL gene?

The other way that NA is assigned is if the sequence is not discriminatory enough, that is, the sequence is similar to several taxa at (e.g.) the genus level, which will usually yield an NA assignment at that level. If you are seeing NA results at the more resolved levels but definite assignments at the higher levels (class, order, etc.), then this is likely what is going on.
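The pattern described above (definite assignments at the higher ranks, NA at the more resolved ones) can be checked by finding the deepest assigned rank for each ASV. This is a minimal base-R sketch on a toy matrix. The helper `deepest_assigned_rank` is hypothetical, and it assumes that once a rank is NA all deeper ranks are NA as well.

```r
# Toy stand-in for the taxonomy matrix returned by assignTaxonomy().
ranks <- c("Kingdom", "Class", "Order", "Family", "Genus")
toy_taxa <- matrix(
  c("Viridiplantae", "Magnoliopsida", "Poales",  "Poaceae",  "Oryza",
    "Viridiplantae", "Magnoliopsida", "Fabales", "Fabaceae", NA,
    "Viridiplantae", "Magnoliopsida", NA,        NA,         NA,
    NA,              NA,              NA,        NA,         NA),
  ncol = length(ranks), byrow = TRUE,
  dimnames = list(paste0("ASV", 1:4), ranks)
)

# Hypothetical helper: name of the deepest rank with a non-NA assignment,
# or NA when the ASV is unassigned even at the first rank.
# Assumption: NAs only occur after the last assigned rank.
deepest_assigned_rank <- function(tax) {
  depth <- rowSums(!is.na(tax))
  out <- rep(NA_character_, nrow(tax))
  out[depth > 0] <- colnames(tax)[depth[depth > 0]]
  names(out) <- rownames(tax)
  out
}

deepest <- deepest_assigned_rank(toy_taxa)
print(table(deepest, useNA = "ifany"))
```

ASVs that stop at family or order are candidates for the low-resolution explanation, whereas ASVs that are NA at every rank are better candidates for the BLAST check.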