Remove chimeras, consensus vs. pooled

benalric commented 3 years ago

Dear Benjamin,

I work with time series of metagenomic data from several sites. For each site, I have several runs.

My first question is about which parameter settings to use in the sample inference step and in the chimera removal step. Following the information in discussions #218, #887, #1042, I ran several analyses with several combinations of parameters:

i. default: dada(…, pool = FALSE) and removeBimeraDenovo(…, method = "consensus")
ii. pooled1: dada(…, pool = TRUE) and removeBimeraDenovo(…, method = "pooled", minFoldParentOverAbundance = 2)
iii. pooled2: dada(…, pool = TRUEE) and removeBimeraDenovo(…, method = "pooled", minFoldParanetOVerAbundance = 8)

I performed taxonomic assignment with the naive Bayesian classifier method provided in DADA2.
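For readers who want to reproduce the comparison, the three configurations can be sketched roughly as below, with the parameter names spelled as DADA2 expects. This is only a sketch under assumptions: `run_config` and `configs` are hypothetical names, `derep` and `err` are assumed to come from the usual `derepFastq()` and `learnErrors()` steps (not shown in the post), and merging of paired reads is omitted for brevity.

```r
# Hypothetical description of the three parameter sets from the post.
configs <- list(
  default = list(pool = FALSE, method = "consensus", min_fold = NULL),
  pooled1 = list(pool = TRUE,  method = "pooled",    min_fold = 2),
  pooled2 = list(pool = TRUE,  method = "pooled",    min_fold = 8)
)

# Hypothetical wrapper: denoise, build a sequence table, remove chimeras.
# `derep` and `err` are assumed inputs from derepFastq() and learnErrors().
run_config <- function(derep, err, cfg) {
  dd <- dada2::dada(derep, err = err, pool = cfg$pool)
  seqtab <- dada2::makeSequenceTable(dd)
  args <- list(seqtab, method = cfg$method)
  # Only override minFoldParentOverAbundance when a value was requested;
  # otherwise the DADA2 default for the chosen method is used.
  if (!is.null(cfg$min_fold)) {
    args$minFoldParentOverAbundance <- cfg$min_fold
  }
  do.call(dada2::removeBimeraDenovo, args)
}

# Usage (requires the dada2 package and real inputs):
# results <- lapply(configs, function(cfg) run_config(derep, err, cfg))
```

Running both pooled configurations from one shared `dada(..., pool = TRUE)` result, rather than denoising twice as this wrapper does, would make the two chimera settings directly comparable.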

| method | ASV | read | max_read | domain | species | identity |
|---|---|---|---|---|---|---|
| default | 735 | 3152227 | 107730 | Eukaryota | Heterocapsa_pygmaea | 100 |
| pooled1 | 679 | 3133661 | 107730 | Eukaryota | Heterocapsa_pygmaea | 100 |
| pooled2 | 715 | 2925222 | 107730 | Eukaryota | Heterocapsa_pygmaea | 100 |
| diff_d_p1 | 65 | 27841 | 6308 | Eukaryota | Gyrodinium_spirale | 81 |
| diff_d_p2 | 34 | 75363 | 53793 | Eukaryota | Gyrodinium_helveticum | 81 |
| diff_p1_d | 9 | 15079 | 9954 | Eukaryota | Chaetoceros_sporotruncatus | 81 |
| diff_p1_p2 | 9 | 65993 | 53793 | Eukaryota | Gyrodinium_helveticum | 83 |
| diff_p2_d | 14 | 19488 | 9954 | Eukaryota | Chaetoceros_sporotruncatus | 80 |
| diff_p2_p1 | 45 | 22880 | 6308 | Eukaryota | Gyrodinium_spirale | 87 |

In the table, ASV: number of ASVs, read: number of reads, max_read: number of reads for the most abundant ASV, identity: identity level after taxonomic assignment.

As you can see, we lost ASVs with pooled1 and pooled2 methods compared to default method. But with pooled2 method we lost an abundant ASV (53793 reads) which is present with pooled1 method. I don’t understand why, because in pooled2 method I set the minFoldParentOverAbundance to 8. Do you have any idea?

Based on these results, can we say that pooled1 method will be a good compromise and therefore the parameter set to choose for these analyses.
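The diff rows look like set differences between the ASVs kept by two methods. Assuming, for example, that diff_d_p1 means "ASVs present with default but absent with pooled1" (the post does not define it), such a row could be computed along these lines. The tables and the helper `diff_summary` are toy, hypothetical stand-ins; real DADA2 sequence tables have the same layout (samples in rows, ASV sequences as column names).

```r
# Toy sequence tables: rows are samples, columns are ASV sequences.
st_a <- matrix(c(10, 5, 0, 7, 3, 2), nrow = 2,
               dimnames = list(c("s1", "s2"), c("ACGT", "TTGA", "GGCC")))
st_b <- matrix(c(10, 5, 3, 2), nrow = 2,
               dimnames = list(c("s1", "s2"), c("ACGT", "GGCC")))

# Hypothetical helper: summarise the ASVs found in `x` but not in `y`.
diff_summary <- function(x, y) {
  only_x <- setdiff(colnames(x), colnames(y))
  reads <- colSums(x[, only_x, drop = FALSE])
  data.frame(
    ASV = length(only_x),                             # number of ASVs unique to x
    read = sum(reads),                                # total reads of those ASVs
    max_read = if (length(reads) > 0) max(reads) else 0  # reads of the largest one
  )
}

diff_summary(st_a, st_b)  # ASVs kept in st_a but missing from st_b
diff_summary(st_b, st_a)  # and the other direction
```

Listing `only_x` itself, instead of just counting it, gives the actual sequences that one method kept and the other dropped, which is how the 53793-read ASV could be tracked across methods.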

My second question is related to how to handle several runs of different sites in the chimera removal step. I have multiple runs for one site and multiple sites. Since I then want to compare the sites to each other, I am wondering if I should merge all runs for one site and do the chimera removal step per site or merge all runs for all sites together and do the chimera removal step then. What do you think?

Thank you in advance for your answer.

All the best,

Benjamin

benjjneb commented 3 years ago

> As you can see, we lost ASVs with pooled1 and pooled2 methods compared to default method. But with pooled2 method we lost an abundant ASV (53793 reads) which is present with pooled1 method. I don’t understand why, because in pooled2 method I set the minFoldParentOverAbundance to 8. Do you have any idea?

I don't understand how this could happen. Are pooled1 and pooled2 operating on the same dada(..., pool=TRUE) output? Just a thought, but could there be a typo causing the `minFoldParentOverAbundance` to not be set as expected? I see one in the command in your post in pooled2.
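One way to rule out a misspelled parameter is to compare the argument names against the formals of the function that receives them. Below is a minimal base-R sketch: `check_arg_names` is a hypothetical helper, and `toy_bimera_check` is a toy stand-in so the example runs without DADA2 installed. With DADA2 loaded, the same check could be pointed at `dada2::isBimeraDenovo`.

```r
# Hypothetical helper: stop if any named argument is not a formal of `fun`.
check_arg_names <- function(fun, ...) {
  supplied <- names(list(...))
  unknown <- setdiff(supplied[nzchar(supplied)], names(formals(fun)))
  if (length(unknown) > 0) {
    stop("Unknown argument(s): ", paste(unknown, collapse = ", "))
  }
  invisible(TRUE)
}

# Toy stand-in with a few argument names borrowed from the thread;
# it is NOT the DADA2 implementation.
toy_bimera_check <- function(unqs, minFoldParentOverAbundance = 2,
                             verbose = FALSE) {
  NULL
}

# Correct spelling passes silently.
check_arg_names(toy_bimera_check, minFoldParentOverAbundance = 8)

# The misspelling from the post is reported as an error.
msg <- tryCatch(
  check_arg_names(toy_bimera_check, minFoldParanetOVerAbundance = 8),
  error = function(e) conditionMessage(e)
)
print(msg)
```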

> Based on these results, can we say that pooled1 method will be a good compromise and therefore the parameter set to choose for these analyses.

On these rough numbers, they all look reasonable. I would want to clear up the abundant ASV issue though, and choose one of the methods that is keeping it (probably).

> My second question is related to how to handle several runs of different sites in the chimera removal step. I have multiple runs for one site and multiple sites. Since I then want to compare the sites to each other, I am wondering if I should merge all runs for one site and do the chimera removal step per site or merge all runs for all sites together and do the chimera removal step then. What do you think?

Merge runs first, then do chimera removal after. This keeps chimera removal consistent across runs.
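As a rough sketch of that order of operations, assuming each run has already been denoised into its own sequence table: `merge_then_remove_chimeras` and the file names in the usage comment are hypothetical, while `mergeSequenceTables()` and `removeBimeraDenovo()` are the DADA2 functions.

```r
# Hypothetical wrapper: merge per-run sequence tables, then remove chimeras once.
# `run_tables` is assumed to be a list of sequence tables, one per run.
merge_then_remove_chimeras <- function(run_tables, method = "consensus") {
  # By default mergeSequenceTables() stops on repeated sample names,
  # so sample names are assumed to be unique across runs.
  merged <- dada2::mergeSequenceTables(tables = run_tables)
  dada2::removeBimeraDenovo(merged, method = method)
}

# Usage (requires the dada2 package; file names are placeholders):
# run_tables <- lapply(c("run1_seqtab.rds", "run2_seqtab.rds"), readRDS)
# seqtab_nochim <- merge_then_remove_chimeras(run_tables)
```

Doing the chimera step once on the merged table means every site is filtered with the same parent pool and the same settings, which is what makes the sites comparable afterwards.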

benalric commented 3 years ago

Dear Benjamin,

Thank you very much for your answer. I checked my script: the typo is only in my post here, not in the script. Sorry for that. I ran a BLAST search on the sequence of the abundant ASV lost with pooled2, and it matches 100% with the sequence of the Uncultured marine Gymnodiniaceae clone RA071004T.038 18S ribosomal RNA gene. It seems that this sequence is not a chimera. Therefore, we should choose a method that preserves it. Maybe pooled1 is better since I have many runs from many sites.

All the best,

Benjamin

benalric commented 3 years ago

Sorry, just for information the sequence of the abundant ASV is : AGCTCCAATAGCGTATATTAAAGTTGTTGCGGTTAAAAAGCTCGTAGTTGGATTTCTGCTGAGGACGACCGGTCCGCCCTCCGGGTGAGCATCTGGTTCGGCCTTGGCATCTTCTTGGTGAACGTATCTGCACTTGACTGTGTGGTGCGGTACCCAGGACTTTTACTTTGAGGAAATTAGAGTGTTTCAAGCAGGCATACGCCTTGAATACATTAGCATGGAATAATAAGATAGGACCTCGGTTCTATTTTGTTGGTTTCTAGAGCTGAGGTAATGATTAATAGGGATAGTTGGGGGCATTCGTATTTAACTGTCAGAGGTGAAATTCTTGGATTTGTTAAAGACGGACTACTGCGAAAGCATTTGCCAAGGATGTTTTCA

benjjneb / dada2

Remove chimeras, consensus vs. pooled #1368