benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0

number of reads per sample? #1452

Closed marwa38 closed 2 years ago

marwa38 commented 2 years ago

Hello

Could you please advise whether it is good practice, after running dada2, to subset samples that have many more reads than the other samples before downstream analysis? It seems that more reads mean more features.

At what threshold would you say those samples need to be removed, in terms of read count relative to the other samples?

Cheers Marwa

benjjneb commented 2 years ago

Controlling for library size effects is critical for a number of types of microbiome analysis. There's a pretty large literature on this. Rarefaction (subsampling to a fixed library size across samples) is common. There are other approaches specific to e.g. differential abundance analysis that can be more powerful than rarefying. For some beta-diversity methods, just making sure you convert to proportions can be sufficient.
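To make the two simplest options above concrete, here is a minimal base-R sketch on a made-up count matrix (samples in rows, ASVs in columns). The data and the helper `rarefy_row` are hypothetical and only illustrate the idea. With a phyloseq object, `rarefy_even_depth()` and `transform_sample_counts()` would be the usual entry points instead.

```r
# Toy count matrix (hypothetical data): rows are samples, columns are ASVs.
counts <- matrix(
  c(120,  30,  50,
    600, 150, 250,
     40,  10,   0),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("S1", "S2", "S3"), c("ASV1", "ASV2", "ASV3"))
)

# Library size (sampling depth) of each sample.
depths <- rowSums(counts)

# Option 1: proportions. Divide each row by its own library size.
props <- sweep(counts, 1, depths, "/")

# Option 2: rarefaction. Subsample every sample, without replacement,
# down to one fixed depth (here the smallest library size).
rarefy_row <- function(x, depth) {
  reads <- rep(seq_along(x), times = x)           # one entry per read
  picked <- reads[sample.int(length(reads), depth)]
  tabulate(picked, nbins = length(x))             # back to per-ASV counts
}

set.seed(1)  # rarefaction is random; fix the seed for reproducibility
target <- min(depths)
rarefied <- t(apply(counts, 1, rarefy_row, depth = target))
colnames(rarefied) <- colnames(counts)

rowSums(props)     # every sample now sums to 1
rowSums(rarefied)  # every sample now has `target` reads
```

Rarefying discards reads from the deeper samples, which is one reason the method-specific approaches mentioned above can be more powerful.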

> At what threshold would you say those samples need to be removed?

Plot a histogram of your sampling depths. Commonly, most of the samples will have a comparable number of reads, while a few will have far fewer reads (libraries that didn't form well). Choose your threshold accordingly. Absolute numbers are less important as long as you aren't getting super low numbers (e.g. < 1000).
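A rough sketch of that workflow in base R, using invented read counts. The `min_depth` value is an assumption chosen by eye from the histogram, not a recommended constant. With a phyloseq object `ps`, the depths would typically come from `sample_sums(ps)` and the filtering could be done with `prune_samples(sample_sums(ps) >= min_depth, ps)`.

```r
# Hypothetical per-sample read counts after the dada2 pipeline.
depths <- c(S1 = 41200, S2 = 38950, S3 = 45010,
            S4 = 36700, S5 = 620,   S6 = 43300)

# Histogram of sampling depths: look for a gap between the main
# cluster of samples and the few failed libraries.
hist(depths, breaks = 20,
     main = "Sampling depth per sample",
     xlab = "Reads per sample")

# Threshold picked from the plot (assumption for this toy data).
min_depth <- 1000

keep    <- depths >= min_depth
kept    <- depths[keep]           # samples retained for analysis
dropped <- names(depths)[!keep]   # samples to remove

dropped
```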

marwa38 commented 2 years ago

Many thanks @benjjneb

> For some beta-diversity methods, just making sure you convert to proportions can be sufficient.

By proportions, do you mean relative abundance data? `ps.ra <- transform_sample_counts(ps, function(ASV) ASV/sum(ASV))`

benjjneb commented 2 years ago

The function you describe is creating proportions, which are one specific scaling of relative abundance data.
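As a small illustration of that distinction, the base-R sketch below scales one hypothetical sample two ways. Both results are relative abundances; they differ only in the constant they sum to.

```r
# Counts for a single hypothetical sample.
asv <- c(ASV1 = 120, ASV2 = 30, ASV3 = 50)

proportions <- asv / sum(asv)        # relative abundance scaled to sum to 1
percentages <- 100 * asv / sum(asv)  # the same data scaled to sum to 100

sum(proportions)
sum(percentages)
```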