benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0
466 stars 142 forks source link

mergeSequenceTables library normalization #1841

Closed nikoladaskova closed 8 months ago

nikoladaskova commented 11 months ago

Hello Ben,

I have two questions about my latest sequencing project.

Firstly, I would like to compare the relative abundance from two different studies - one study is the American Gut Project, which focused on the V4 region of 16S rRNA, they did 150 single-end sequencing, and the other study is mine, which also focused on the V4 region, but we did paired-end sequencing. We used the same primers. My question is, if I merge the ASV tables using the mergeSequenceTables function (where all reads must start and end at the same position) and assign a taxonomy, can I directly compare the relative abundances of certain taxa (proportions) without further normalization (both studies have different library preparation strategies, so I'm not sure if I need to do any normalization before comparing relative abundances)?

I also have another question - the company doing the sequencing has run 150 paired-end sequences on NovaSeq, but I'm a little concerned about the merge. The V4 region is approximately 254 bp long (but there is biological variation in length). I have received the FASTQ read files including primers. This means that 151 (forward) +151 (reverse) = 302 bases - 19 (forward primer) - 20 (reverse primer) leaves me with only 9 bases to merge. However, when using the minOverlap parameter in the mergePairs function it worked (I got the same merge results when using 9, 7 and 5 nts of overlap). However, should I trim at least 1 base at the end of each read? The Q-scores look good, but I thought it's always good to trim at least one base at the end of each read. However, I'm not sure if this only applies when sequencing MiSeq. In your opinion, would you prefer to use 250 paired-end sequencing to be safe? Unfortunately, the cost would obviously be much higher. Also, when using NovaSeq with 2-dye chemistry, there is a base calling issue and the error model needs to be adjusted due to the grouped Q-scores. I also used the mock community from Zymo in my study and the taxonomy fits, the expected proportions of each taxon in the mock sample not so much.

Note: I recently saw a paper where they compared Shannon entropies across the entire 16S rRNA gene, and it seems I may be a bit relaxed about the merging because there is a more conservative area in the merging region of V4 (using primers 515F and 806R), so my short overlap shouldn't mess up the output, right?

https://www.researchgate.net/publication/337058304_Evaluation_of_16S_rRNA_gene_sequencing_for_species_and_strain-level_microbiome_analysis/figures?lo=1

Thank you very much, Nikola

benjjneb commented 10 months ago

My question is, if I merge the ASV tables using the mergeSequenceTables function (where all reads must start and end at the same position) and assign a taxonomy, can I directly compare the relative abundances of certain taxa (proportions) without further normalization (both studies have different library preparation strategies, so I'm not sure if I need to do any normalization before comparing relative abundances)?

Directly integrating studies that use different methodologies has issues. There are probably different biases between them in which taxa are detected better or worse. There may be even bigger differences in study design or sample recruitment.

We have a lot say about this in our paper about metagenomic bias and in our paper looking at a meta-analysis of the vaginal microbiome in preterm birth.

This means that 151 (forward) +151 (reverse) = 302 bases - 19 (forward primer) - 20 (reverse primer) leaves me with only 9 bases to merge.

Are the primers sequenced? The most common V4 library preperations do not sequence the primers, thus have more than adequate overlap between 2x150 reads. You can determine if the primers were sequenced by inspecting your raw fastq files.