Closed otaviolovison closed 4 months ago
The expected amplicon lengths are based on a specific V3V4 primer set and library setup (the "Illumina" V3V4 protocol), before trimming. Depending on your specific V3V4 primers, this may vary somewhat.
It's not clear to me how you chose 27/31 for your primer lengths. Are those the lengths of the sequenced primers (plus sequenced padding) at the start of the forward and reverse reads, respectively?
And the size of the overlap is variable. 12 is a minimum value that mergePairs requires by default to merge the reads together. But any overlap bigger than that will also pass mergePairs.
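To make the point about variable overlap concrete, here is a minimal arithmetic sketch. It assumes the truncation lengths from the question (280 and 230) and an untrimmed amplicon of roughly 440 to 460 nt; the function name `merged_overlap` is hypothetical and not part of dada2. Since primer trimming removes bases from the outer ends of the amplicon, it should not change the overlap in the middle.

```python
def merged_overlap(trunc_len_fwd, trunc_len_rev, amplicon_len):
    """Overlap (nt) between a forward and a reverse read that together span
    an amplicon, all measured before primer trimming.

    Assumes the amplicon is longer than either read, so that
    merged length = forward length + reverse length - overlap.
    """
    return trunc_len_fwd + trunc_len_rev - amplicon_len


MIN_OVERLAP = 12  # default minimum overlap for mergePairs, as stated in this thread

# Assumed values: truncation at 280/230 and a ~440-460 nt untrimmed amplicon.
for amplicon_len in (440, 460):
    overlap = merged_overlap(280, 230, amplicon_len)
    print(amplicon_len, overlap, overlap >= MIN_OVERLAP)
```

Under these assumptions the overlap comes out at 50 to 70 nt, well above the minimum of 12, so the 12 should not be subtracted as a fixed amount when estimating the merged length.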
My guess is that you are using the "Illumina" V3V4 setup, which starts at ~440-460 before trimming. Then you have trimmed off ~60 nts by trimLeft and so have ~380-400 post-trimming. If that is true, I'd just update this to use the correct primer lengths as trimLeft parameters (I think it is c(17,21)?).
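The estimate above can be checked with a short sketch. It assumes an untrimmed amplicon range of 440 to 460 nt and compares the trimLeft values from the question (27, 31) with the tentatively suggested primer lengths (17, 21); the helper `post_trim_range` is a hypothetical name used only for illustration.

```python
def post_trim_range(amplicon_range, trim_left):
    """Expected merged-length range after removing trim_left bases from the
    start of the forward and reverse reads (the two ends of the amplicon)."""
    low, high = amplicon_range
    removed = sum(trim_left)
    return (low - removed, high - removed)


AMPLICON_RANGE = (440, 460)  # assumed untrimmed V3V4 amplicon range, in nt

# trimLeft as used in the question: primer length + 10 on each read.
print(post_trim_range(AMPLICON_RANGE, (27, 31)))  # (382, 402)

# trimLeft set to the suggested primer lengths only (a tentative guess in the thread).
print(post_trim_range(AMPLICON_RANGE, (17, 21)))  # (402, 422)
```

The first range is close to the observed 381 to 409 bp, which is consistent with the extra 20 nt of trimming accounting for the shorter sequences.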
Yes, you are right!
About the 'extra 10' trimming with primers: in your paper 'Bioconductor Workflow for Microbiome Data Analysis: from raw reads to community analyses' you say: "We also choose to trim the first 10 nucleotides of each read based on empirical observations across many Illumina datasets that these base positions are particularly likely to contain pathological errors."
That's why I am trimming an extra 10 bases.
We also choose to trim the first 10 nucleotides of each read based on empirical observations across many Illumina datasets that these base positions are particularly likely to contain pathological errors.
This is no longer a recommendation. I would advise against trimming the extra 10 nts here.
Ok, thanks!
Hello!
I am having difficulty reconciling the expected and the actual final amplicon size (after preprocessing). I have read some forum posts in which @benjjneb suggests that the V3V4 amplicon size should be between 444 and 464 bp. Is that the final size after preprocessing? Let's perform some calculations:
Sequencing was performed with a v3 kit, so the reads are 300 + 300 nt. Based on my quality profiles I truncated the reads at 280 and 230, which gives 510 nt, a very good margin for merging. Removing primers (27 and 31: primer length + 10, as suggested) I lose 58 bp, and I lose another 12 bp for the overlap, so that should give a 440 bp amplicon.
In practice, my sequences range from 381 to 409 bp. So I believe (please correct me if I'm wrong) that the 444-464 bp amplicon size mentioned is the theoretical expected size, without preprocessing. When I remove 58 bp for primers and 12 bp for the overlap, I get a sequence size of ~394 bp, which falls within my range (381 to 409).
Please, tell me if I am doing something wrong here. Thanks.