benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0

Most appropriate Figaro maxExpectedError values? #1016

Closed vrbacky closed 4 years ago

vrbacky commented 4 years ago

Which maxExpectedError combination provided by Figaro is recommended for DADA2? Should I use the best [2, 3] result provided by Figaro, or do you recommend sticking with the best [2, 2] result, as in the DADA2 tutorial (maxEE=c(2,2))? Thanks.

benjjneb commented 4 years ago

I'm hesitant to make any strong claims about a single best parameter value. Remember that defaults are chosen to be reasonable for a wide range of data, but they are never perfectly optimized for any one dataset.

That said, relaxing maxEE to c(2,3) is not a drastic change from the defaults, and would be generally reasonable.
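
For readers unsure what the two numbers control, here is a minimal Python sketch of an expected-error filter on a read pair. It assumes the usual definition EE = sum(10^(-Q/10)) over a read's Phred scores; the function names and the example quality values are made up for illustration and are not DADA2 code.

```python
def expected_errors(quals):
    """Expected number of errors in a read: sum of 10^(-Q/10) over its Phred scores."""
    return sum(10 ** (-q / 10) for q in quals)


def passes_max_ee(fwd_quals, rev_quals, max_ee=(2, 2)):
    """Keep a read pair only if neither read exceeds its own maxEE limit."""
    return (expected_errors(fwd_quals) <= max_ee[0]
            and expected_errors(rev_quals) <= max_ee[1])


# Hypothetical pair: a clean forward read and a noisier reverse read.
fwd = [35] * 240                 # EE is roughly 0.08
rev = [30] * 200 + [15] * 75     # EE is roughly 2.57

print(passes_max_ee(fwd, rev, max_ee=(2, 2)))  # reverse read is above 2
print(passes_max_ee(fwd, rev, max_ee=(2, 3)))  # reverse read fits under 3
```

In this toy case the only pairs rescued by moving from (2, 2) to (2, 3) are those whose reverse read has an expected error between 2 and 3, which is why the change is a modest one.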

michael-weinstein commented 4 years ago

The maxEEs provided by FIGARO are based on modeling the error accumulation of your specific read data. I agree with @benjjneb that it's extremely difficult to say that there is an objectively "right" set of parameters, but FIGARO is there to help find optimal parameters that will balance read retention rates against a desire for high quality reads while suggesting truncation sites that should allow for good rates of both read retention and quality optimized for your specific data.
In my experience, it is more important to monitor your FIGARO-recommended trimming sites and maxEE values from run to run than to aim to match someone else's parameters, and the parameters shouldn't be expected to match for different amplicons. If you see a major change in the parameters FIGARO returns for the same amplicon, it would be a good idea to check for library prep issues. If you start repeatedly seeing a change to the parameters you're used to for a given amplicon, or a distinct trend forming, check for potential faults in your sequencer (I speak from experience on this one, as the program detected a fault in the MiSeq that required Illumina service before the humans did). If you are running FIGARO with the default quality cutoff, it should be aiming to throw out any read that is one standard deviation or more worse than the average read at the selected truncation position in terms of expected error, which seems to be a pretty reasonable position (and will generally result in 70-80% of your reads passing the filterAndTrim expected error filtering step).
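
The "one standard deviation worse than average" rule described above can be pictured with a short Python sketch. This only illustrates the rule as worded in this comment; it is not FIGARO's actual code, and the function names and the per-read expected-error values are invented for the example.

```python
from statistics import mean, pstdev


def ee_cutoff(ee_values, n_sd=1.0):
    """Cutoff placed n_sd standard deviations above the mean expected error."""
    return mean(ee_values) + n_sd * pstdev(ee_values)


def retention(ee_values, cutoff):
    """Fraction of reads whose expected error is strictly below the cutoff."""
    kept = [ee for ee in ee_values if ee < cutoff]
    return len(kept) / len(ee_values)


# Hypothetical per-read expected errors at one truncation position.
ee_values = [0.4, 0.6, 0.8, 1.0, 1.2, 1.5, 1.9, 2.4, 3.1, 4.5]

cutoff = ee_cutoff(ee_values)
print(f"cutoff={cutoff:.2f} retention={retention(ee_values, cutoff):.0%}")
```

Because expected-error distributions tend to have a long right tail, a cutoff of this kind removes the worst reads while keeping most of the run.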

vrbacky commented 4 years ago

Thank you very much for your perfect answers. Also, my apologies for the slow reply; the last few months have been a little overwhelming. One more quick question: my wet lab colleagues prefer 2x251 sequencing. I use Trimmomatic to trim all paired-end reads to the same length (250 bp) because many sequences are missing one base (length = 250 bp). Is that OK? The retention rate falls sharply when I use the full 251 bp (from ~99% to ~60%, and sometimes below 15%). Thank you.

michael-weinstein commented 4 years ago

Generally, it's a good idea to drop that last n+1 base. Check out this page for a discussion on it. That last one often has a very high error rate, and since the expected error (EE) value is calculated for the whole read, retaining that last base will significantly increase the EE value, which will cause many more reads to be filtered out. Figaro accounts for this kind of thing, as it bases cutting choices off the average expected error of reads and gives an optimized cutting strategy for a DADA2 pipeline, so an additional trimmer should only be needed if your FASTQ file has reads of varying length in order to get them all back to one length. Otherwise, trimming before Figaro is likely an unnecessary step.
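
To make the effect of that last base concrete, here is a small Python sketch, again assuming EE = sum(10^(-Q/10)). The quality values are hypothetical and chosen so that a single poor final base pushes an otherwise acceptable read over maxEE = 2.

```python
def expected_errors(quals):
    """Expected number of errors in a read: sum of 10^(-Q/10) over its Phred scores."""
    return sum(10 ** (-q / 10) for q in quals)


MAX_EE = 2  # threshold used in the tutorial's maxEE=c(2,2)

# Hypothetical 251-base read: 250 ordinary bases plus a very poor final base (Q2).
read_251 = [30] * 150 + [18] * 100 + [2]

ee_cropped = expected_errors(read_251[:250])  # last base dropped
ee_full = expected_errors(read_251)           # last base kept

print(f"250 bp: EE={ee_cropped:.2f}, passes={ee_cropped <= MAX_EE}")
print(f"251 bp: EE={ee_full:.2f}, passes={ee_full <= MAX_EE}")
```

A Q2 base alone contributes about 0.63 expected errors, which is a large share of a budget of 2.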

Also, just out of curiosity, where are you located and what are you studying?

vrbacky commented 4 years ago

Thanks for your answer. I decided to trim sequences to 250 bases using Trimmomatic (CROP:250 MINLEN:250). I don't lose too many reads (up to 3%) and my retention rate calculated by FIGARO is about 80% (confirmed by real DADA2 filtering results). It seems to be OK with me.
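
For anyone reading later, this is roughly what the CROP:250 MINLEN:250 combination does to each read, written as a Python sketch. It mimics the two steps in order (cut to at most 250 bases from the start, then drop anything shorter than 250); it is not Trimmomatic's code, and the function name and toy records are made up.

```python
def crop_and_filter(records, crop=250, minlen=250):
    """Yield (sequence, quality) pairs cropped to `crop` bases,
    skipping reads that end up shorter than `minlen`."""
    for seq, qual in records:
        seq, qual = seq[:crop], qual[:crop]   # CROP: keep the first `crop` bases
        if len(seq) >= minlen:                # MINLEN: drop reads that are too short
            yield seq, qual


# Toy records of length 251, 250 and 249 (hypothetical data).
records = [("A" * n, "I" * n) for n in (251, 250, 249)]

kept = list(crop_and_filter(records))
print([len(seq) for seq, _ in kept])
```

After this step every surviving read has the same length, and only reads that were already shorter than 250 bases are lost.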

BTW, I work in the Czech Republic studying the human microbiome.