benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0

dada2 noisy singletons #1663

Closed Andreas-Bio closed 1 month ago

Andreas-Bio commented 1 year ago

I have five plant nrITS amplicon sequence files (~650 bp sequence length) from five plant leaves (PacBio 0.999 CCS reads), 1189 reads combined, and 95% of them are singletons.

The sequence quality scores are almost all maxed out.

DADA2 has trouble distinguishing the major variants from random noise/indels/polymerase errors (perhaps because of that?).

I get only 1 ASV from that file, even with OMEGA_A=0.99 (the second ASV seen below is random contamination with a completely different sequence).

If I use the DETECT_SINGLETONS=TRUE option, it creates ASVs from random noisy errors found in only one of the sequences.

I can clearly see some patterns (see the middle part of the image: where there are two Cs instead of a T, the chance of the A being missing is very high). This pattern was amplified from all 5 leaves in independent PCR reactions, so I am confident it is real.

DADA2 fails to detect the pattern (a substitution present in >10% of the sequences) and instead weighs it equally to the noisy SNPs in the rest of the sequence.

Is there any parameter I can change so that random SNPs (with just one occurrence in the file) are more easily "corrected", but not positions that have around 10% substitutions? Would you recommend running DADA2 on pre-CCS files?

Is there any way to apply the HOMOPOLYMER_GAP_PENALTY only to runs of 5 or more consecutive bases? It is too sensitive for me.

[image: sequence alignment screenshot]


dada(temp_derep, err = err1, BAND_SIZE = 32, multithread = TRUE, verbose = 2, OMEGA_A = 0.99, OMEGA_C = 1e-10,
     HOMOPOLYMER_GAP_PENALTY = 0, DETECT_SINGLETONS = FALSE, GAP_PENALTY = -6, pool = TRUE)

7 samples were pooled: 1189 reads in 1136 unique sequences.
, Division (naive): Raw 1 from Bi 0, pA=1.22e-01
New Cluster C1:SS, No Division. Minimum pA=1.26e+01 (Raw 2 w/ 2 reads in Bi 0).
ALIGN: 2272 aligns, 0 shrouded (1136 raw).
$`MHPAF1172-11.fastq.gz`
dada-class: object describing DADA2 denoising results
1 sequence variants were inferred from 80 input unique sequences.
Key parameters: OMEGA_A = 0.99, OMEGA_C = 0.0000000001, BAND_SIZE = 32

$`MHPAF1173-11.fastq.gz`
dada-class: object describing DADA2 denoising results
2 sequence variants were inferred from 120 input unique sequences.
Key parameters: OMEGA_A = 0.99, OMEGA_C = 0.0000000001, BAND_SIZE = 32

$`MHPAF1174-11.fastq.gz`
dada-class: object describing DADA2 denoising results
1 sequence variants were inferred from 4 input unique sequences.
Key parameters: OMEGA_A = 0.99, OMEGA_C = 0.0000000001, BAND_SIZE = 32

$`MHPAF857-11.fastq.gz`
dada-class: object describing DADA2 denoising results
2 sequence variants were inferred from 265 input unique sequences.
Key parameters: OMEGA_A = 0.99, OMEGA_C = 0.0000000001, BAND_SIZE = 32

$`MHPAF861-11.fastq.gz`
dada-class: object describing DADA2 denoising results
2 sequence variants were inferred from 264 input unique sequences.
Key parameters: OMEGA_A = 0.99, OMEGA_C = 0.0000000001, BAND_SIZE = 32

$`MHPAF862-11.fastq.gz`
dada-class: object describing DADA2 denoising results
2 sequence variants were inferred from 190 input unique sequences.
Key parameters: OMEGA_A = 0.99, OMEGA_C = 0.0000000001, BAND_SIZE = 32

$`MHPAF866-11.fastq.gz`
dada-class: object describing DADA2 denoising results
1 sequence variants were inferred from 222 input unique sequences.
Key parameters: OMEGA_A = 0.99, OMEGA_C = 0.0000000001, BAND_SIZE = 32

testcase2_align.zip

EDIT: I downgraded the CCS to 0.99 by-strand and set MIN_ABUNDANCE=3, and it is worlds different. It actually works now. I guess if (error introduced before sequencing) > (error in the sequencing reaction), the noise either slips by undetected or everything gets squashed down, because the noise that is present has no quality scores associated with it.
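
For reference, a minimal sketch of how MIN_ABUNDANCE can be passed to dada() via its ... options; temp_derep and err1 are just the object names from the earlier call, and the other settings are placeholders:

# Sketch: require every inferred ASV to be supported by at least 3 reads.
# temp_derep: derep object for the 0.99 by-strand reads; err1: the learned error model.
dd <- dada(temp_derep, err = err1, BAND_SIZE = 32, multithread = TRUE,
           MIN_ABUNDANCE = 3, pool = TRUE)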

Still, some things don't make sense to me. [image: screenshot of the ASV in question] This ASV gets just one sequence assigned, but the substitution it represents is present in more than 5% of the sequences. MHPAF1173-11.zip

benjjneb commented 1 year ago

It's definitely not clear to me what is going on here, other than that things aren't working as expected in the original CCS 0.999 data. That said, some of the parameter choices you are trying here are extreme and almost certainly not a good idea. PacBio CCS does not make homopolymer errors (at a higher rate than normal indels), so HOMOPOLYMER_GAP_PENALTY shouldn't be used at all. OMEGA_A=0.99 is so extreme that I can't imagine a situation where it would be appropriate.
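
For context, a minimal sketch of a more typical PacBio CCS invocation (along the lines of the dada2 PacBio workflow), with the default OMEGA_A and no homopolymer gap penalty; the filenames are placeholders from this thread and the code assumes a dada2 version that exports PacBioErrfun:

library(dada2)

# Placeholder filenames; in practice these would be primer-removed, quality-filtered CCS reads.
filts <- c("MHPAF1172-11.fastq.gz", "MHPAF1173-11.fastq.gz")

drp <- derepFastq(filts, verbose = TRUE)

# PacBioErrfun bins the very high CCS quality scores when fitting the error model.
err <- learnErrors(drp, errorEstimationFunction = PacBioErrfun,
                   BAND_SIZE = 32, multithread = TRUE)

# Defaults: OMEGA_A = 1e-40, HOMOPOLYMER_GAP_PENALTY unset.
dd <- dada(drp, err = err, BAND_SIZE = 32, multithread = TRUE, pool = TRUE)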

I guess if (error introduced before sequencing) > (error in the sequencing reaction), the noise either slips by undetected or everything gets squashed down, because the noise that is present has no quality scores associated with it.

There could be some truth here? Fundamentally, DADA2 is modeling the error process (PCR+sequencing). If "errors" are introduced outside of that process (e.g. large numbers of "contaminants"), such that the majority of errors are coming from those non-modeled processes, then DADA2's algorithm could essentially fail. Maybe that's what is going on here?
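
One way to sanity-check whether the fitted error model matches the observed errors is the standard dada2 diagnostic plot; err1 here is assumed to be the error model from the earlier call:

library(dada2)

# Observed error rates (points) vs. the fitted model (black line) and the
# nominal Q-score expectation (red line), for each transition type.
plotErrors(err1, nominalQ = TRUE)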

In principle, I would be a bit concerned that you are seeing >95% singleton reads from a ~650 nt amplicon sequenced with CCS (although, what instrument/chemistry generated these CCS reads?). There is a rule of thumb that when <10% of reads are duplicates of other reads, DADA2 might not be the right algorithmic choice, although usually that is because it indicates the underlying sequencing is too noisy.
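
A quick way to check that rule of thumb on one of the attached files (a sketch, interpreting it as the fraction of reads whose exact sequence occurs more than once):

library(dada2)

drp <- derepFastq("MHPAF1173-11.fastq.gz")   # one of the files from this thread
ab  <- drp$uniques                           # abundance of each unique sequence

# Fraction of reads that are duplicates of at least one other read.
sum(ab[ab > 1]) / sum(ab)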