benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0

DADA2 with AmpliSeq data #1702

Closed ankit4035 closed 1 month ago

ankit4035 commented 1 year ago

Hi Ben,

I am a big fan of DADA2 for analysing 16S data. I recently worked with paired-end AmpliSeq data generated on Illumina. In a nutshell, multiple targets (>800 AMR genes) were amplified and sequenced. I thought it would be interesting to analyse that data with DADA2 to find the sequence variants for each target. However, two properties of AmpliSeq data prevent it from being run directly through the DADA2 pipeline:

  1. During library prep, primers are partially digested, therefore for the same target there could be multiple sequences with multiple lengths of primers sequenced and some having no primers as well.
  2. The sequences have no fixed orientation, because the indices are ligated rather than added by amplification as in the 16S protocol.

I worked out a pipeline for analysing such data, starting from paired fastq files. The data were quality-filtered and the Illumina adapters were trimmed (cutadapt).

  1. Learn the error rate for R1 and R2 separately.
  2. Sample inference for R1 and R2.
  3. Merge pairs and generate sequence tables.
  4. Merge the sequence table with itself using tryRC = TRUE. This merges variants that are identical but in reverse orientation. Since the table is merged with itself, all counts will be doubled, which can be rectified downstream.
  5. Collapse the table with collapseNoMismatch. This collapses variants that originate from the same target but differ in length, one being a subsequence of the other. This is just a preventive measure; I don't know if this is going to be useful at all.
  6. Chimera removal. I don't know if this will work well as well.
  7. Taxonomy assignment using custom database.
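
The intent of step 4 can be illustrated with a minimal Python sketch. This is not DADA2 code and not how tryRC is implemented: the helper names (`reverse_complement`, `canonical`, `merge_orientations`) and the toy sequences are hypothetical, and the lexicographically smaller strand is chosen as the representative purely for determinism. It only shows the idea of summing counts for variants that are reverse complements of each other; because each count is added once, nothing is doubled here.

```python
from collections import Counter

# Translation table for complementing DNA bases (N stays N).
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def canonical(seq: str) -> str:
    """Pick one representative orientation for a sequence.

    Assumption for this sketch: the lexicographically smaller of the
    sequence and its reverse complement is used as the representative.
    """
    return min(seq, reverse_complement(seq))


def merge_orientations(counts: dict[str, int]) -> dict[str, int]:
    """Sum the counts of variants that differ only in orientation."""
    merged: Counter = Counter()
    for seq, n in counts.items():
        merged[canonical(seq)] += n
    return dict(merged)


# Toy table for one sample: "ACGTT" is the reverse complement of "AACGT".
toy_counts = {"AACGT": 7, "ACGTT": 3, "GGGCA": 5}
merged_counts = merge_orientations(toy_counts)
print(merged_counts)
```

In the toy table the two orientations of the first variant end up as a single entry, while the unrelated third sequence is left unchanged.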

Can you comment on the suitability of this pipeline flow?

Thank you

benjjneb commented 1 year ago

I can only offer vague feedback as I haven't tried to use DADA2 on this type of data. My largest concern is with this feature of your data:

> During library prep, primers are partially digested, therefore for the same target there could be multiple sequences with multiple lengths of primers sequenced and some having no primers as well.

DADA2 is sensitive to differences in the start positions of the reads, and seems likely to generate multiple (perhaps many) ASVs corresponding to the same allele in the situation you've described. This also reduces sensitivity, since one AMR allele that is read 20 times, but with 20 different starting positions, is likely to be missed altogether. There probably is a way to mostly solve this, but I would consider it a major concern, and it would make me strongly consider an alternative bioinformatics approach that first maps reads to the known suite of AMR genes in order to separate them, and then proceeds from there.
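
The sensitivity concern can be made concrete with a toy Python example. Everything in it is made up for illustration: the `ALLELE` and `PRIMER` sequences, the `simulate_reads` and `dereplicate` helpers, and the shortcut of trimming by known allele length (with real data, the common start would more plausibly come from mapping reads to reference genes). It shows that 20 reads of the same allele with 20 different primer remnants dereplicate into 20 singletons, whereas anchoring them to a common start yields one sequence with abundance 20.

```python
from collections import Counter

# Made-up sequences, used only for this illustration.
ALLELE = "ATGGCTAGCTAGGATCCGATCGATTACGCTAGCTAGGCTAACGTTAGC"
PRIMER = "GTACCGTTAGCATCGGATCA"  # 20 nt, partially digested in the reads


def simulate_reads() -> list[str]:
    """One read per primer remnant length: 20 reads, 20 start positions."""
    return [PRIMER[k:] + ALLELE for k in range(len(PRIMER))]


def dereplicate(reads: list[str]) -> Counter:
    """Count identical reads, as a dereplication step would."""
    return Counter(reads)


reads = simulate_reads()

# Without a common start, every read is its own unique sequence.
raw_uniques = dereplicate(reads)

# Toy anchoring: keep only the last len(ALLELE) bases of each read.
anchored_uniques = dereplicate([r[-len(ALLELE):] for r in reads])

print(len(raw_uniques), max(raw_uniques.values()))
print(len(anchored_uniques), anchored_uniques[ALLELE])
```

Twenty singletons carry no abundance signal for a denoiser to work with, which is the loss of sensitivity described above.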

In principle, collapseNoMismatch will help with that issue, but doesn't solve the loss of sensitivity and is both slow and probably not perfect.
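
As a rough illustration of what such collapsing does, here is a simplified Python sketch. It is not the collapseNoMismatch implementation: the function name `collapse_contained` and the toy sequences are hypothetical, and it only handles the case where one sequence is fully contained in another, not sequences that merely overlap with a shift. The comparison of each sequence against every representative kept so far also hints at why this kind of collapsing gets slow on large tables.

```python
def collapse_contained(counts: dict[str, int]) -> dict[str, int]:
    """Fold each sequence into the first more abundant sequence that
    contains it or is contained in it (exact match, no mismatches)."""
    collapsed: dict[str, int] = {}
    # Visit the most abundant sequences first; ties are broken by
    # length and then alphabetically so the result is deterministic.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], -len(kv[0]), kv[0]))
    for seq, n in ordered:
        for rep in collapsed:
            if seq in rep or rep in seq:
                collapsed[rep] += n  # absorb into the existing representative
                break
        else:
            collapsed[seq] = n  # no match found: becomes a new representative
    return collapsed


# Toy table: two shorter variants are substrings of the most abundant one.
toy_counts = {
    "TTACGGATCCA": 12,
    "ACGGATCC": 5,
    "CGGATCCA": 2,
    "GGGTTTAAACC": 4,
}
collapsed_counts = collapse_contained(toy_counts)
print(collapsed_counts)
```

Note that collapsing happens after inference, so it can only combine variants that were already detected; it cannot recover an allele whose reads were too fragmented to be called in the first place.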

> Chimera removal. I don't know if this will work well as well.

The above starting-point variation issues could also affect chimera identification, similarly to how unremoved primers cause way too many reads to be removed as chimeras in standard 1-target amplicon sequencing data.

ankit4035 commented 1 year ago

Thanks a lot for your comments and suggestions. I will explore the possibilities and see how it goes.