benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0

Reads Loss during DADA2 run #2027

Open mentorwan opened 1 week ago

mentorwan commented 1 week ago

We ran a full-length PacBio DADA2 analysis and noticed some minor read loss during the process. For example, in one sample, stats.tsv shows 24,049 non-chimeric reads, but the DADA2-generated biom file, qzv file, and taxonomy table show only 24,025 reads, a loss of 24 reads.

I had assumed the number of reads in the output would match the number of non-chimeric reads after QC. The loss is minimal, and when I checked other samples, some show no loss while others lose only a few reads.

Maybe it’s not a significant issue. Could you clarify our understanding or provide any related information we might be missing? Thanks.

benjjneb commented 1 week ago

> For example, in one sample, stats.tsv shows 24,049 non-chimera reads, but the DADA2-generated biom file or qzv file or taxonomy table shows only 24,025 reads

Can you clarify what workflow you are using and how these different numbers are being generated?

mentorwan commented 1 week ago

We use the HiFi full-length 16S workflow: https://github.com/PacificBiosciences/HiFi-16S-workflow

The numbers come from the output of this pipeline. Here is the row in stats.tsv for this sample:

| sample-id | input | filtered | denoised | non-chimeric | percentage of input non-chimeric |
| -- | -- | -- | -- | -- | -- |
| SC830317 | 39431 | 24600 | 24132 | 24049 | 60.99 |

But in the DADA2_table.qzv file, only 24,025 reads are assigned to this sample, a difference of 24 reads.
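
To check every sample for the same kind of gap, the comparison can be scripted. The following is a minimal sketch, not part of the pipeline: it assumes stats.tsv is plain tab-separated text with the column names shown above (any extra comment or type lines in the real file would need to be skipped first), and that the per-sample totals from DADA2_table.qzv have been copied by hand into a dictionary. The helper names `read_non_chimeric` and `find_discrepancies` are hypothetical.

```python
import csv
import io


def read_non_chimeric(stats_text):
    """Map sample-id -> non-chimeric read count from tab-separated stats text."""
    reader = csv.DictReader(io.StringIO(stats_text), delimiter="\t")
    return {row["sample-id"]: int(row["non-chimeric"]) for row in reader}


def find_discrepancies(stats_counts, table_totals):
    """Return {sample: stats count minus table total} where the two differ."""
    return {
        sample: count - table_totals[sample]
        for sample, count in stats_counts.items()
        if sample in table_totals and count != table_totals[sample]
    }


# The stats.tsv row reported above, inlined here for illustration.
stats_text = (
    "sample-id\tinput\tfiltered\tdenoised\tnon-chimeric\t"
    "percentage of input non-chimeric\n"
    "SC830317\t39431\t24600\t24132\t24049\t60.99\n"
)

# Assumption: per-sample total read off DADA2_table.qzv by hand.
table_totals = {"SC830317": 24025}

stats_counts = read_non_chimeric(stats_text)
discrepancies = find_discrepancies(stats_counts, table_totals)
print(discrepancies)  # {'SC830317': 24}
```

This only reports how large the gap is for each sample; it does not say which step after chimera removal drops the reads.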