benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0

PacBio 16S read lengths and counts query #1395

Closed by laurencmartin 5 months ago

laurencmartin commented 3 years ago

Hello @benjjneb,

I was wondering if you could advise on the attached histogram, which I created using the script from your DADA2 + PacBio: Fecal Samples tutorial.

[Image: Read length distribution of r210414_Cell5_Data]

We have sent 40 samples (vaginal and infant stool samples, as well as their controls) to a service provider for 16S rRNA sequencing on the PacBio Sequel II instrument. The instrument is relatively new and we have doubts about their sequencing capability and the quality of the data being returned to us. The data used to produce the histogram is from only the 14 samples they have managed to return to us so far.

We have never worked with PacBio 16S data before and we are concerned by the number of reads below 500 bp. Is this normal, or are these primer/adapter dimers that should have been cleaned up prior to sequencing? Secondly, how many reads do you think we should expect from these sample types? We had hoped for roughly 5000 reads, but I’m not sure whether this is a realistic expectation.

Any assistance would be greatly appreciated!

Lauren

benjjneb commented 3 years ago

That read length distribution looks fine to me. In the data I have worked with, it is typical to see some off-target lengths, and the large majority of your reads are in the expected length range. I would recommend imposing a length window at the filtering step, though, as we did in the PacBio fecal samples tutorial.

As for the number of reads to expect, that depends on the instrument/chemistry and how much multiplexing is being done per SMRT cell as well as the quality of the library prep/sequencing, so I can't really say whether 5k is more or less than you should expect here.
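
Editorial sketch (not part of the original reply): the length window recommended above can be illustrated with a few lines of Python. This is only a minimal illustration of the idea, not DADA2's implementation. In DADA2 itself the window is set through the `minLen` and `maxLen` arguments of `filterAndTrim`. The function names below are hypothetical, and the 1000 to 1600 bp window and the 500 bp threshold are assumed values for full-length 16S amplicons that should be checked against your own primers and length histogram.

```python
from typing import List, Sequence


def filter_by_length(reads: Sequence[str], min_len: int = 1000,
                     max_len: int = 1600) -> List[str]:
    """Keep only reads whose length lies inside [min_len, max_len].

    The default window is an assumption for full-length 16S amplicons;
    adjust it to the amplicon actually sequenced.
    """
    return [r for r in reads if min_len <= len(r) <= max_len]


def fraction_shorter_than(reads: Sequence[str], threshold: int = 500) -> float:
    """Return the fraction of reads strictly shorter than `threshold` bases."""
    if not reads:
        return 0.0
    return sum(1 for r in reads if len(r) < threshold) / len(reads)


# Toy reads (made-up lengths, not real data): one short off-target read,
# two reads in the expected range, one overly long read.
toy_reads = ["A" * 300, "A" * 1450, "A" * 1500, "A" * 2000]
kept = filter_by_length(toy_reads)
short_fraction = fraction_shorter_than(toy_reads)
```

Comparing `len(kept)` with `len(toy_reads)` shows how many reads the window discards, which is a quick way to judge whether the off-target lengths in a histogram are a small minority or a sign of a library problem.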