benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/

Pacbio HiFi long time at learnErrors + other issues. #1882

Closed nathanjohns closed 3 months ago

nathanjohns commented 7 months ago

Hello,

I'm currently trying to use dada2 to analyze some PacBio HiFi amplicon data covering near-full-length rRNA operons (~4 kb). I have a few questions about the optimal way to run dada2 on my data.

I've been following your guides for the Zymo mock community and fecal samples for PacBio, and when running a subset of ~25k reads I can get through the dada and isBimeraDenovo steps no problem; the number of sequences returned at the end seems reasonable.

My dataset includes 4 samples, which were run on the same Sequel IIe flow cell and are each ~1-1.5 million reads before trimming and filtering. I started processing one of them and have a few questions; my last two commands are below.

drp <- derepFastq('20231010_rRNA.demux.0--0.trimfiltmerge.fastq.gz', verbose=TRUE)
Dereplicating sequence entries in Fastq file: 20231010_rRNA.demux.0--0.trimfiltmerge.fastq.gz
Encountered 93299 unique sequences from 454842 total sequences read.

err <- learnErrors(drp, BAND_SIZE=32, multithread=TRUE, errorEstimationFunction=dada2:::PacBioErrfun)
1803269856 total bases in 454842 reads from 1 samples will be used for learning the error rates.

1) My main problem at the moment is that learnErrors is taking a long time. I'm running it on an r6a.16xlarge AWS instance (64 vCPUs, 512 GiB of memory) and it has been going for almost 24 hr as I write this. Is this expected for ~500k 4 kb reads? I double-checked with htop: all CPUs are firing at close to 100% while memory usage is only 27.8/493 GiB.

2) Is it possible to learn the error profile from a smaller subset of reads and then use it with dada on the larger read set?

3) If the samples were sequenced together on the same flowcell, do I need to run learnErrors for each of them? The strain content should be largely overlapping.

4) Is the dada function also expected to take a long time on my larger read set?

5) I lost a lot of reads (down ~70% for the first sample) during my filtering step (code below). I used the code from the PacBio Zymo example as a template; however, because my reads are longer, I realize maxEE of 2 may not be optimal (a back-of-envelope check follows the command below). The sequencing center told me that my library was median Q39.

track <- fastqFilter("20231010_rRNA.demux.0--0.trimmed.0.fastq.gz", "20231010_rRNA.demux.0--0.trimfilt.0.fastq.gz", minQ=20, minLen=3500, maxLen=4500, maxN=0, rm.phix=FALSE, maxEE=2, verbose=TRUE)
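As a rough sanity check on that worry (a back-of-envelope sketch assuming a uniform Q39 across all 4000 bases, which real reads won't have): maxEE filters on the sum of per-base error probabilities, 10^(-Q/10), so a median-Q39 4 kb read should sit well under maxEE=2.

read_len <- 4000
q <- 39
ee <- read_len * 10^(-q / 10)  # sum of per-base error probabilities at uniform Q39
ee  # ~0.50, comfortably below maxEE=2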

Also, I received an error when I tried to filter the whole primer-trimmed sample. I can't access the actual message because learnErrors is still running, but I think it was the same as someone else's on here and would have been this:

Dereplicating sequence entries in Fastq file: "20231010_rRNA.demux.0--0.trimmed.fastq.gz"
Error in asMethod(object) : long vectors not supported yet: memory.c:3717

For now I don't mind the extra-stringent filter because I am still feeling out the pipeline, but in the long run I'd like to know if there is a solution other than splitting, filtering, and re-merging.

6) Lastly, my experiment is a simulated fecal transplant: one sample is the native host community, one is a synthetic donor community, and the third is the resulting community from the mouse after some time, presumably a mixture of the two. Assuming I sort out the above issues, my plan was to generate the ASVs de novo for the first two communities individually. However, I'm not sure what to do for the mixture community. My ultimate goal is to determine the source of each read in the mixture as having come from the host or the invading community. For many species I think this will be easy, when the species is only in one of the communities. For species shared between the two, I'm hoping the long amplicons let me differentiate as many as possible; preliminary bioinformatic analysis of the genomes suggests I should be able to in many cases, although not all. Do you have any recommendations for how I could do this type of analysis with dada2? Would I just generate ASVs de novo and then taxonomically map them back to a reference DB that combines the ASVs from the two sources?


Quick update: it did finish after ~26 hours. I'm running the following now:

dd1 <- dada(drp, err=err, BAND_SIZE=32, multithread=TRUE)

I'm still interested in the issues described above so that I can run this efficiently.

cjfields commented 7 months ago

@nathanjohns see some answers below, though I'm curious what Ben says.

drp <- derepFastq('20231010_rRNA.demux.0--0.trimfiltmerge.fastq.gz', verbose=TRUE)
Dereplicating sequence entries in Fastq file: 20231010_rRNA.demux.0--0.trimfiltmerge.fastq.gz
Encountered 93299 unique sequences from 454842 total sequences read.

err <- learnErrors(drp, BAND_SIZE=32, multithread=TRUE, errorEstimationFunction=dada2:::PacBioErrfun)
1803269856 total bases in 454842 reads from 1 samples will be used for learning the error rates.

  1. My main problem at the moment is that learnErrors is taking a long time. I'm running it on an r6a.16xlarge AWS instance (64 vCPUs, 512 GiB of memory) and it has been going for almost 24 hr as I write this. Is this expected for ~500k 4 kb reads? I double-checked with htop: all CPUs are firing at close to 100% while memory usage is only 27.8/493 GiB.

This is pretty much what I have been seeing for longer-read data at high depth on a standard cluster (24-48 cores for the job). For large data sets we request more than 24 hrs of walltime for learnErrors and dada (and sometimes the chimera-removal steps) in case the job times out.

  2. Is it possible to learn the error profile from a smaller subset of reads and then use it with dada on the larger read set?

nreads in learnErrors, I think. Not sure if there is a recommended minimum threshold for PacBio HiFi; I normally go with the default.
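As a minimal sketch of that approach (assuming a current dada2 in which learnErrors exposes nbases to cap the data used for the error model; the file name is from the post above):

library(dada2)
# Cap error learning at ~1e8 bases, i.e. roughly 25k of these ~4 kb reads.
err <- learnErrors("20231010_rRNA.demux.0--0.trimfiltmerge.fastq.gz",
                   nbases=1e8, BAND_SIZE=32, multithread=TRUE,
                   errorEstimationFunction=dada2:::PacBioErrfun)
# Reuse the subset-trained model on the full dereplicated read set.
dd <- dada(drp, err=err, BAND_SIZE=32, multithread=TRUE)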

  3. If the samples were sequenced together on the same flowcell, do I need to run learnErrors for each of them? The strain content should be largely overlapping.

My impression is that data from each sequencing unit (Illumina lane, PacBio SMRT cell) should be run together, and that multiple runs should be processed separately to estimate errors per run, combined afterwards (there is a function in dada2 for combining sequence tables), with chimera detection then run on everything. But it would be good to have that clarified; a sketch of that workflow follows.
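A minimal sketch of that per-run workflow, assuming drp_run1 and drp_run2 are hypothetical derep objects for two different SMRT cells (mergeSequenceTables is the combining function mentioned above):

library(dada2)
# Estimate an error model per run and denoise each run with its own model.
err1 <- learnErrors(drp_run1, errorEstimationFunction=dada2:::PacBioErrfun,
                    BAND_SIZE=32, multithread=TRUE)
err2 <- learnErrors(drp_run2, errorEstimationFunction=dada2:::PacBioErrfun,
                    BAND_SIZE=32, multithread=TRUE)
dd1 <- dada(drp_run1, err=err1, BAND_SIZE=32, multithread=TRUE)
dd2 <- dada(drp_run2, err=err2, BAND_SIZE=32, multithread=TRUE)
# Combine the per-run tables, then run chimera removal on the merged table.
st <- mergeSequenceTables(makeSequenceTable(dd1), makeSequenceTable(dd2))
st.nochim <- removeBimeraDenovo(st, method="consensus", multithread=TRUE)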

  4. Is the dada function also expected to take a long time on my larger read set?

I found that the learnErrors and dada steps take roughly the same time, maybe a little longer for learnErrors.

  5. I lost a lot of reads (down ~70% for the first sample) during my filtering step (code below). I used the code from the PacBio Zymo example as a template; however, because my reads are longer, I realize maxEE of 2 may not be optimal. The sequencing center told me that my library was median Q39.

track <- fastqFilter("20231010_rRNA.demux.0--0.trimmed.0.fastq.gz", "20231010_rRNA.demux.0--0.trimfilt.0.fastq.gz", minQ=20, minLen=3500, maxLen=4500, maxN=0, rm.phix=FALSE, maxEE=2, verbose=TRUE)

Also, I received an error when I tried to filter the whole primer-trimmed sample. I can't access the actual message because learnErrors is still running, but I think it was the same as someone else's on here and would have been this:

Dereplicating sequence entries in Fastq file: "20231010_rRNA.demux.0--0.trimmed.fastq.gz"
Error in asMethod(object) : long vectors not supported yet: memory.c:3717

For now I don't mind the extra-stringent filter because I am still feeling out the pipeline, but in the long run I'd like to know if there is a solution other than splitting, filtering, and re-merging.

We have sporadic but similar issues with individual samples that have > 500k reads when dereplicating in R. I found that the long vectors not supported yet: memory.c error with derepFastq can be worked around by setting the n value lower:

dereps <- derepFastq(filts, n=100000, verbose=TRUE)  # n caps reads parsed into memory at one time

Regarding read loss: just checking, but are your PacBio HiFi reads filtered at 99.9% accuracy based on CCS? Our core doesn't report median QV but rather CCS read accuracy. I don't think this workflow works well with the default CCS 99% accuracy cutoff, which is what many providers give you, though they should (hopefully) also provide the BAM file. If you have it, you can check the reads pretty easily using samtools view, the samtools expression syntax, and the rq tag in the BAM. Something like:

samtools view -e '[rq] >= .999' ccs.bam | samtools bam2fq - > ccs.fastq

I also have bumped max expected errors (maxEE) to 4, but I don't recall that having a huge impact.

  6. Lastly, my experiment is a simulated fecal transplant: one sample is the native host community, one is a synthetic donor community, and the third is the resulting community from the mouse after some time, presumably a mixture of the two. Assuming I sort out the above issues, my plan was to generate the ASVs de novo for the first two communities individually. However, I'm not sure what to do for the mixture community. My ultimate goal is to determine the source of each read in the mixture as having come from the host or the invading community. For many species I think this will be easy, when the species is only in one of the communities. For species shared between the two, I'm hoping the long amplicons let me differentiate as many as possible; preliminary bioinformatic analysis of the genomes suggests I should be able to in many cases, although not all. Do you have any recommendations for how I could do this type of analysis with dada2? Would I just generate ASVs de novo and then taxonomically map them back to a reference DB that combines the ASVs from the two sources?

Quick update: it did finish after ~26 hours. I'm running the following now:

dd1 <- dada(drp, err=err, BAND_SIZE=32, multithread=TRUE)

I'm still interested in the issues described above so that I can run this efficiently.

Apart from the strategy you would employ to generate ASVs for your samples, I would think this would happen post-dada2, using specific analytical libraries in R or plugins in QIIME 2, and it really depends on your overall experimental design, including how many replicates you have per group. If these were all produced on one flow cell, I'd run them all together and use a pooling approach to capture as much information as possible, but this depends on what your ultimate goals are. If you don't care about singletons, you could opt to run learnErrors on all the data, then run dada on the individual samples; a sketch of both options follows.
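A minimal sketch of those two options, assuming filts is a hypothetical character vector of the filtered FASTQ paths and a dada2 version that accepts filenames directly:

library(dada2)
filts <- list.files(".", pattern="trimfiltmerge[.]fastq[.]gz$", full.names=TRUE)  # hypothetical paths
# One error model learned across all samples from the flow cell.
err <- learnErrors(filts, errorEstimationFunction=dada2:::PacBioErrfun,
                   BAND_SIZE=32, multithread=TRUE)
# Option A: pool samples during denoising to retain rare variants/singletons.
dd_pooled <- dada(filts, err=err, pool=TRUE, BAND_SIZE=32, multithread=TRUE)
# Option B: denoise samples independently (faster, drops per-sample singletons).
dd_indep <- dada(filts, err=err, BAND_SIZE=32, multithread=TRUE)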

Also, keeping in mind I'm speaking as someone who has worked on lots of microbiome projects but hasn't worked on one with this particular design, I'd naively look at tools focused on simple comparisons (how many ASVs are in common across samples) or network analysis. I'd also check the human-gut synthetic microbiome community literature, which likely includes applications such as those you describe. I know of a few researchers here who have done something similar in their groups; I could ask around.

The only concern I would have is the sensitivity of the assay, since you can distinguish between rRNA copies within one genome using 16S-ITS-23S. Do you need that level of detection for this? Just curious, but have you looked into the FANGORN database?

nathanjohns commented 6 months ago

@cjfields thanks for your extensive comments. I ended up completing the analysis for some of my samples after subsampling the reads to avoid the memory error.

  1. I will try nreads for my next run. My samples are fairly deeply sequenced (~1 million reads before filtering), so hopefully I can save some time by running learnErrors on a subset of reads.

  2. That is good to know, as I would expect my samples within the same run to have similar error profiles. Perhaps there is a way I can empirically evaluate this, since I saved the error-profile variables for each of my 4 samples (see the sketch after this list). This will be my last Sequel IIe run; going forward everything will be Revio, so I will have quite a bit more samples/data, and running learnErrors in full for each sample would take forever.

  3. I found that the dada step took significantly less time than learnErrors, but it was still on the order of hours.

  4. I need to check with the sequencing core about their cutoff; I think there is a chance it was 99% rather than 99.9%, but my runs were median Q40, so I would think the reads are pretty good overall. I'll check my samples. However, I do wonder if my maxEE is too low given that my fragments are longer (4 kb vs. the normal full-length 16S at ~1.5 kb). Thanks for the tip on the n value; I think that was definitely hanging up my data, and splitting + re-merging was getting annoying.
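A minimal sketch of that empirical check, assuming err_sample1 and err_sample2 are two of the saved learnErrors objects (hypothetical names):

library(dada2)
# Compare the fitted error rates from two samples on the same SMRT cell.
e1 <- getErrors(err_sample1)
e2 <- getErrors(err_sample2)
summary(as.vector(abs(log10(e1) - log10(e2))))  # disagreement, in log10 units
# Visual check of each fit against the nominal Q-score expectation.
plotErrors(err_sample1, nominalQ=TRUE)
plotErrors(err_sample2, nominalQ=TRUE)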

Apart from the strategy you would employ to generate ASVs for your samples, I would think this would happen post-dada2, using specific analytical libraries in R or plugins in QIIME 2, and it really depends on your overall experimental design, including how many replicates you have per group. If these were all produced on one flow cell, I'd run them all together and use a pooling approach to capture as much information as possible, but this depends on what your ultimate goals are. If you don't care about singletons, you could opt to run learnErrors on all the data, then run dada on the individual samples.

Also, keeping in mind I'm speaking as someone who has worked on lots of microbiome projects but hasn't worked on one with this particular design, I'd naively look at tools focused on simple comparisons (how many ASVs are in common across samples) or network analysis. I'd also check the human-gut synthetic microbiome community literature, which likely includes applications such as those you describe. I know of a few researchers here who have done something similar in their groups; I could ask around.

The only concern I would have is the sensitivity of the assay, since you can distinguish between rRNA copies within one genome using 16S-ITS-23S. Do you need that level of detection for this? Just curious, but have you looked into the FANGORN database?

I'm pretty happy with the results from my first pass. I ran the entire pipeline for each sample separately, and ~2/3 of my ASVs from the mixed community matched perfectly and uniquely to an ASV from one of the source communities. Furthermore, many of the non-perfect ASVs can be explained by 1) being very similar to an ASV in one of the source communities, 2) probably having been lost as low abundance in one of the source communities, or 3) being a chimera that didn't get filtered out. I think some tinkering with the parameters at the various stages, and analyzing more reads, will go a long way towards resolving some of these. Given that I used a rather stringent quality filter, I will have to see how letting in some lower-quality reads affects my analysis, although with future runs on the Revio I can afford to sequence deeper per sample.
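For reference, a minimal sketch of that exact-matching attribution, assuming st_host, st_donor, and st_mix are hypothetical per-sample sequence tables from makeSequenceTable:

library(dada2)
asv_host  <- getSequences(st_host)
asv_donor <- getSequences(st_donor)
asv_mix   <- getSequences(st_mix)
# Assign each mixture ASV to a source by exact sequence identity.
src <- ifelse(asv_mix %in% asv_host & asv_mix %in% asv_donor, "shared",
       ifelse(asv_mix %in% asv_host, "host",
       ifelse(asv_mix %in% asv_donor, "donor", "unassigned")))
table(src)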

Thanks for pointing me in the direction of FANGORN; I had not seen that and will look into it further. We've assembled our own database from closed genomes for strains in our lab, but we find that the rRNA operon is actually a very problematic region to assemble. Even the Zymo mock community example found that the reference sequences had mistakes at the intra-genomic level, so I am a little skeptical of any reference-based approach. We have found that errors can be introduced in hybrid assemblies during the short-read polishing step. In my amplicon data I do see multiple ASVs at similar abundances for many species, but I will need to make some synthetic mock samples to see whether any ASVs are lost or contain errors. Thanks for your help!