benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0
471 stars 143 forks source link

AlignSeqs error #1762

Closed tandrafraser closed 1 year ago

tandrafraser commented 1 year ago

Hi,

I received an error at the alignment stage of the dada2 pipeline alignment = AlignSeqs(Biostrings::DNAStringSet(setNames(df$sequence, df$id)), anchor = NA)

46%error in DistanceMatrix(seqs, type="dist", verbose = verbose, processors, : RAW() can only be applied to a "raw" but not "builtin"

When I ran the pipeline using a few sequences from the run to make sure everything was working there was no issue. My sequences are 16S V4 paired-end reads from a NovaSeq run and it took 2 days to run the alignment step so I would like to fix it before I try again. Any ideas on how to fix this before trying again and why it would behave differently when running the full set would be greatly appreciated. Thanks.

benjjneb commented 1 year ago

This isn't part of the dada2 package, but rather is the alignment implemented in the DECIPHER package.

Perhaps @digitalwright can offer a suggestion on what might be going on here?

digitalwright commented 1 year ago

@tandrafraser Please provide a reproducible example of the error.

tandrafraser commented 1 year ago

This isn't part of the dada2 package, but rather is the alignment implemented in the DECIPHER package.

Perhaps @digitalwright can offer a suggestion on what might be going on here?

Thanks for the quick reply. I am using another program for alignment. I think I had too many sequences for DECIPHER to handle.

digitalwright commented 1 year ago

@tandrafraser If you could provide an example it would be appreciated. DECIPHER has been used to align hundreds-of-thousands of sequences. If there is a bug then it would be great to know.

Phylloxera commented 1 year ago

Hi @digitalwright,

I've got a large 16s dataset (129K seqs) that I'd like advice on aligning with DECIPHER. The out of memory error on my PC, came pretty quick after calling AlignSeqs() "Error: cannot allocate vector of size 61.7 Gb". After this, I assigned the job to one of our agency HPCs. It chugged for about 11 hours before the cgroup out-of-memory handler killed it. My HPC slurm specs were 48 cores and 144GB Memory and my command was AlignSeqs(DNAStringSet(seqs), anchor=NA, processors=NULL) . This is how far it got:

Determining distance matrix based on shared 8-mers:
================================================================================

Time difference of 9915.66 secs

Clustering into groups by similarity:
================================================================================

Time difference of 2340.98 secs

Aligning Sequences:
===============================================================================

How much memory should I request for round 2? verbose? Sorry for posting here, but @benjjneb 's awesome 2016 protocol is my primary reference, at least until deep diving into DESeq2. :grimacing:

Phylloxera commented 1 year ago

update: got some advice from ArtOfAlignmentInR . I'll post here if new code breeds success!

digitalwright commented 1 year ago

@Phylloxera Your initial error was likely due to insufficient memory to make a distance matrix. You could supply a guideTree in this case, or use a different computer as you did. Your second error looks like the alignment contains non-homologous sequences, and it is expanding in width until running out of memory. Since non-homologous positions should not be in the same alignment column/site, the number of columns/sites can increase until the alignment becomes intractably large. My suggestion is to figure out which sequences are not 16S and exclude them from the input. You could, for example, use the Clusterize() function with a high cutoff (i.e., low similarity) to do this, and/or you could reject any that IDTAXA doesn't classify as 16S (e.g., "unclassified_Root"). Also, are you only aligning the unique sequences (or ASVs)? Most datasets yield fewer than 129k ASVs, and there is typically no point in aligning redundant/duplicate sequences.

Phylloxera commented 1 year ago

THANKS @digitalwright !!! A week wasn't enough time with processors=48... probably because of the intractably expanding width. I found 43 unclassified seqs and removed. I'm trying to run Clusterize() on my MacBook, but expect it to fail (fan is crankin', though). It looks like Clusterize() is a newer function as my R4.2.0 DECIPHER (32 cores) could not find it. I'm not sure what 'large' should be, but I'm trying 0.5. I had not been surprised at the number of ASVs since I'm dealing with 310 libraries from 14 illumina runs with avg seq count ~quarter million... but maybe I should still be surprised. I had been planning to assess outliers using some funcs applied to phyloseq objects from @benjjneb 2016 workflow, ordinate(... distance="unifrac"), I think, but this does not work without a tree. But maybe, I need to be thinking about outlier ID using different approaches and thus trim down the data a little more before aligning. Thoughts welcome! Also, happy to move thread to bioconductor.

digitalwright commented 1 year ago

My guess is that there are spurious sequences that aren't 16S in the data. My suggestion to classify all of the ASVs with IDTAXA using an appropriate training set, and remove any sequences that aren't 16S if you haven't done this already.

Phylloxera commented 1 year ago

I probably over-purged down to 84k sequences, which were aligned in 40h using 72 cores