benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0
460 stars 141 forks source link

Error in assignTaxonomy on a BIG seqtab #1429

Closed padpadpadpad closed 1 month ago

padpadpadpad commented 2 years ago

Hi

This problem is very similar to Issue 961.

The sequencing is from a big Novaseq run of the protein coding gene rpoB so I do have 400,000 unique sequences to assign. Is there anyway to do this in batches?

An earlier run of ~40,000 sequences did assign fine. Which makes me think it is the number of sequences in seqtab rather than the reference database.

When trying to assign taxonomy. I am getting an error:

Error in C_assign_taxonomy2(seqs, rc(seqs), refs, ref.to.genus, tax.mat.int, : negative length vectors are not allowed Calls: assignTaxonomy -> C_assign_taxonomy2

When running it on a server. The server info is:

R version 4.0.3 (2020-10-10)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: Ubuntu 18.04.5 LTS

I run it currently with 16 cores, but the error persists if I use 25. I have created a smaller reference database (going from 26000 sequences including species rank to 11000 including only down to genus level. The error persists.

Any thoughts much appreciated.

padpadpadpad commented 2 years ago

I am currently testing a work around to run assignTaxonomy in batches:


seqtab <- readRDS(paste(output_path, 'temp', 'seqtab.rds', sep = '/'))

# assign taxonomy in batches of 50,000
to_split <- seq(1, ncol(seqtab), by = 50000)
to_split2 <- c(to_split[2:length(to_split)]-1, ncol(seqtab))

taxtab = NULL

for(i in 1:length(to_split)){
  seqtab2 <- seqtab[, to_split[i]:to_split2[i]]
  taxtab2 <- assignTaxonomy(seqtab2, refFasta = ref_fasta, multithread = TRUE)
  if(!is.null(ref_fasta_spp)){taxtab2 <- addSpecies(taxtab2, refFasta = ref_fasta_spp, verbose = TRUE)}
  if(!is.null(taxtab)){taxtab <- rbind(taxtab, taxtab2)}
  if(is.null(taxtab)){taxtab <- taxtab2}
}
# save files in case phylogeny does not run
saveRDS(taxtab, paste(output_path, 'temp', 'taxtab.rds', sep = '/'))

It has done the first 50,000 so that is a start!

benjjneb commented 2 years ago

Yes you can go in batches, no difference in results since each sequence is being assigned individually w/in the function itself anyway.

mmanus commented 1 year ago

Hi @padpadpadpad, did this loop work for you? I used your example as a model for assigning taxonomy to PacBio long read sequence data. While it appears to work in the sense that a taxonomy table is created (with around 25,000 taxa), I am unable to create a downstream phyloseq object. I suspect it's because the character strings in the feature table and the taxonomy table do not fully match-- that is, the string of characters for each taxon ID appears in multiple cells within the feature table, and taxa IDs are repeated (e.g. multiple hits for C. namnetense). See below for my code and attached files that show the problem (note that I am uploading modified shorter files, due to upload file size restrictions).Any advice would be greatly appreciated, thanks in advance!

Read in OTU table

otu_table_in <- as.matrix( fread("st2_example_April.csv"), rownames = "OTUID") otu_table_in_t <- t(otu_table_in)

taxonomy <- read.csv("taxonomy_example_April.csv") taxonomy <- column_to_rownames(taxonomy, var="OTUID")

metadata <- as.data.frame( fread("metadata_full_April.csv")) metadata_new <- column_to_rownames(metadata, var = "OTUID")

venn(list(taxonomy==rownames(taxonomy),featuretable=rownames(otu_table_in_t))) # note no overlap

OTU <- otu_table(otu_table_in_t, taxa_are_rows = T) TAX = tax_table(as.matrix(taxonomy)) META <- sample_data(metadata_new)

po.full <- phyloseq(OTU, TAX, META)

Error in validObject(.Object) : invalid class “phyloseq” object:

Component taxa/OTU names do not match.

taxonomy_example_April.csv st2_example_April.csv

padpadpadpad commented 1 year ago

Hi @mmanus. I suppose what I don't understand is how the taxa names for the OTU table and taxa table do not match?

The tax table created from the sequence table automatically gives the rownames (taxa names) to be the same as for the sequence table.

The string of characters (i.e. the DNA sequence) should be unique for each row of the OTU (feature) table, because each row is a unique ASV and each column is their abundance in each sample. So if you have the EXACT same sequence has multiple rows they can be aggregated as I understand it.

mmanus commented 1 year ago

Hi @padpadpadpad thanks for the reply. We are still working on this issue, and have a follow-up question for you. In your code (pasted below), could you please clarify what 'ref_fasta' and 'ref_fasta_spp' correspond to (i.e. which training set was used)? In our assignTaxonomy loop, we are using 'refFasta = silva_nr99_v138.1_wSpecies_train_set.fa.gz' but we are not sure how to incorporate this into 'addSpecies.'

Thanks!

for(i in 1:length(to_split)){ seqtab2 <- seqtab[, to_split[i]:to_split2[i]] taxtab2 <- assignTaxonomy(seqtab2, refFasta = ref_fasta, multithread = TRUE) if(!is.null(ref_fasta_spp)){taxtab2 <- addSpecies(taxtab2, refFasta = ref_fasta_spp, verbose = TRUE)} if(!is.null(taxtab)){taxtab <- rbind(taxtab, taxtab2)} if(is.null(taxtab)){taxtab <- taxtab2} }

padpadpadpad commented 1 year ago

Hi @mmanus sorry for the slow reply.

For this one specifically I made my own custom reference database and I left the ref_fasta_spp as NULL. I generally use the fasta file silva_nr99_v138.1_wSpecies_train_set.fa.gz for assignTaxonomy() and silva_species_assignment_v138.1.fa.gz for addSpecies(). I cannot remember if assignTaxonomy() returns species level if it is included in the database, but it is not that reliable for resolving to species level anyway (https://www.biorxiv.org/content/10.1101/192211v1 and https://benjjneb.github.io/dada2/assign.html) so maybe we could use silva_nr99_v138.1_train_set.fa.gz for the assignTaxonomy().

Generally though all reference trainsets should work with this code, I just have some code earlier on that defines what the ref_fasta and ref_fasta_spp are.