benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0

Cannot Assign taxonomy #401

Closed SaiPrabhas closed 6 years ago

SaiPrabhas commented 6 years ago

Hello. I have tried to run the command to assign taxonomy which is,

taxa <- assignTaxonomy(seqtab.nochim, "C:/Users/Genomics/Desktop/Sai/DADA2/silva_nr_v128_train_set.fa.gz", multithread=TRUE)

which returns the following error:

Error in C_assign_taxonomy2(seqs, rc(seqs), refs, ref.to.genus, tax.mat.int, : Memory allocation failed.

My second problem: less than 50% of my reads pass the chimera removal step. I ran the pipeline on one sample and it returned this table.

            input  filtered denoised merged tabled nonchim
LBplantarum 327062 208458   208458   137240 137240 66114

I am certain that there were no adapter sequences present, as on-board adapter trimming was enabled on the Illumina MiSeq. Please help me with this issue.

benjjneb commented 6 years ago

Memory allocation failed.

How much memory do you have available? Is it less than 8GB?

I am certain that there were no adapter sequences present, as on-board adapter trimming was enabled on the Illumina MiSeq.

Are there primer sequences on your reads? The Illumina software will remove Illumina adapters, but it won't remove PCR primers if they are included in the sequenced part of your amplicon.
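If you are unsure whether primers made it into the reads, a quick check outside the pipeline can settle it. A minimal sketch in base R (not from this thread), assuming the 515F primer (GTGYCAGCMGCCGCGGTAA) and a hypothetical uncompressed file demo_R1.fastq:

```r
# Hypothetical check: do the forward reads begin with the PCR primer?
# The primer (515F, GTGYCAGCMGCCGCGGTAA) and the filename are assumptions.
reads <- readLines("demo_R1.fastq")
seqs  <- reads[seq(2, length(reads), by = 4)]   # sequence line of each 4-line record
primer <- "^GTG[CT]CAGC[AC]GCCGCGGTAA"          # IUPAC codes: Y=[CT], M=[AC]
sum(grepl(primer, seqs)) / length(seqs)         # fraction of reads starting with the primer
```

A fraction near 1 means the primers are still on the reads and should be trimmed before denoising.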

SaiPrabhas commented 6 years ago

Hello! Thanks for the response. The available memory is 128GB, so I don't think that was the problem, which is why I reported this to you.

I didn't check whether primer sequences are present. I will try to trim them off and rerun the pipeline. But the memory allocation error still needs to be addressed.

Thank you.

benjjneb commented 6 years ago

128GB should be way more than enough memory. What version of the dada2 R package are you using?

If you are running this on a cluster, are you sure your R process actually has access to 128GB of memory?

SaiPrabhas commented 6 years ago

I am using the latest version, 1.6.0. I am running it locally on a Windows PC in R for Windows, so all the memory is available.

benjjneb commented 6 years ago

Can you try running with multithread=FALSE and see if the memory allocation error persists?

Also, what is the output of memory.size(max = TRUE)?

SaiPrabhas commented 6 years ago

I tried it with multithread=FALSE as well and the result was the same.

For memory.size(max = TRUE) the output was 340.69.

benjjneb commented 6 years ago

Try requesting more memory:

memory.limit(size=8000)

SaiPrabhas commented 6 years ago

memory.limit(size=8000)
[1] 130994
Warning message:
In memory.limit(size = 8000) : cannot decrease memory limit: ignored

This was the output.

SaiPrabhas commented 6 years ago

memory.limit(size=8000)
Error in memory.limit(size = 8000) : don't be silly!: your machine has a 4Gb address limit

This is the output in R, where I am running the pipeline. Maybe this is the issue. The previous output was from RStudio.

benjjneb commented 6 years ago

Ah, is this 32-bit Windows? Yeah, the 4GB limit is the problem.

SaiPrabhas commented 6 years ago

No, it is 64-bit. But I think R installed both the 32-bit and 64-bit versions, as I have both of them on my desktop. Is using RStudio a better option? And can I set my memory limit to 130994, as it showed before, to decrease my computation time?

benjjneb commented 6 years ago

I use RStudio because it's awesome, but the key thing here is that you need to use 64-bit R and have access to 8GB of memory. So whatever solution gets you to that point.

SaiPrabhas commented 6 years ago

That was a great help. Thank you very much.

SaiPrabhas commented 6 years ago

Hello! I have come across a section about primer trimming in your FAQ where you mention the trimLeft argument. Can you please specify where to use it? Can I use it in the filtering and trimming step by specifying some value like trimLeft=50? Thank you.

benjjneb commented 6 years ago

Can I use it in the filtering and trimming step by specifying some value like trimLeft=50?

Yep, although 50 is almost certainly too much. You should trim off the length of the primer you used; for example, 515F is 19 nt long, so trimLeft=19 would be appropriate for single-end sequencing. If using paired reads, say with the 515F/806R primer pair, you want to specify separate lengths for forward/reverse trimming, i.e. trimLeft=c(19,20) in that case.

NOTE: This is all assuming that your primers are on the start of your forward (and reverse) reads. In many protocols (e.g. EMP) the primers aren't sequenced. In that case, the trimLeft=0 default is almost always the right choice.
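For paired-end data this slots into the filtering step. A hedged sketch, assuming the tutorial's file vectors (fnFs, filtFs, fnRs, filtRs) and the 515F/806R pair; the truncation lengths are placeholders to adjust to your own quality profiles:

```r
# Sketch only: primer lengths 19 (515F) and 20 (806R) are trimmed from the
# start of the forward and reverse reads respectively.
out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs,
                     trimLeft = c(19, 20),
                     truncLen = c(240, 160),  # example values, not a recommendation
                     maxN = 0, maxEE = c(2, 2), truncQ = 2,
                     multithread = TRUE)
```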

SaiPrabhas commented 6 years ago

Hello! That worked. Thank you. Once I got the taxonomy assigned, I saved the table as a text file. I have used two samples for the analysis but I see there is no differentiation in the table. How can I differentiate between the species identified in each sample?

benjjneb commented 6 years ago

The dada2 R package only handles the bioinformatics portion; it does not do any of the subsequent analysis (e.g. differential abundance analysis). I would recommend looking at the phyloseq R package, and our F1000 workflow, for further microbiome analysis ideas.

SaiPrabhas commented 6 years ago

Yes, I understand that, but my question was: if I use multiple samples for the analysis, how can I know the bacterial population of each sample? I used a positive control and a negative control for the analysis, but in the end I got only one table. How can I differentiate between these samples?

benjjneb commented 6 years ago

Did you make a sequence table (i.e. with makeSequenceTable)? That is the object that holds the counts of each sequence variant in each sample in a matrix.

The taxonomy table is a map from sequence variants to assigned taxonomies; it does not have any information on what was seen in each sample. That's all in the sequence table.

SaiPrabhas commented 6 years ago

Thanks for being so patient and replying promptly. Yes, I did make the table and saved it as an RDS file. But I have no idea how to relate the sequence table and the taxonomy table. Suppose I have two different samples and run them together through the DADA2 pipeline; is there a way I can know which bacterial species are found in each of them separately?

nagasridhar commented 6 years ago

Hi @SaiPrabhas.

The sequence table and the taxonomy table are directly related: the first sequence in your sequence table is the first sequence assigned taxonomy in the taxonomy table. Both tables are sorted by abundance, i.e., from highest to lowest.

Both the sequence table and the taxonomy table are matrices, and if you run

head(sequencetable)

and

head(taxonomytable)

you should be able to see the taxonomic assignment for the first sequence in the sequence table as the first entry in the taxonomy table, and so on.
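Another way to see the correspondence, sketched here with the tutorial's object names (seqtab.nochim, taxa) as assumptions: the ASV sequences themselves are the column names of the sequence table and the row names of the taxonomy table, so the two can be joined directly.

```r
# The tables share the ASV sequences as names, so this should be TRUE:
all(colnames(seqtab.nochim) == rownames(taxa))

# Per-sample breakdown: counts for one sample alongside its assignments.
# "LBplantarum" is the sample name from the table earlier in this thread.
counts  <- seqtab.nochim["LBplantarum", ]
present <- counts > 0
data.frame(count = counts[present], taxa[present, , drop = FALSE])
```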

jshleap commented 5 years ago

Hi! I am running into the same memory allocation problem, but in my case it is not the architecture.

sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Pop!_OS 18.04 LTS

Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_CA.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_CA.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_CA.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

loaded via a namespace (and not attached):
[1] compiler_3.5.1

I am using DADA2 version 1.11.0 in R 3.5.1. I am trying to train with a 25G file. My input files have been trimmed with cutadapt.

I have 128G of RAM available and have tried both multithread = TRUE and multithread = FALSE.

Any help will be appreciated!

Best,

Sergio

benjjneb commented 5 years ago

I am using DADA2 version 1.11.0 in R3.5.1. I am trying to train with a 25G file. My input files have been trimmed with cutadapt.

Can you tell me more about the 25G file you are training with? This is a custom reference database you've constructed? How many sequences are in it?

jshleap commented 5 years ago

Thanks for your prompt answer! Yes, it is a custom COI database with 3439916 sequences. I formatted the headers to contain the full lineage in seven categories: Kingdom;Phylum;Class;Order;Family;Genus;Species. Am I doing it wrong? Also, this is not an alignment; should the sequences be aligned? I just read your closed issue #302. I think I do have some empty fields in some of the headers. Is this not allowed in your classifier?

benjjneb commented 5 years ago

should the sequences be aligned?

No, in fact they should not be aligned and should not contain gap characters.

I think I do have some empty fields in some of the headers. Is this not allowed in your Classifier?

Empty fields are allowed, but only at the tail end of the taxonomic levels. I.e.

> Kingdom; Phylum;

is OK, and is treated as the lower ranks being undefined for that entry, but

> Kingdom; Order; Family;

is not. I.e. skipping ranks is not supported.

Yes it is a custom COI database with 3439916 sequences.

I suspect the issue here is related to the less-than-ideal (super-linear) memory scaling of our implementation of the naive Bayesian classifier with respect to the size of the reference database. This is a known issue (#239) but we have not had a chance to make progress on it yet. For now, you would need to find a way to condense your reference database to a smaller size if you want to use assignTaxonomy. A common approach in the field is to deduplicate sequences with the same taxonomic assignment to some degree, for example by clustering at some threshold (99%?) and keeping only the centroid sequence, at least where all members of the cluster share the same taxonomic assignment.

cpavloud commented 5 years ago

I also have the same issue. I am using the latest version, on a Windows PC with RStudio.

library(dada2); packageVersion("dada2")
[1] ‘1.10.1’

taxa_silva <- assignTaxonomy(seqtab.nochim, "silva_132.18s.99_rep_set.dada2.fa.gz")
Error in C_assign_taxonomy(seqs, rc(seqs), refs, ref.to.genus, tax.mat.int, : Memory allocation failed.

I should have enough memory...

memory.size(max = TRUE)
[1] 8095.25

I tried also

taxa_silva <- assignTaxonomy(seqtab.nochim, "silva_132.18s.99_rep_set.dada2.fa.gz", multithread=FALSE)

but I keep getting the same error.

benjjneb commented 5 years ago

@cpavloud The workaround for now is to find a machine with more memory; the Silva 18S data seems to take a bit more memory than the 16S. I think 16GB is enough. It also seems that Windows runs into memory walls sooner than OSX/Unix OSes, even with the same amount of physical memory.

You could also consider the alternative taxonomy assignment using DECIPHER::IdTaxa, although I'm not sure about the availability of a pre-trained Silva 18S model for that function.

cpavloud commented 5 years ago

I don't think it is possible to use the DECIPHER package for Silva 18S.

I am now running DADA2 in an HPC cluster, so there is no memory error. :)

SharonOrtiz commented 3 years ago

Hi, I am using Galaxy DADA2 and I am getting this error, could someone help me?

Error in C_assign_taxonomy2(seqs, rc(seqs), refs, ref.to.genus, tax.mat.int, : Memory allocation failed.
Calls: assignTaxonomy -> C_assign_taxonomy2
In addition: Warning message:
In .Call2("fasta_index", filexp_list, nrec, skip, seek.first.rec, : reading FASTA file /galaxy-repl/main/files/059/949/dataset_59949784.dat: ignored 8 invalid one-letter sequence codes
Execution halted

Th5P commented 2 years ago

Hi, I am using Galaxy DADA2 and I am getting this error, could someone help me?

Error in C_assign_taxonomy2(seqs, rc(seqs), refs, ref.to.genus, tax.mat.int, : Memory allocation failed.
Calls: assignTaxonomy -> C_assign_taxonomy2
In addition: Warning message:
In .Call2("fasta_index", filexp_list, nrec, skip, seek.first.rec, : reading FASTA file /galaxy-repl/main/files/059/949/dataset_59949784.dat: ignored 8 invalid one-letter sequence codes
Execution halted

Hello, Sharon! How did you solve the issue? I'm running into the same problem, both in Galaxy and RStudio.

Thank you :)