benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0
471 stars 142 forks source link

Sample Inference Error allocating memory for vectors #769

Closed REngstrand closed 5 years ago

REngstrand commented 5 years ago

I'm trying to run the dada command to apply the sample inference algorithm to dereplicated data for my 16S samples. It fails for my dadaFs, saying it cannot allocate vector of a different size each time I run it. I don't think my data set is that big with only 120 samples and my PC has 16GB of RAM so I figured it could handle this & I wouldn't have to do the big data pipeline. Should I be switching to big data pipeline or is there another fix for this? I'm noticing that my reads and unique sequence numbers for each sample are pretty close to one another whereas in the DADA2 tutorial, they are not, so maybe my issue is I should be filtering my samples more?

dadaFs <- dada(derepFs, err=errF, multithread=TRUE)

Sample 1 - 16077 reads in 16033 unique sequences.

Sample 2 - 35972 reads in 35807 unique sequences.

Error: cannot allocate vector of size 71.0 Mb

just in case... the output from

head(out) reads.in reads.out RCE-1-A1_S1_L001_R1_001_trimmed.fastq 19158 16077 RCE-1-A10_S73_L001_R1_001_trimmed.fastq 41160 35972 RCE-1-A11_S81_L001_R1_001_trimmed.fastq 25793 23126 RCE-1-A12_S89_L001_R1_001_trimmed.fastq 15699 7081 RCE-1-A2_S9_L001_R1_001_trimmed.fastq 1445 1200 RCE-1-A3_S17_L001_R1_001_trimmed.fastq 17159 13936

benjjneb commented 5 years ago

I'm noticing that my reads and unique sequence numbers for each sample are pretty close to one another whereas in the DADA2 tutorial, they are not, so maybe my issue is I should be filtering my samples more?

We can fix the memory issue, but this is the more important concern. This is a warning sign that something isn't quite right. What do your quality profile plots look like, and what filtering did you do? Are you sure this is amplicon data, and not shotgun data? What environment are you sampling?

REngstrand commented 5 years ago

Ruhroh

My quality profile plots look alright, from what I can tell by just comparing them to the tutorial data, this is my first dataset. Quality plots are below

Rs Fs

The filtering step I ran is: out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs, truncLen=c(260,150),maxN=0, maxEE=c(2,8), truncQ=2, rm.phix=TRUE,compress=TRUE, multithread=FALSE)

The sequences are definitely amplicon data, sampling bacteria in tropical forest soils and tropical mining soils

benjjneb commented 5 years ago

Hm... that all looks pretty reasonable. If you don't have it yet. could you update to version 1.12 of the DADA2 package (which we can also use to avoid the memory error) and check for low-complexity sequences in your data, with e.g. plotComplexity(fnF.example)? Also, can you check the taxonomy of the most abundant sequences in an example F and R read file, e.g.

drp <- derepFastq(fnF.example) # or fnR.example
unname(assignTaxonomy(drp$uniques[1:10], "path/to/silva_trainset.fasta"))

Finally, are you sure your reads do not have primers on them? You've not removed primers by trimLeft so if they were on your reads they would remain and if they have ambiguous bases could be explaining the low levels of repeated sequences in your data. What is the primer set you are using? And what is the output of substr(getSequences(drp)[1:10], 1, 20)?

Basically we're trying to see if there is a technical explanation for # uniques ~ # reads.

REngstrand commented 5 years ago

I updated from 1.11 to 1.12 and re-ran everything. I no longer have the memory issues for the pipeline tutorial and can get through to the end, but I can't run any of your troubleshooting steps. I've tried a few times and tried de-bugging but have had no luck. I do believe that my primers were removed, I used cutadapt on everything before running the tutorial but I will try to figure out how to check that in case that's the big issue here.

plotComplexity(fnFs) Error: 'Calloc' could not allocate memory (38903119 of 1 bytes)

unname(assignTaxonomy(derepFs$uniques[1:10], "C:/Users/rache/Desktop/cutadapt_data/16S/silva_nr_v132_train_set.fa.gz"))

Error in getUniques(object, collapse = collapse, silence = silence) : Unrecognized format: Requires named integer vector, dada-class, derep-class, sequence matrix, or a data.frame with $sequence and $abundance columns.

substr(getSequences(derepFs)[1:10], 1, 20) Error in getUniques(object, collapse = collapse, silence = silence) : Unrecognized format: Requires named integer vector, dada-class, derep-class, sequence matrix, or a data.frame with $sequence and $abundance columns.

benjjneb commented 5 years ago

Those commands want just one sample, not the list of all your samples. If you replace fnFs with fnFs[[1]] and derepFs$uniques[1:10] with derepFs[[1]]$uniques[1:10] they should work.

REngstrand commented 5 years ago

Thanks for the clarification... trying as [1] was one of my troubleshoots, can't believe I forgot that second bracket! That does work and yields the following:

plotcomplexity

unname(assignTaxonomy(derepFs[[1]]$uniques[1:10], "C:/Users/rache/Desktop/cutadapt_data/16S/silva_nr_v132_train_set.fa.gz")) [,1] [,2] [,3] [,4] [,5] [,6]
[1,] "Bacteria" "Verrucomicrobia" "Verrucomicrobiae" "Chthoniobacterales" "Chthoniobacteraceae" "Candidatus_Udaeobacter" [2,] "Bacteria" "Rokubacteria" "NC10" "Rokubacteriales" NA NA
[3,] "Bacteria" "Proteobacteria" "Alphaproteobacteria" "Rhizobiales" "Xanthobacteraceae" NA
[4,] "Bacteria" "Rokubacteria" "NC10" "Rokubacteriales" NA NA
[5,] "Bacteria" "Actinobacteria" "Actinobacteria" "Streptomycetales" "Streptomycetaceae" "Kitasatospora"
[6,] "Bacteria" "Rokubacteria" "NC10" "Rokubacteriales" NA NA
[7,] "Bacteria" "Actinobacteria" "Thermoleophilia" "Gaiellales" NA NA
[8,] "Bacteria" "Verrucomicrobia" "Verrucomicrobiae" "Chthoniobacterales" "Chthoniobacteraceae" "Candidatus_Udaeobacter" [9,] "Archaea" "Thaumarchaeota" "Nitrososphaeria" "Nitrososphaerales" "Nitrososphaeraceae" NA
[10,] "Bacteria" "Firmicutes" "Bacilli" "Bacillales" "Planococcaceae" "Chungangia"
substr(getSequences(derepFs[[1]])[1:10], 1, 20) [1] "TACAGAGGTCTCAAGCGTTG" "GACAGAGGGGGCAAGCGTTG" "TACGAAGGGGGCTAGCGTTG" "AAGGTGCCAGCCGCCGCGGT" "AGCGTGTCAGCAGCCGCGGT" [6] "AGTGTGTCAGCAGCCGCGGT" "ATGTGTGTCAGCAGCCGCGG" "CAAGTGTCAGCAGCCGCGGT" "CCCGTGTCAGCCGCCGCGGT" "CGTAGTGTCAGCCGCCGCGG"

benjjneb commented 5 years ago

That all looks pretty reasonable, perhaps a small low-complexity mode, but not something that would explain repeated sequences being that low...

Would you mind sharing a single sample with me? I'd be interested to poke around a little bit. Also can you clarify the primer set you are using?

REngstrand commented 5 years ago

Absolutely happy too! Thank you for all of your help. I'm using the improved 515F and 806R primers for 16S. How should I share them with you?

REngstrand commented 5 years ago

I sent an email with the files to the email you listed in your GitHub profile. Thank you for your help! Let me know if you prefer the data another way

benjjneb commented 5 years ago

Got the files thanks, may be a few days before I get to them though.

benjjneb commented 5 years ago

I took a look at the forward file you got, and I believe your amplicons were sequenced using heterogeneity spacers that have not been removed. For example, I identified a set of 12 sequences in the raw data, all present as singletons, and all assigned to the "Angustibacter" genus. I visualized the first 25 nts of all these sequences and osberved the following:

  GGATGTGTCAGCAGCCGCGGTAATA
TGTAAAGTGTCAGCCGCCGCGGTAA
ACTGTGTCAGCAGCCGCGGTAATAC
GTATTGGTGTCAGCAGCCGCGGTAA
  CGCTGTGCCAGCCGCCGCGGTAATA
 GTTACGTGCCAGCAGCCGCGGTAAT
   AAGGTGTCAGCAGCCGCGGTAATAC
  CCCAGTGTCAGCAGCCGCGGTAATA
   TATGTGCCAGCAGCCGCGGTAATAC
   CGTGTGTCAGCCGCCGCGGTAATAC
  GTTTGTGTCAGCAGCCGCGGTAATA
  TCTTGTGTCAGCCGCCGCGGTAATA

That is, they are all identical subsequent to the intial stretch of sequences that varies in both base composition and length. These probable heterogeneity spacers are added as a way of making the amplicon sequencing library heterogeneous in the beginning to help Illumina base-calling, whereas most of the time people just add phiX or mix in genomic DNA they want to sequence anyway.

Unfortunately we do not currently have an automated solution to deal with heterogeneity spacers. But to use DADA2, these need to be removed prior to beginning the pipeline, as the pipieline is assuming that all reads are starting from the same position in the amplicon, and that is violated by this sort of library preparation.