Closed REngstrand closed 5 years ago
I'm noticing that my reads and unique sequence numbers for each sample are pretty close to one another whereas in the DADA2 tutorial, they are not, so maybe my issue is I should be filtering my samples more?
We can fix the memory issue, but this is the more important concern. This is a warning sign that something isn't quite right. What do your quality profile plots look like, and what filtering did you do? Are you sure this is amplicon data, and not shotgun data? What environment are you sampling?
Ruhroh
My quality profile plots look alright, from what I can tell by just comparing them to the tutorial data, this is my first dataset. Quality plots are below
The filtering step I ran is: out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs, truncLen=c(260,150),maxN=0, maxEE=c(2,8), truncQ=2, rm.phix=TRUE,compress=TRUE, multithread=FALSE)
The sequences are definitely amplicon data, sampling bacteria in tropical forest soils and tropical mining soils
Hm... that all looks pretty reasonable.
If you don't have it yet. could you update to version 1.12 of the DADA2 package (which we can also use to avoid the memory error) and check for low-complexity sequences in your data, with e.g. plotComplexity(fnF.example)
? Also, can you check the taxonomy of the most abundant sequences in an example F and R read file, e.g.
drp <- derepFastq(fnF.example) # or fnR.example
unname(assignTaxonomy(drp$uniques[1:10], "path/to/silva_trainset.fasta"))
Finally, are you sure your reads do not have primers on them? You've not removed primers by trimLeft
so if they were on your reads they would remain and if they have ambiguous bases could be explaining the low levels of repeated sequences in your data. What is the primer set you are using? And what is the output of substr(getSequences(drp)[1:10], 1, 20)
?
Basically we're trying to see if there is a technical explanation for # uniques ~ # reads.
I updated from 1.11 to 1.12 and re-ran everything. I no longer have the memory issues for the pipeline tutorial and can get through to the end, but I can't run any of your troubleshooting steps. I've tried a few times and tried de-bugging but have had no luck. I do believe that my primers were removed, I used cutadapt on everything before running the tutorial but I will try to figure out how to check that in case that's the big issue here.
plotComplexity(fnFs) Error: 'Calloc' could not allocate memory (38903119 of 1 bytes)
unname(assignTaxonomy(derepFs$uniques[1:10], "C:/Users/rache/Desktop/cutadapt_data/16S/silva_nr_v132_train_set.fa.gz"))
Error in getUniques(object, collapse = collapse, silence = silence) : Unrecognized format: Requires named integer vector, dada-class, derep-class, sequence matrix, or a data.frame with $sequence and $abundance columns.
substr(getSequences(derepFs)[1:10], 1, 20) Error in getUniques(object, collapse = collapse, silence = silence) : Unrecognized format: Requires named integer vector, dada-class, derep-class, sequence matrix, or a data.frame with $sequence and $abundance columns.
Those commands want just one sample, not the list of all your samples. If you replace fnFs
with fnFs[[1]]
and derepFs$uniques[1:10]
with derepFs[[1]]$uniques[1:10]
they should work.
Thanks for the clarification... trying as [1] was one of my troubleshoots, can't believe I forgot that second bracket! That does work and yields the following:
unname(assignTaxonomy(derepFs[[1]]$uniques[1:10], "C:/Users/rache/Desktop/cutadapt_data/16S/silva_nr_v132_train_set.fa.gz")) [,1] [,2] [,3] [,4] [,5] [,6]
[1,] "Bacteria" "Verrucomicrobia" "Verrucomicrobiae" "Chthoniobacterales" "Chthoniobacteraceae" "Candidatus_Udaeobacter" [2,] "Bacteria" "Rokubacteria" "NC10" "Rokubacteriales" NA NA
[3,] "Bacteria" "Proteobacteria" "Alphaproteobacteria" "Rhizobiales" "Xanthobacteraceae" NA
[4,] "Bacteria" "Rokubacteria" "NC10" "Rokubacteriales" NA NA
[5,] "Bacteria" "Actinobacteria" "Actinobacteria" "Streptomycetales" "Streptomycetaceae" "Kitasatospora"
[6,] "Bacteria" "Rokubacteria" "NC10" "Rokubacteriales" NA NA
[7,] "Bacteria" "Actinobacteria" "Thermoleophilia" "Gaiellales" NA NA
[8,] "Bacteria" "Verrucomicrobia" "Verrucomicrobiae" "Chthoniobacterales" "Chthoniobacteraceae" "Candidatus_Udaeobacter" [9,] "Archaea" "Thaumarchaeota" "Nitrososphaeria" "Nitrososphaerales" "Nitrososphaeraceae" NA
[10,] "Bacteria" "Firmicutes" "Bacilli" "Bacillales" "Planococcaceae" "Chungangia"
substr(getSequences(derepFs[[1]])[1:10], 1, 20) [1] "TACAGAGGTCTCAAGCGTTG" "GACAGAGGGGGCAAGCGTTG" "TACGAAGGGGGCTAGCGTTG" "AAGGTGCCAGCCGCCGCGGT" "AGCGTGTCAGCAGCCGCGGT" [6] "AGTGTGTCAGCAGCCGCGGT" "ATGTGTGTCAGCAGCCGCGG" "CAAGTGTCAGCAGCCGCGGT" "CCCGTGTCAGCCGCCGCGGT" "CGTAGTGTCAGCCGCCGCGG"
That all looks pretty reasonable, perhaps a small low-complexity mode, but not something that would explain repeated sequences being that low...
Would you mind sharing a single sample with me? I'd be interested to poke around a little bit. Also can you clarify the primer set you are using?
Absolutely happy too! Thank you for all of your help. I'm using the improved 515F and 806R primers for 16S. How should I share them with you?
I sent an email with the files to the email you listed in your GitHub profile. Thank you for your help! Let me know if you prefer the data another way
Got the files thanks, may be a few days before I get to them though.
I took a look at the forward file you got, and I believe your amplicons were sequenced using heterogeneity spacers that have not been removed. For example, I identified a set of 12 sequences in the raw data, all present as singletons, and all assigned to the "Angustibacter" genus. I visualized the first 25 nts of all these sequences and osberved the following:
GGATGTGTCAGCAGCCGCGGTAATA
TGTAAAGTGTCAGCCGCCGCGGTAA
ACTGTGTCAGCAGCCGCGGTAATAC
GTATTGGTGTCAGCAGCCGCGGTAA
CGCTGTGCCAGCCGCCGCGGTAATA
GTTACGTGCCAGCAGCCGCGGTAAT
AAGGTGTCAGCAGCCGCGGTAATAC
CCCAGTGTCAGCAGCCGCGGTAATA
TATGTGCCAGCAGCCGCGGTAATAC
CGTGTGTCAGCCGCCGCGGTAATAC
GTTTGTGTCAGCAGCCGCGGTAATA
TCTTGTGTCAGCCGCCGCGGTAATA
That is, they are all identical subsequent to the intial stretch of sequences that varies in both base composition and length. These probable heterogeneity spacers are added as a way of making the amplicon sequencing library heterogeneous in the beginning to help Illumina base-calling, whereas most of the time people just add phiX or mix in genomic DNA they want to sequence anyway.
Unfortunately we do not currently have an automated solution to deal with heterogeneity spacers. But to use DADA2, these need to be removed prior to beginning the pipeline, as the pipieline is assuming that all reads are starting from the same position in the amplicon, and that is violated by this sort of library preparation.
I'm trying to run the dada command to apply the sample inference algorithm to dereplicated data for my 16S samples. It fails for my dadaFs, saying it cannot allocate vector of a different size each time I run it. I don't think my data set is that big with only 120 samples and my PC has 16GB of RAM so I figured it could handle this & I wouldn't have to do the big data pipeline. Should I be switching to big data pipeline or is there another fix for this? I'm noticing that my reads and unique sequence numbers for each sample are pretty close to one another whereas in the DADA2 tutorial, they are not, so maybe my issue is I should be filtering my samples more?
just in case... the output from