benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0

Using DADA2 with short barcode sequences #576

Closed: apcamargo closed this issue 5 years ago

apcamargo commented 6 years ago

I'm planning to use DADA2 to denoise sequencing data of a very short barcode sequence (20 bp). Aside from changing the minLen parameter, is there anything else I should do?

Thank you for your attention.

benjjneb commented 6 years ago

A couple of things will help performance. Set VECTORIZED_ALIGNMENT=FALSE when calling learnErrors and dada, as the vectorized algorithm is slower on really short sequences. Consider setting BAND_SIZE to something smaller than the default of 16 as well (if performance is an issue). If this is extremely deep data, as in some barcode experiments, it may be worth evaluating the OMEGA_A cutoff. Also, to be safe, I'd probably set USE_KMERS = FALSE, or at least test whether it matters, as the kmer cutoffs were designed for longer sequences.
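
A minimal sketch of where those options go, set via setDadaOpt before learnErrors and dada (the values below are placeholders to show the knobs, not recommendations):

library(dada2)

# Illustrative only: settings oriented toward very short (~20 bp) barcodes.
setDadaOpt(
  VECTORIZED_ALIGNMENT = FALSE,  # vectorized alignment is slower on very short sequences
  USE_KMERS = FALSE,             # kmer cutoffs were designed for longer sequences
  BAND_SIZE = 8,                 # narrower than the default of 16
  OMEGA_A = 1e-40                # abundance p-value cutoff; may be worth revisiting for very deep data
)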

If you have paired sequencing that covers the entire barcode: merging first (with an external program), extracting the barcode, and then running DADA2 will probably improve sensitivity to really rare stuff a bit.

@cd-mcfarland may have more suggestions, as he explored DADA2 on barcodes pretty deeply in his work developing TubaSeq for tumor barcoding, see also the Methods in that paper: https://doi.org/10.1038/nmeth.4297

Also, it will probably work pretty well as is, but a lot of the above suggestions are to maximize both the speed and the sensitivity/specificity on really rare stuff in high-depth barcode libraries.

apcamargo commented 6 years ago

Thank you for your response.

In a test run, I used BAND_SIZE=8 and I could find 3608 out of 6363 barcodes. I'm considering setting OMEGA_A to 1e-45. Do you think this could help?

I'll try disabling USE_KMERS and VECTORIZED_ALIGNMENT.

I didn't know that I could merge before denoising. I thought that this would interfere with the denoising step. I'll try that too.

benjjneb commented 6 years ago

> I'm considering setting OMEGA_A to 1e-45. Do you think this could help?

I'd try pre-merging and turning off USE_KMERS first. Do you know what the expected barcodes are? If so, you can tell dada what they are with priors = c("BARCODE_SEQ1", "BARCODE_SEQ2", ...), and it will raise sensitivity to those barcodes without harming specificity.
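
A minimal sketch of that, assuming the expected barcodes sit one per line in a hypothetical expected_barcodes.txt and that derep and learned.err are your dereplicated reads and learned error rates:

# Hypothetical file name; priors just needs a character vector of the expected sequences.
barcodes <- scan("expected_barcodes.txt", what = character())
dd <- dada(derep, err = learned.err, priors = barcodes, multithread = TRUE)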

> I didn't know that I could merge before denoising. I thought that this would interfere with the denoising step. I'll try that too.

It's OK if the entire region you are running dada on is merged, rather than a mixture of unmerged fwd, merged, and unmerged reverse as is usual in longer amplicons.

apcamargo commented 6 years ago

Hi Benjamin,

A follow-up on what I did after your suggestions: I merged the reads using VSEARCH (as it uses the exact posterior Q computation described by Robert Edgar), removed the forward and reverse primers with cutadapt, and used the resulting FASTQ files as input to DADA2.

The script I executed is reproduced below. As you suggested, I turned USE_KMERS and VECTORIZED_ALIGNMENT off and used the barcode sequences as priors.

library(dada2)

# Options for short barcodes: disable the kmer screen and vectorized alignment,
# narrow the alignment band, and set the OMEGA_A cutoff.
setDadaOpt(OMEGA_A = 1e-40, BAND_SIZE = 10, USE_KMERS = FALSE, VECTORIZED_ALIGNMENT = FALSE)

# Total read count of a dada2 object (sum of unique-sequence abundances).
getN <- function(x) sum(getUniques(x))

# Expected barcode sequences (used as priors) and the merged, primer-trimmed FASTQs.
barcodes <- scan(snakemake@input$BARCODES, character())
fn <- sort(snakemake@input$FASTQ)
sample.names <- sapply(strsplit(basename(fn), "_merged_trimmed_primers.fastq.gz"), `[`, 1)

filt <- file.path("/opt_hd/chemical_genomics/DADA2", paste0(sample.names, "_filtered.fastq.gz"))

# Quality filtering of the merged reads.
out <- filterAndTrim(fn, filt, minLen = 16, maxN = 0, maxEE = 1, minQ = 2, compress = TRUE, multithread = snakemake@threads)

# Learn error rates, dereplicate, and denoise with the barcodes as priors.
learned.err <- learnErrors(filt, MAX_CONSIST = 100, multithread = snakemake@threads)

derep <- derepFastq(filt, verbose = TRUE)
names(derep) <- sample.names

dada.reads <- dada(derep, err = learned.err, priors = barcodes, pool = TRUE, multithread = snakemake@threads)

seqtab <- makeSequenceTable(dada.reads)

# Track reads through the pipeline and write the tracking table.
track <- cbind(out, sapply(dada.reads, getN))
colnames(track) <- c("input", "filtered", "denoised")
track <- cbind("sample" = sample.names, track)
rownames(track) <- c()
write.table(track, file = snakemake@output$TRACK, sep = "\t", row.names = FALSE, quote = FALSE)

# Write the barcode (ASV) count table and a FASTA of the inferred sequences.
count.table <- t(seqtab)
count.table <- cbind("BARCODE" = rownames(count.table), count.table)
rownames(count.table) <- c()
write.table(count.table, file = snakemake@output$ASVTAB, sep = "\t", row.names = FALSE, quote = FALSE)

uniquesToFasta(seqtab, fout = snakemake@output$ASV_FASTA, ids = colnames(seqtab), width = 80)

This gave me just 1198 ASVs, far from the number I got without read merging (~5000). Now I'm running the algorithm again with OMEGA_A set to 1e-30. (EDIT: OMEGA_A = 1e-30 gave me 1228 ASVs.)

As I understand it, using merged reads should increase sensitivity because the higher Q scores would make the λ_ji values, and therefore the p-values, lower. I'm still trying to figure out what could have happened here.
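
For reference (paraphrasing the abundance p-value from the DADA2 paper from memory, so take the exact notation as approximate), the quantity behind that reasoning is roughly a Poisson tail conditioned on seeing the sequence at least once:

$$
p_A(j, i) = \frac{1}{1 - \rho_{\mathrm{pois}}(n_j \lambda_{ji}, 0)} \sum_{a = a_i}^{\infty} \rho_{\mathrm{pois}}(n_j \lambda_{ji}, a)
$$

where n_j is the abundance of the partition's center and a_i the observed abundance of sequence i; a smaller λ_ji shrinks the expected error abundance n_j λ_ji and with it the p-value.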

apcamargo commented 6 years ago

The first time I executed the pipeline was with DADA2 1.6, so maybe this is something related to OMEGA_C. I'd have to check the track file to see whether a considerable number of reads weren't denoised, but I doubt that that many ASVs are singletons.

benjjneb commented 6 years ago

Hm... something I didn't think of. Two new heuristics were added in 1.8 that you might want to try turning off as well: GREEDY=FALSE and GAPLESS=FALSE. I don't think they would be causing this, but the reality is that we don't have these very short barcode sequences in our test suites, so if it's a difference between 1.6 and 1.8, that is a possible explanation.

apcamargo commented 6 years ago

setDadaOpt(OMEGA_A = 1e-40, BAND_SIZE = 10, USE_KMERS = FALSE, VECTORIZED_ALIGNMENT = FALSE, OMEGA_C = 0, GREEDY = FALSE, GAPLESS = FALSE) gave me 1185 ASVs. I'll test the merged reads in version 1.6.0 without the priors and report the results here as soon as it's finished.

apcamargo commented 6 years ago

Using merged reads in DADA2 1.6 gave me 1159 ASVs. This indicates that the merged reads are the culprit here, but I don't know exactly why this is happening.

apcamargo commented 6 years ago

Is there any possibility that DADA2 erroneously detected that my sequences are ASCII base 64 instead of base 33? Most of the sequences have Q=41 at every base (so the FASTQ is full of "J"s). Maybe the algorithm is processing my sequences as if they were PHRED64, in which case "J" would be Q=10.

I've had this problem with DADA2 before when I was processing FASTQs in which almost all bases were "J"s. It removed all my reads during filtering because they were processed as bad-quality reads.

This is what I think may be happening: filterAndTrim removed the reads containing low-quality bases, so only reads made up of "I"s and "J"s were passed to the learnErrors and dada functions. Because of this, these functions may have guessed that the reads are PHRED64 instead of PHRED33 ("A" through "J" is a range shared by both PHRED33 and PHRED64).
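
Just to spell out the ambiguity (base R, nothing DADA2-specific):

# "J" is ASCII 74, so the decoded quality depends entirely on the assumed offset.
utf8ToInt("J") - 33  # 41 under PHRED+33
utf8ToInt("J") - 64  # 10 under PHRED+64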

benjjneb commented 6 years ago

> Is there any possibility that DADA2 erroneously detected that my sequences are ASCII base 64 instead of base 33?

It's possible; we rely on the ShortRead package to read in the quality scores, and I don't know exactly how it autodetects the encoding. There is an easy way to check, though, by inspecting the quality scores read in by derepFastq, e.g.:

foo <- derepFastq(fn)
summary(as.vector(foo$quals))

Are the quality scores showing up as the right values (e.g. lots of 41s)?

apcamargo commented 6 years ago

Yes, they seem to be right.

> summary(as.vector(derep$"C1110GR3_S134_filtered.fastq.gz"$quals))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   3.00   41.00   41.00   39.71   41.00   41.00   23572 

apcamargo commented 6 years ago

I did this for two FASTQ files. What would happen when the samples are pooled in dada?

I believe there's at least one sample (and probably many more) that is detected as PHRED64. Maybe pooling these samples in dada and learnErrors is messing things up.

Is there a way of setting the encoding environment-wide?

benjjneb commented 6 years ago

> Is there a way of setting the encoding environment-wide?

There isn't right now, but I'd recommend checking if that is the case directly. You can just loop through the files and do the same type of inspection as above on each one.
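
A minimal sketch of that loop, assuming filt is the vector of filtered FASTQ paths from your script above:

# Inspect the quality values derepFastq reads in for each file: values around 41
# mean PHRED+33 was detected correctly, while a ceiling near 10 for an all-"J"
# file would point to a PHRED+64 misdetection.
for (f in filt) {
  drp <- derepFastq(f)
  cat(f, "\n")
  print(summary(as.vector(drp$quals)))
}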

Also, we have seen the best results with pre-merged reads when using the usearch merging method, as it seems to give more realistic quality scores. Maybe that would be worth trying as an alternative to whatever method you are currently using?

apcamargo commented 6 years ago

I'm already using the USEARCH method (VSEARCH implementation). I'll try to loop through all samples then. Thank you!