Closed dleopold closed 7 years ago
This is a desirable feature, and there is nothing in principle stopping DADA2 from dealing with variable length amplicons. However there is some non-trivial work needed to make this happen, and I don't have the development time to do it right now.
I am making this an enhancement request, and recording a few of the places where the code-base will need updating for when I can get to this.
As a temporary workaround, could you process fungal ITS paired Illumina reads (say 2x250) without trimming adapters through dada2, so they are all the same length, and then trim adapters off before using mergePairs()?
One problem with this is that many amplicons will be shorter (some much shorter) than 250 bp and the 3' end of the read will be junk. I am not sure how dada would deal with that.
Yes, that is my concern, too. It could still work, though, if the 'junk' ends are consistent. @benjjneb thoughts?
My thoughts are that it's a clever idea and seems to me like it could work in principle, but it could also run into the kind of issues-in-practice that Dylan is bringing up. I have almost no experience with ITS data myself, so my intuition isn't worth much on this one.
If you want to give it a try, I'd love to hear if it works or doesn't work.
I was just taking another look at this and I am not sure how it would work. At least for ITS1, there can be >150 bp variation in amplicon length. For shorter reads this means that the sequencing goes past the end of the opposite end adapter and the remaining sequence is all Q=2. There are still bases called for these junk tails, but I highly doubt they are consistent. Also, waiting to trim the sequences would require skipping the quality filtering as well, so it does not seem feasible to me.
I've actually had some success with this. I will have to look more into why these sequences with the junk tails are passing quality filtering, but they are. I believe I used similar thresholds as in the tutorial. I was able to trim the sequences in the dada class objects based on the primer sequences and then merge paired reads. The merged seqs do span 200+ bp and include seqs well below the read length. It does seem true that there is some variability in the 'junk' regions, so after trimming, there are duplicate seqs, but these are merged when building the seq table. Of course, not the most elegant solution, but it seems to be a usable workaround for me for now. Also, for what it's worth, the overall results (taxa and sample dissimilarities) seem pretty similar to what I'm getting with another pipeline.
The new vectorized aligner handles sequences of variable length (4a7cdbf6f839d083954788520c5f886163341da0).
There is still work to do to allow the algorithm-as-a-whole to handle variable lengths, as well as choices to be made (how to include length differences in the error model?). But the core infrastructure is now ready to support variable amplicon lengths.
I now have a version of the core algorithm that works with variable length amplicons (checked into the ITS branch).
However, what would be very helpful would be some test data on an ITS dataset with variable length reads. Ideally a mock dataset, but any reasonably simple community would be great.
I tested on the mock communities in doi: 10.1128/AEM.02576-16 and DADA2 performed better than the best results they reported, but those ITS reads were trimmed to a constant length.
Any help here much appreciated! I don't have experience with ITS data.
As of version 1.3.3 (cddce98a98894e4a621f6767e61fe9aaed638e06) dada2 supports ITS. This includes handling variable length amplicons and assigning taxonomy against the UNITE database.
In my testing on 3 different mock communities (all Illumina sequenced) the accuracy of dada2 was as exceptional on ITS data as it was on 16S data, but I'm happy to hear about issues that crop up for others, as ITS data is not my specialty.
NICE! I got a lot of questions about ITS in Cesky Krumlov. Glad to hear it is coming along.
Would love to see some tests from ITS gurus that have their own carefully validated data. Hopefully someone can jump in there.
Does this also relate to effectiveness on 454 data?
@joey711 No effect for 454 data really, as global trimming is fine for 454. We already added the HOMOPOLYMER_GAP_PENALTY feature which addresses 454's tendency to make mistakes on homopolymers.
Can you provide a little detail about how the variable length sequences are handled, and about the basic sequence pre-processing workflow you used for ITS data? Specifically, how do you quality-trim the variable length sequences (particularly reads that end with a very low quality tail because the read length is longer than the amplicon for some, but not all, amplicons)? Or are these tails removed during the denoising process? Typically I would use some type of quality-aware length trimming (e.g. a sliding window trimmer) or use a paired-end merging algorithm that trims the overhangs after merging. However, the former approach would result in variable length reads for the same "true" sequence. Is that OK for the denoising algorithm?
@dleopold Below is the workflow I used on some paired-end test data, for which the forward/reverse primer sequences had already been removed:
library(dada2); packageVersion("dada2")
# Filter one forward/reverse pair at a time (primers already removed)
fastqPairedFilter(c(fnF, fnR), c(filtF, filtR), minLen=50, truncQ=2, maxEE=2, rm.phix=TRUE)
# Dereplicate and denoise each direction, learning the error model (selfConsist)
drpF <- derepFastq(filtF)
ddF <- dada(drpF, err=NULL, selfConsist=TRUE, multithread=TRUE)
drpR <- derepFastq(filtR)
ddR <- dada(drpR, err=NULL, selfConsist=TRUE, multithread=TRUE)
# Merge the denoised pairs, then remove chimeras
mm <- mergePairs(ddF, drpF, ddR, drpR, verbose=TRUE)
cc <- removeBimeraDenovo(mm, verbose=TRUE)
The primer removal step (performed prior to the dada2 workflow above) had trimmed both primers, including where the forward reads had read into the reverse primer and vice-versa. The truncQ parameter trimmed off tails where the quality had crashed.
The dada algorithm does not count length variation as differentiating between sequences (technically: end-gaps in alignments are ignored). Thus, amplicons from the same sequence with variable lengths due to trimming low quality tails are collapsed together, with the most abundant such sequence representing the group. After merging, you should get SVs that span from forward to reverse primer regardless.
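As a rough illustration of "end-gaps are ignored" (a minimal sketch in Python, not dada2's actual implementation; `same_up_to_end_gaps` is a made-up name):

```python
def same_up_to_end_gaps(a: str, b: str) -> bool:
    """Two sequences compare as identical when they differ only by
    end-gaps, i.e. the shorter is a contiguous substring of the longer."""
    shorter, longer = sorted((a, b), key=len)
    return shorter in longer

# A quality-truncated read collapses with the full-length amplicon:
print(same_up_to_end_gaps("ATGCAGAGAGGCAT", "ATGCAGAGAGG"))  # True
# An internal mismatch still keeps the sequences separate:
print(same_up_to_end_gaps("ATGCAGAGAGGCAT", "ATGCATAGAGG"))  # False
```

Within each such group, the most abundant sequence is kept as the representative, which is how the truncated length variants end up reported as a single primer-to-primer SV.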
The one spot to keep an eye on in the output is the mergePairs step: check that your paired reads are spanning the amplicon and being merged.
Hi-
I was working through the variable length ITS workflow, and just wanted to check that it's correct that I'm supposed to get a few "Not all sequences were the same length" messages throughout the derepFastq(), dada(), and makeSequenceTable() commands.
I know my reads have variable length; I was just making sure there wasn't a flag or something I needed to put into these commands to say that they're variable length by design.
Thanks for adding this functionality.
@benkraj Yes, you are free to ignore those messages. We probably should start removing those messages now that variable length amplicons are supported.
Hi @benjjneb,
Thanks for the detailed responses above. I was starting to proceed with your suggested workflow, and ran into a small issue right away. I run:
fastqPairedFilter(c(fnFs, fnRs), c(filtFs, filtRs), minLen=50, truncQ=2, maxEE=2, rm.phix=TRUE)
And it gives me: "Error in fastqPairedFilter(c(fnFs, fnRs), c(filtFs, filtRs), minLen = 50, : Two paired input file names required." I looked at the fastqPairedFilter documentation and couldn't tell if I needed to compress and directly save the directory of fastq files as an object, or if there's some way to keep the path definitions?
@mvannuland fastqPairedFilter is not "vectorized" in the R sense: you can't give it a vector of fnFs and fnRs etc., it has to get them one pair at a time. So you either need to put this inside a loop, or use the newer filterAndTrim function, which is available in package version 1.4. Examples of both:
for(i in seq_along(fnFs)) {
fastqPairedFilter(c(fnFs[[i]], fnRs[[i]]), c(filtFs[[i]], filtRs[[i]]), minLen=50, truncQ=2, maxEE=2, rm.phix=TRUE)
}
# Or...
filterAndTrim(fnFs, filtFs, rev=fnRs, filt.rev=filtRs, minLen=50, truncQ=2, maxEE=2, rm.phix=TRUE, multithread=TRUE)
# May have to turn off multithreading on Windows
Yep, that fixed the issue. Thanks!
Hi @benjjneb,
I have a question regarding your statement in this thread:
The dada algorithm does not count length variation as differentiating between sequences (technically: end-gaps in alignments are ignored). Thus, amplicons from the same sequence with variable lengths due to trimming low quality tails are collapsed together, with the most abundant such sequence representing the group. After merging, you should get SVs that span from forward to reverse primer regardless.
Is this still how the algorithm works? If so, then what is the purpose of collapseNoMismatch? Wouldn't the dada algorithm already have collapsed sequences with 100% identity with variable lengths?
@shandley collapseNoMismatch is primarily intended for more unusual circumstances. For example, barcodes that were different lengths in different samples. Or combining a run of 515F-805R with 515F-806R. Or when heterogeneity spacers were included. Basically anything that might introduce systematic length variation between samples or sets of samples.
In standard setups it's not needed.
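To make that concrete, here is a toy sketch of a collapseNoMismatch-style merge (Python, hypothetical helper names; not the package's implementation): columns of a sequence table whose sequences match with no internal mismatches, differing only by systematic end-trimming, get summed together.

```python
def identical_up_to_trimming(a: str, b: str) -> bool:
    # No internal mismatches allowed; only the sequence ends may differ.
    shorter, longer = sorted((a, b), key=len)
    return shorter in longer

def collapse_no_mismatch(seqtab: dict) -> dict:
    """seqtab maps sequence -> per-sample count vector (one column per ASV).
    Columns differing only by systematic length variation (e.g. 805R vs
    806R primer sets, heterogeneity spacers) are merged; for simplicity
    this sketch keeps the longest sequence as the representative."""
    merged = {}
    for seq in sorted(seqtab, key=len, reverse=True):
        for rep in merged:
            if identical_up_to_trimming(seq, rep):
                merged[rep] = [x + y for x, y in zip(merged[rep], seqtab[seq])]
                break
        else:
            merged[seq] = list(seqtab[seq])
    return merged

# Two runs of the same locus whose amplicons differ by 1 nt in length:
seqtab = {"ACGTACGTAC": [5, 0], "ACGTACGTA": [0, 7]}
print(collapse_no_mismatch(seqtab))  # {'ACGTACGTAC': [5, 7]}
```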
Thanks @benjjneb,
So is the default still to 'collapse' shorter reads with 100% identity into a single ASV?
So if there were two ASVs like this:
ATGCAGAGAGGCAT
ATGCAGAGAGG
They would just be a single ASV?
Would the same be true for this?:
ATGCAGAGAGGCAT
--GCAGAGAGG---
We are working on using dada2 to resolve metagenomic sequence variants which are of varying length and would like to understand how these situations are dealt with.
Thanks!
Scott
We are working on using dada2 to resolve metagenomic sequence variants which are of varying length and would like to understand how these situations are dealt with.
Well... if there is large variation in length, there is a side-effect of the BAND_SIZE parameter that causes length variants differing by more than the BAND_SIZE to be separated by the algorithm. They would be collapsed back together by collapseNoMismatch, but output as separate variants by dada.
Depending on how far off-label you are (i.e. viral metagenomics), what we see is that this effectively results in a "tiling" of the viral genomes with stacks of error-corrected variants with start positions every ~BAND_SIZE nucleotides along the genome.
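A toy sketch of that side-effect (Python, illustrative only; `within_band` is a made-up name, and BAND_SIZE=16 is assumed as the default): a banded aligner can only absorb an end-gap run up to the band width, so variants whose lengths differ by more than BAND_SIZE stay separate even when one exactly contains the other.

```python
BAND_SIZE = 16  # assumed default band width

def within_band(a: str, b: str, band: int = BAND_SIZE) -> bool:
    """Illustrative only: collapse two length variants only if the banded
    alignment could register the length difference, i.e. the difference
    fits inside the band."""
    shorter, longer = sorted((a, b), key=len)
    return (len(longer) - len(shorter)) <= band and shorter in longer

genome_window = "ACGT" * 30          # 120 nt reference stretch
short_variant = genome_window[:110]  # 10 nt shorter: inside the band
far_variant   = genome_window[:80]   # 40 nt shorter: outside the band
print(within_band(genome_window, short_variant))  # True  -> collapsed
print(within_band(genome_window, far_variant))    # False -> separate variant
```

This is why, on long reference molecules, the output looks like stacks of variants tiled roughly every band-width along the genome.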
Currently, DADA2 requires equal length amplicons for input. Is there any chance that variable length sequences will be possible in future updates of DADA? I often work with fungal ITS data and would like to be able to apply a denoising approach such as this.