output fate of every read

audy commented 7 years ago

I'm doing some benchmark comparisons to other pipelines. For this, I need to be able to output the following for each read:

read passes quality filtering
read is chimeric
read is overlapping
read is a valid sequence variant
sequence variant for read

("read pair" can be synonymous with "read" in this case)

Is there currently a way to do this with dada2?

benjjneb commented 7 years ago

There is currently no implemented way to record which reads were lost in the filtering step. That would have to be done by reading the raw and filtered files back in and comparing them.

derepFastq: The output of this function contains the $map of each read to the index of its corresponding dereplicated unique sequences.

dada: The output of this function contains a $map of each input dereplicated unique sequence to the index of its correponding denoised sequence variant.

mergePairs: Actually, no I don't think this is possible to follow on a per-read basis, hadn't thought about that one before. If you turn on returnRejects=TRUE you can get a data.frame with all the unique pairs of F/R unique sequences, including those that failed to merge.

removeBimeraDenovo: This will remove a defined set of sequence variants (which can be mapped back to reads), so by using base R commands, e.g. sq.chim <- getSequences(st)[!getSequences(st) %in% getSequences(st.nochim)] one can extract the sequences removed as chimeras and then follow them back.

It would be nice if this was easier though. While possible, it would require a lot of heavy lifting in terms of R hacking right now.

lucas-nebert commented 6 years ago

In addition to mapping which reads go to which sequence variant, is there any way of indicating the confidence level by which a read is assigned the sequence variant? As I understand it, DADA2 bins all reads into the predicted sequence variants. This can result in rare taxa (below DADA2 detection) incorrectly being subsumed into more abundant sequence variants. Perhaps there is even a way to set a confidence threshold, such that only the most likely reads are assigned to a sequence variant, and the lower-confidence reads are output separately for alternative OTU-picking strategies.

I have been working on a plant bacterial endophyte dataset, dominated by plant samples, with a few soil samples. I found I'm able to predict which soil the plant originated from with regular 97% clustering methods, but not with DADA2, even when pooling all samples in the dada() step. I need to look into this more, but my hunch is that DADA2 is binning rare taxa into common sequence variants, and as a result, it is less capable of finding differences (or similarities) between samples driven by rarer taxa. My dataset is relatively low-depth (average ~3600 reads), which means I am likely to have more "rare" taxa than the ideal dataset.

joey711 commented 6 years ago

Confidence

Yes, a confidence threshold is either specified by you, or (more commonly) the default value is used (omega, I think). If your concern is increasing specificity of a read assignment to a sequence, then you can tune this threshold.

Also, a reminder that dada2 is not an OTU-picking strategy. You are always free to attempt OTU clustering on dada2 output (sequences/abundances) if that fulfills your immediate/quick requirements. I do not recommend this as a long term strategy for amplicon sequencing, as we have published.

Prediction

Be careful in your interpretation here, because there are many explanations for the predictive modeling result that you described, and perhaps only one of them would be a failing of dada2.

A very common explanation for what you described is overfitting, because OTU-clustering methods generate many arbitrary (not real) OTUs that are esoteric to a sample because they are simply chance events. If you were not careful to remove these noisy nuisance variables from your prediction, they would appear to be highly predictive of your soil, which you would then reasonably interpret to also mean they were informative.

The fact that a much less noisy but higher-resolution processing of the sequences results in a lower-performance prediction suggests to me (on first hearing without seeing the data or the prediction method that you used, etc) that the soil samples you're comparing are actually harder to reliably classify than you previously thought. An easy (but coarse) test of this explanation is to perform a 97% OTU-clustering (e.g. UPARSE) on the dada2 ASVs as mentioned above. If this does not suddenly make the prediction comparable to your original OTU-clustering-based prediction result, then it is probably not a matter of the more-aggregated taxonomic level being a better predictor.

It's hard for us to diagnose without knowing more details about your OTU clustering and prediction method, seeing the data, etc., but I do want to emphasize that you should question your assumption that the OTU-to-prediction result is correct and anything not reproducing that is wrong.

benjjneb commented 6 years ago

Perhaps there is even a way to set a confidence threshold, such that only the most likely reads are assigned to a sequence variant, and the lower-confidence reads are output separately for alternative OTU-picking strategies.

This is a good idea! I have been vaguely thinking about something like that, maybe it's time to add such a feature.

I'll note that there is a crude way to do this now. You can replace the abundances output by dada with the $n0 column in the dada()$clustering data.frame. That is equivalent to using only reads that have 0 mismatches to the inferred ASV.

I have been working on a plant bacterial endophyte dataset, dominated by plant samples, with a few soil samples. I found I'm able to predict which soil the plant originated from with regular 97% clustering methods, but not with DADA2, even when pooling all samples in the dada() step. I need to look into this more, but my hunch is that DADA2 is binning rare taxa into common sequence variants, and as a result, it is less capable of finding differences (or similarities) between samples driven by rarer taxa. My dataset is relatively low-depth (average ~3600 reads), which means I am likely to have more "rare" taxa than the ideal dataset.

As @joey711 said, there are a lot of things that might go into that, so its hard to pinpoint a specific issue. Again as @joey711 mentioned, I would strongly urge you to consider changing the default OMEGA_A if appropriate. OMEGA_A is a single parameter that controls the sensitivity of dada2. It is set quite conservatively by default (OMEGA_A=1e-40), but it is perfectly valid to set a much "smaller" value (e.g. OMEGA_A=1e-4) especially in low-depth/high-complexity samples where more sensitivity is desirable.

lucas-nebert commented 6 years ago

Exciting! I will try changing the OMEGA_A parameter to perhaps uncover more rare sequence variants (though at the risk of increasing false positives).

However, is it not still the case that all reads are retained after the dada() function, implying that some reads from rare sequence variants my be incorrectly subsumed into more common sequence variants? Filtering reads out based on confidence (or similarity to the ASV, as @benjjneb suggests) may help avoid calling them something they are not.

I apologize, used the word "predict" very loosely. More specifically, using 97% clustering, I found plants share more bacterial OTU's with the soils from where they originated, compared to other soils - as one would expect. I did not see this using DADA2/ASVs. However, I will tinker with the OMEGA_A parameter and see if this improves my ability to resolve more minute similarities/differences between samples.

Many thanks for your quick, helpful responses, @joey711 and @benjjneb

jjmmii commented 6 years ago

I am looking forward to this feature for printing reads that failed to merge.

mergePairs: Actually, no I don't think this is possible to follow on a per-read basis, hadn't thought about that one before. If you turn on returnRejects=TRUE you can get a data.frame with all the unique pairs of F/R unique sequences, including those that failed to merge.

I tried returnRejects=TRUE, then x=dadaFs[["sample_1"]], and inspected x, but the rows where accept=FALSE (assuming this means a rejected read pair) did not show the sequence (i.e. first column is empty). For example:

> head.list(x,50)
$sequence
 [1] "TACGGAGGATCCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGATGGATGTTTAAGTCAGTTGTGAAAGTTTGCGGCTCAACCGTAAAATTGCAGTTGATACTGGATGTCTTGAGTGCAGTTGAGGCAGGCGGAATTCGTGGTGTAGCGGTGAAATGCTTAGATATCACGAAGAACTCCGATTGCGAAGGCAGCCTGCTAAGCTGCAACTGACATTGAGGCTCGAAAGTGTGGGTATCAAACAGGATTAGATACCCTGGTAGTCCACACGGTAAACGATGAATACTCGCTGTTTGCGATATACGGCAAGCGGCCAAGCGAAAGCGTTAAGTATTCCACCTGGGGAGTACGCCGGCAACGGTG"
(truncated...)
[47] ""
(truncated....)
$accept
 [1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
[13]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
[25]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
[37]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE

May I please know how to get the rejected sequences?

benjjneb commented 6 years ago

May I please know how to get the rejected sequences?

You can get the forward and reverse sequences from the rejected pairs, but there is no rejected merged sequence generated. To do that, use the indices in the $fwd and $rev columns of the merger data.frame to index back into the dada-class objects:

rejects <- merger[!merger.accept,]
rejects.fwd.seq <- dadaF$sequence[rejects$fwd]
rejects.rev.seq <- dadaR$sequence[rejects$fwd]

jjmmii commented 6 years ago

Thanks so much for your reply. The first line did not work:

> rejects <- mergers[!mergers.accept,]
Error: object 'mergers.accept' not found

I guess you meant

> x_with_rejects<-mergers_with_rejects[["sample_1"]]
> nrow(x_with_rejects)
[1] 1123
> rejects<-x_with_rejects[!x_with_rejects$accept,]
> nrow(rejects)
[1] 768
> head(rejects)
   sequence abundance forward reverse nmatch nmismatch nindel prefer accept
   61                119       2       1     26         4      0      2  FALSE
   70                100       3       1     26         4      0      2  FALSE
   71                100       3       6     28         2      0      1  FALSE
   74                 91       7       1     22         7      1      2  FALSE
   83                 67       6       1     24         6      0      2  FALSE
   91                 53       8       1     28         2      0      2  FALSE
> rownames <- as.numeric(rownames(rejects))

Then I tried to get 1123 rows from dadaFs[["sample_1"]], but turns out it isn't:

> dadaFs_x <- dadaFs[["sample_1"]]
> dadaFs_x
dada-class: object describing DADA2 denoising results
168 sample sequences were inferred from 16299 input unique sequences.
(...)

I also can't get:

> nrow(dadaFs_x$sequence)
NULL

Where will I find the 1123 corresponding reads? Thank you.

benjjneb commented 6 years ago

The $fwd and $rev columns are the index into the dada-class object. So continuing from your code...

dadaFs_x <- dadaFs[["sample_1"]]
rejects.fwd.seq <- dadaFs_x$sequence[rejects$fwd]
dadaRs_x <- dadaRs[["sample_1"]]
rejects.rev.seq <- dadaRs_x$sequence[rejects$rev]

That will get you a list of the unique forward sequneces, and unique reverse sequences, for each non-merging pair of unique sequences.

benjjneb commented 6 years ago

@seedmicrobes In addition to mapping which reads go to which sequence variant, is there any way of indicating the confidence level by which a read is assigned the sequence variant? As I understand it, DADA2 bins all reads into the predicted sequence variants. This can result in rare taxa (below DADA2 detection) incorrectly being subsumed into more abundant sequence variants. Perhaps there is even a way to set a confidence threshold, such that only the most likely reads are assigned to a sequence variant, and the lower-confidence reads are output separately for alternative OTU-picking strategies.

This feature (or near to it) has been added in bf196a274ee6c0e038fe87a21355c95e163d47cb

From ?setDadaOpt:

OMEGA_C: The threshold at which unique sequences inferred to contain errors are corrected in the final output. The probability that each unique sequence is generated at its observed abundance from the center of its final partition is evaluated, and compared to OMEGA_C. If that probability is >= OMEGA_C, it is "corrected", i.e. replaced by the partition center sequence. The special value of 0 corresponds to correcting all input sequences, and any value > 1 corresponds to performing no correction on sequences found to contain errors. Default is 0 (i.e. correct all).

benjjneb commented 6 years ago

OMEGA_C = 1e-40 by default as of: 1f6e4681de199317dc459ae694e48b0dbc0f6fe5

benjjneb commented 6 years ago

Full read fate mapping won't make it into 1.8. Still desirable long-term though.

aistBMRG commented 5 years ago

Just curious as to whether the latest version of dada2 allows tracking the fate of each read. In my case, I have PE reads were the forward and reverse reads are from different genes (fusion PCR amplicons) so I am denoising both reads separately but would need to link both members of the read pairs again in terms of denoised reads for downstream analysis.

Thanks.

Dieter.

benjjneb commented 5 years ago

@aistBMRG

We still haven't implemented full end-to-end read tracking, hence this issue still being open, but read tracking from filtered fastqs to denoised sequences is already there. You can map each read, by position, in the dereplicated fastq file to its denoised sequence with:

fn <- "mysample.fastq"
drp <- derepFastq(fn)
dd <- dada(drp, err=my.err.model, multi=TRUE)
read.map <- dd$map[drp$map]
# map is a vector with entries corresponding to each read in the fastq, and values corresponding to the index of the denoised sequence
read.to.denoised.sequence <- dd$sequence[dd$map[drp$map]]

I am denoising both reads separately but would need to link both members of the read pairs again in terms of denoised reads for downstream analysis.

I think this is already what is done by the mergePairs function, have you looked at that? mergePairs links together the corresponding F/R reads using their denoised sequences. By default it will throw out pairs that don't overlap, but you can use mergePairs(..., justConcatenate=TRUE) to turn off that behavior.

aistBMRG commented 5 years ago

Thanks for the prompt reply -- the justConcatenate option will do exactly what I need!

michael-weinstein commented 5 years ago

I am thinking about ways to pull this off as well... looking in at the code, I am wondering if this might be easier to do initially with unpaired reads. I'm digging in to the code a bit and wondering if it would be possible to kick out a table of unique read fate at each step. This seems like it should work reasonably well (although it feels a bit hacky) for fate tracing on non-paired reads (I am not sure about dealing with the read-merging process yet). Essentially at each step, I could create a table that just contains [original read seq]:[read fate seq] as the reads are processed (for performance sake, I might store the sequences as MD5 hashes). The only one I haven't gotten into enough to know if it would work is the dada( ) function. If this approach sounds like it could work, could I create a fork and try it out?

benjjneb commented 5 years ago

@michael-weinstein Doing the single-read case is definitely easier. Much of the read fate information is available from the $map output of derepFastq and dada. Right now there is no read fate output from filtering or chimera removal. I think people would find an end-to-end read fate table useful though.

michael-weinstein commented 5 years ago

I’m going to take a crack at this when I get a bit of time. It sounds like an interesting challenge.

How did you like FIGARO, by the way?

benjjneb commented 5 years ago

I read the paper but trying out FIGARO is still on the to-do list. (was on vacation after ASM) Might be worth opening a GitHub issue on it.

On Wed, Jul 10, 2019 at 12:56 PM Michael Weinstein notifications@github.com wrote:

I’m going to take a crack at this when I get a bit of time. It sounds like an interesting challenge.

How did you like FIGARO, by the way?

— You are receiving this because you modified the open/close state. Reply to this email directly, view it on GitHub https://github.com/benjjneb/dada2/issues/354?email_source=notifications&email_token=ABMHKVD7DI46KUXUICWDWVTP6YICPA5CNFSM4EA4FWU2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODZUCY7A#issuecomment-510143612, or mute the thread https://github.com/notifications/unsubscribe-auth/ABMHKVBOBZXFFRRYPGBCMY3P6YICPANCNFSM4EA4FWUQ .

michael-weinstein commented 5 years ago

Sounds good. I wasn't sure if you wanted me to open an issue on it. I will do so right now.

benjjneb / dada2

output fate of every read #354

Confidence

Prediction