benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0
469 stars 142 forks source link

Extracting original sequences #2034

Open SheepwormJM opened 2 weeks ago

SheepwormJM commented 2 weeks ago

Hello,

Thank you for DADA2.

I have my output table of ASV sequences and their abundance following chimera removal. However, what I would really like is a way to get the original sequence names back out. I would like to use these to filter an output file from PEAR to then align to a genome and call variants, obtaining allele frequencies.

Is there a way to do this?

Thanks, Jenni

benjjneb commented 2 weeks ago

the original sequence names

Can you clarify what this means? The original non-error-corrected read? Or the sequence ID related to the read?

SheepwormJM commented 2 weeks ago

Hi Ben,

Thanks. The sequence ID related to the read probably would be the aim. I've now found a way using BioPython to take the ASV output by DADA2 and filter my PEAR output file (consensus sequences of read pairs) for everything that is the same as the haplotype (technically that 'starts with it', but it's should be the same thing). I lost some of my reads that presumably were those that had been error corrected by DADA2, but the relative ratio of each ASV was pretty close to that output by DADA2. So, I'd be happy to use that, but also interested to know if there is a way to pull back out the original sequence IDs.

Thanks, Jenni

benjjneb commented 1 week ago

Read-fate tracking is there up through denoising, but requires a bit of custom R code to actually implement.

The key is the $map vectors recorded in dada-class and derep-class objects, as described here https://github.com/benjjneb/dada2/issues/354#issuecomment-339662233

When unique sequences fail denoising, they get a corresponding $map value of NA, so by tracing back the NAs in the output of denoising, they can be linked back to the set of reads.

help("derep-class") and help("dada-class") have a bit more information about the $map vectors. But this functionality is not wrapped up nicely, so I'm happy to assist further.