benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0
459 stars 142 forks source link

Extracting unmerged reads #574

Closed adityabandla closed 5 years ago

adityabandla commented 5 years ago

I am using the 515F 926R primer which targets all three domains of life. A known challenge with this primer set with regards to amplicon sequencing is that the eukaryotic sequences are ~150bp longer than the prokaryotic ones. So, it's not surprising that these sequences fail to merge.

I understand that unmerged sequences can be retained using the returnRejects=TRUE option in the mergePairs command. Is there a way to extract these sequences and write them to a fasta file in a sample-specific manner, so that they can be used for annotating eukaryotic sequences?

benjjneb commented 5 years ago

Is there a way to extract these sequences and write them to a fasta file in a sample-specific manner, so that they can be used for annotating eukaryotic sequences?

No automatic way right now. You can retrieve the raw F/R sequences for each row in the merger data.frame though. The $forward and $reverse columns have the indexes to the corresponding sequences in the corresponding dada-class object. So the code below will retrieve the unmerged F/R sequences for each row in the merger data.frame:

# Single sample
merger <- mergePairs(ddF, drpF, ddR, drpR, returnRejects=TRUE)
sqF <- ddF$sequence[merger$forward]
sqR <- ddR$sequence[merger$reverse]

ps: This is another use case for adding a new hybrid merge/concatenate option to mergePairs. See also #565

adityabandla commented 5 years ago

Thanks Ben for the pointers! By raw sequences, you mean sequences which have not been denoised?

benjjneb commented 5 years ago

Nope sorry bad terminology on my part. The denoised F/R sequences are what is being extracted there (prior to merging).