grouping with derepFastq

carolina671 commented 5 years ago

Hi,

I have used the dada2 pipeline to process sequencing data from nifH, a gene other than 16S. This gene can be present in multiple copies and can be transferred horizontally, so two species can share the exact same sequence, and it is not very conserved.

I would like to know if there is a way of modifying the derepFastq command to make it less stringent when grouping ASVs. For example, in the OTU table I have found ASV sequences that differ by only one base (according to NCBI nBLAST). Is there a way for such sequences to be classified as a single ASV? Please see the two sequences below:

AAGCACCACCACACAAAACACAGTAGCCGGATTGGCTGAAATGGGCAGAAAAGTAATGGTTGTAGGATGCGACCCAAAAGCAGACTCTACACGTTTACTCCTTCATGGGTTGGCACAAAAAACAGTATTGGATACACTTCGCGACGAAGGCGAAGATGTTGAATTGGACGACGTAATGAAAGAAGGATTTAAAAACACCAGCTGTGTGGAATCCGGCGGTCCGGAACCGGGCGTTGGTTGTGCAGGCCGTGGTATTATCACTTCTATCAACCTTTTGGAACAACTTGGCGCTTACGATGCTGATAAACGACTTGATTACGTATTTTACGATGTACTTGGCG

AAGCACCACCACACAAAACACAGTAGCCGGATTGGCTGAAATGGGCAGAAAAGTAATGGTTGTAGGATGCGACCCTAAAGCAGACTCTACACGTTTACTCCTTCATGGGTTGGCACAAAAAACAGTATTGGATACACTTCGCGACGAAGGCGAAGATGTTGAATTGGACGACGTAATGAAAGAAGGATTTAAAAACACCAGCTGTGTGGAATCCGGCGGTCCGGAACCGGGCGTTGGTTGTGCAGGCCGTGGTATTATCACTTCTATCAACCTTTTGGAACAACTTGGCGCTTACGATGCTGACAAACGACTTGATTACGTATTTTACGATGTACTTGGTG

Thanks,

benjjneb commented 5 years ago

derepFastq only deals with exact matching between sequences. The dada command generates the denoised ASVs. You can tell that command to ignore real differences less than some threshold with the MIN_HAMMING parameter, e.g. dada(..., MIN_HAMMING=2) requires at least 2 differences to split ASVs, hence would group together real ASVs that differ in just one place. But tread carefully. You could also consider clustering the output of dada2 with another tool to make traditional OTUs.
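
As a rough illustration of what a Hamming threshold means, the base R sketch below counts mismatched positions between equal-length sequences. The `hamming` helper and the toy sequences are made up for this example and are not part of dada2. dada2 computes differences internally on its own alignments, so treat this only as a way to sanity-check how many positions separate two ASVs before changing MIN_HAMMING.

```r
# Hypothetical helper (not a dada2 function): count mismatched positions
# between two equal-length sequences.
hamming <- function(a, b) {
  stopifnot(nchar(a) == nchar(b))
  sum(strsplit(a, "")[[1]] != strsplit(b, "")[[1]])
}

# Toy sequences, invented for illustration only.
x <- "ACGTACGTAC"
y <- "ACGTTCGTAC"  # differs from x at one position
z <- "ACGTTCGTAT"  # differs from x at two positions

d_xy <- hamming(x, y)
d_xz <- hamming(x, z)

# With a threshold of 2, a pair is far enough apart to stay separate only
# if it differs in at least 2 positions.
min_hamming <- 2
stays_separate <- c(xy = d_xy >= min_hamming, xz = d_xz >= min_hamming)
print(stays_separate)
```

Running the same helper on a real pair of ASVs is a quick way to confirm how many positions actually differ before deciding on a threshold.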

carolina671 commented 5 years ago

Thanks. I think it is not a good idea to merge these ASVs, because if they are both highly abundant they could be SNPs. Could you please recommend a package for making traditional OTUs?

benjjneb commented 5 years ago

In R, you can use the DECIPHER package to group ASVs into OTUs, see for example the IdClusters function. Does @digitalwright know of any DECIPHER OTU workflows out there?

You can also consider outside packages such as usearch/mothur/qiime and the like if your goal is OTUs.

digitalwright commented 5 years ago

Yes, it is straightforward to generate OTUs with the DECIPHER package. There are two ways to do it, both of which use the IdClusters function and start from unaligned ASVs in the form of a DNAStringSet.

(1) Cluster the sequences directly (inexact but scales well to tens of thousands of sequences):

    otus <- IdClusters(myXStringSet=dna, method="inexact", cutoff=0.03)

(2) Cluster via a distance matrix (exact but scales approximately quadratically):

    DNA <- AlignSeqs(dna, processors=NULL) # there are many variants of this (see documentation)
    d <- DistanceMatrix(DNA, processors=NULL)
    otus <- IdClusters(d, method="complete", cutoff=0.03, processors=NULL)
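
Once clusters are assigned, the per-ASV counts still need to be collapsed into per-OTU counts. The base R sketch below shows one possible way to do that with toy data. It assumes the clustering result is a data frame with a `cluster` column and one row per ASV (which is my understanding of what IdClusters returns, but check this on your version), and that the ASV table has samples in rows and ASVs in columns. The names `seqtab`, `otus`, and `otu_tab`, and all counts, are invented for illustration.

```r
# Toy ASV table: rows are samples, columns are ASVs (counts are made up).
seqtab <- matrix(c(10, 0, 5,
                    3, 7, 0),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("sampleA", "sampleB"),
                                 c("asv1", "asv2", "asv3")))

# Stand-in for a clustering result: one row per ASV, with a "cluster" column.
# Here asv1 and asv2 fall into the same OTU (assumed, for illustration).
otus <- data.frame(cluster = c(1L, 1L, 2L),
                   row.names = c("asv1", "asv2", "asv3"))

# Look up each column's cluster by name so the order of rows in `otus`
# does not have to match the column order of `seqtab`.
grp <- otus[colnames(seqtab), "cluster"]

# rowsum() sums rows that share a group label, so transpose to put ASVs in
# rows, sum within clusters, then transpose back to samples x OTUs.
otu_tab <- t(rowsum(t(seqtab), group = grp))
colnames(otu_tab) <- paste0("OTU", colnames(otu_tab))

print(otu_tab)
```

Per-sample totals are unchanged by this step, which makes a convenient sanity check. In more recent DECIPHER releases I believe IdClusters has been superseded by other functions (Clusterize and TreeLine), so the clustering call itself may need adapting while the collapsing step stays the same.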

I hope that helps!

carolina671 commented 5 years ago

Thanks for all the info

benjjneb / dada2

grouping with derepFastq #669