benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0

ASV fasta #803

Closed vcastroagudin closed 5 years ago

vcastroagudin commented 5 years ago

Dear Dr Callahan,

I am relatively new to R and DADA2, so I apologize in advance if my question does not make sense. I am using DADA2 to look for variants in 10 loci, amplified and Illumina-sequenced, in different populations of a plant-pathogenic fungus. I therefore need not only the frequency of each OTU in each population (which I could get by following your tutorial) but also the sequence of each variant and, for each individual, which variant is present at each locus, so that I can determine genotypes. I was wondering if it is possible to output some sort of file containing the ASV sequences (preferably in fasta format, so I can create an alignment) and the samples/individuals in which each ASV is present. I tried to output the ASV table created in the "Construct sequence table" step of the tutorial with write.table, but I cannot get the info I need from it... it is such a massive mess of text.

Thanks in advance for your help,

Vanina Castroagudin - MD, USA.

benjjneb commented 5 years ago

Yes, you can output just about any format you want using R commands, with either base R output or fasta-specific output functions like writeFasta in the dada2 or ShortRead packages.

In short, you'll want to define the sequences and the associated ID strings of the fasta file you want to create. For example, to output all the ASVs in the sequence table, with the ID line of each set to its total abundance:

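library(dada2)
# ASV sequences (the column names of the sequence table)
sq <- getSequences(seqtab)
# ID lines: total abundance of each ASV across all samples
id <- paste0("Abundance=", colSums(seqtab))
names(sq) <- id
# Write a fasta file: names become the ID lines, values the sequences
writeFasta(sq, file="path/to/myasvs.fasta")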

It sounds like you want something on a per-sample basis, but you'll need to be more precise on exactly what you are trying to create, i.e. what information each file contains in terms of the sequences and their associated ID lines.
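As one possible reading of the per-sample request, here is a minimal sketch in base R. It assumes the sequence table is a count matrix with samples as rows and ASV sequences as column names, which is the layout the tutorial's sequence table uses. The toy `seqtab`, the `ASV1`, `ASV2`, ... labels, and the output file names are all hypothetical and only there for illustration. It writes one fasta file of all ASVs and a long table with one row per sample/ASV pair in which the ASV was observed.

```r
# Toy stand-in for the sequence table: rows are samples, columns are ASV
# sequences, entries are read counts (hypothetical data, not real results).
seqtab <- matrix(
  c(10, 0, 5,
     0, 7, 3),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("sampleA", "sampleB"),
                  c("ACGTACGT", "ACGTTCGT", "ACGAACGT"))
)

seqs    <- colnames(seqtab)                   # the ASV sequences
asv_ids <- paste0("ASV", seq_along(seqs))     # short unique labels (assumed naming)

# 1. One fasta file containing every ASV, written with base R only.
#    rbind() + as.vector() interleaves the ">id" lines with the sequences.
fasta_lines <- as.vector(rbind(paste0(">", asv_ids), seqs))
fasta_file  <- file.path(tempdir(), "asvs.fasta")
writeLines(fasta_lines, fasta_file)

# 2. Long table: one row per (sample, ASV) pair with a non-zero count.
present  <- which(seqtab > 0, arr.ind = TRUE)  # two-column matrix of row/col indices
asv_long <- data.frame(
  sample    = rownames(seqtab)[present[, "row"]],
  asv       = asv_ids[present[, "col"]],
  abundance = seqtab[present],
  sequence  = seqs[present[, "col"]],
  stringsAsFactors = FALSE
)
asv_long <- asv_long[order(asv_long$sample, asv_long$asv), ]
csv_file <- file.path(tempdir(), "asv_by_sample.csv")
write.csv(asv_long, csv_file, row.names = FALSE)
```

With a real sequence table in place of the toy matrix, the labels in the fasta file match the `asv` column of the table, so an alignment built from the fasta can be linked back to the samples in which each variant occurs. Whether this is the right layout for genotyping depends on what each file is meant to contain, so treat it as a starting point only.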