benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0
469 stars 142 forks source link

how to extract all merged reads at once #1427

Closed marco2985 closed 5 months ago

marco2985 commented 3 years ago

I would like to extract all merged sequences into a fasta file. I have found information on how to extract all unique sequences by using this code uniques ToFasta(getUniques(seqtab), fout="F:/MHC_VR/Dada2/uniqueSeqs.fasta", ids=paste0("Seq", seq(length(getUniques(seqtab))))) but I would like to extract all of them at once and tranformed them into a fasta file (all copies). Do you think that is possible?

Can you help me with that? Thank you very much in advance

benjjneb commented 3 years ago

You can use the writeFasta function for that. You just give the function the vector of sequences and the file to output to. Note that you will need to deefine the id lines as the names of the vector. E.g.

library(ShortRead) ## Load this package first for this code to work consistently
library(dada2)
merged.seqs <- mergers$sequence  # Or wherever the list of seqs to output is
names(merged.seqs) <- paste0("ASV", seq_along(merged.seqs)) # e.g.
writeFasta(merged.seqs, "path/to/output.fasta")
marco2985 commented 3 years ago

That is very useful. Thank you very much. I am not using DADA for microbiome analyses this time. Therefore, I do not have "ASV", Is that a problem. Can I change ASV by MHC for example?

Actually, I would like to ask another related thing: I am working with around 300bp MHC sequences (major histocompatibility complex class II exon2). I had some problems with the MIseq run and when I combine the reads with other softwares like FLASH, I can only combine 32% of the reads. The IT bioinformaticians from my university have suggested to used DADA2 for combining forward and reverse reads by using "justconcatenated=TRUE" option and I think it is working better than by using the other software. I have done all this without demultiplexing the reads. I thought that I could extract all the merged reads and then dimultiplex. However, they have recommended to me to do the other way around. What is your suggestion? and Is it a possible way of demultiplexing by using DADA2. The problem is that my sequences has NNN- 8bp barcode- primer. Samples are dual index. Do you think I can do it with DADA2?

Thank you very much, Maria//

El vie, 15 oct 2021 a las 23:17, Benjamin Callahan (< @.***>) escribió:

You can use the writeFasta function for that. You just give the function the vector of sequences and the file to output to. Note that you will need to deefine the id lines as the names of the vector. E.g.

library(ShortRead) ## Load this package first for this code to work consistently library(dada2) merged.seqs <- mergers$sequence # Or wherever the list of seqs to output is names(merged.seqs) <- paste0("ASV", seq_along(merged.seqs)) # e.g. writeFasta(merged.seqs, "path/to/output.fasta")

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/benjjneb/dada2/issues/1427#issuecomment-944664579, or unsubscribe https://github.com/notifications/unsubscribe-auth/AS62L5X2HPXBI6YTEKJOGV3UHCK45ANCNFSM5F6RAHGA . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

marco2985 commented 3 years ago

Hi again, I have tried the code: library(ShortRead) ## Load this package first for this code to work consistently library(dada2) merged.seqs <- mergers$sequence # Or wherever the list of seqs to output is names(merged.seqs) <- paste0("ASV", seq_along(merged.seqs)) # e.g. writeFasta(merged.seqs, "F:/MHC_VR/Dada2/output.fasta")

and I get an error: Error in (function (classes, fdef, mtable) : unable to find an inherited method for function ‘writeFasta’ for signature ‘"list"’

Do you know how to solve this error?. I have also tried to extract the table by using:

write.table(merged.seqs,** "dada_seqs.txt",quote=FALSE)

and I get a table with unique sequences. Is there a way to see the number of specific amplicon that I have?. I would like to extract this information in the table as well. Is this possible?

I have used the following code but when I open the table is not organized:

seqtab <- makeSequenceTable(mergers) dim(seqtab) table(nchar(getSequences(seqtab))) write.table(seqtab, "dada_seqs.txt",quote=FALSE)

I would like to have seqName, sequence and number of amplicons Thank you very much, Maria//

benjjneb commented 3 years ago

OK, it's a bit confusing since you aren't running a normal DADA2 workflow, but... it seems like you can make a sequence table, so the easiest is to work from that defined object.

It sounds like what you want is a fasta with the unique sequences, and id lines that also specify a sequence identifier and the sequence abundance. In that case, uniquesToFasta would be the easiest option, as it will create that sort of fasta format by default, using a sqXXX;size=YYY; format id line. "Size" here means the abundance of that sequence. So...

uniquesToFasta(seqtab, "path/to/output.fasta")

should do basically what you want by default, as long as the specific format of the id line is OK.

The final question is... do you want a fasta file for each sample? If so, you should run uniquesToFasta on the mergers object from each sample (the output of mergePairs). Something like:

for(i in seq_along(samples)) {
  uniquesToFasta(mergers[[i]], paste0("path/to/sample_", i, ".fasta"))
}
marco2985 commented 3 years ago

Hi,

I am trying to run everything that you suggested but anything is working.

When I try to run:

showMethods(writeFasta) object="ShortRead" merged.seqs <- mergers$sequence # Or wherever the list of seqs to output is names(merged.seqs) <- paste0("seq", seq_along(merged.seqs)) # e.g. writeFasta(merged.seqs, "F:/MHC_VR/Dada2/output.fasta")

I get this error and the fasta file is not longer written: Error in (function (classes, fdef, mtable) : unable to find an inherited method for function ‘writeFasta’ for signature ‘"list"’

Maybe, I have not explain correctly myself. What I would like to get is all the fasta files from the mergers, not only the uniques. I want all of them. For that, I have tried to used the code that you provide above:

for(i in seqalong(samples)) { uniquesToFasta(mergers[[i]], paste0("path/to/sample", i, ".fasta")) }

This is what I have run: Do I have to specify "samples" somehow?

_for(i in seq_along(samples)) { uniquesToFasta(mergers[[i]], paste0("F:/MHCVR/Dada2/output", i, ".fasta")) }

Error in getUniques(unqs, collapse = FALSE) : Unrecognized format: Requires named integer vector, fastq filename, dada-class, derep-class, sequence matrix, or a data.frame with $sequence and $abundance columns.

I hope you can help me, Thank you, Cheers, Maria//

ArmOli commented 3 years ago

Hi Marco, I wrote this code to export the FASTA file of the unique sequences that resulted after the merger step. it works with the library seqinr. install.packages(seqinr) library(seqinr)

nrow<-nrow(mergers) sequencename<- paste(sample.names, "dada2out", 1:nrow) fastaname<- paste(Sys.Date(), sample.names, "dada2out", "merged", "fasta", sep = ".")

write.fasta(as.list(mergers$sequence), sequencename, fastaname, open = "w", nbchar = 60, as.string = FALSE)

it returns a .fasta file, saved in your working directory, arranged in this way:

ID386t1-R2-IgA dada2out 1 CGTATCGCCTCCCTCGCGCCAACAACCAATATGCATATCTTGGGGGCAGGGGCTGCGGTCAGGGA..... ID386t1-R2-IgA dada2out 2 CGTATCGCCTCCCTCGCGCCAACAACCAATATGCATATCTTGGGGGCAGGGGCTGCGGTCAGGGA..... .........

hopefully it will be helpful

marco2985 commented 3 years ago

Thank you very much both ways were working fine. But I am running into troubles now again. In every pool I have 32 samples. I dimultiplex samples using stacks and it worked very good at the end. I would like to extract a table with $sequence $abundance per individual and then combine all (32 individuals//all pools). I have been able to write a csv file with the information per sample:

mergers1<-as.matrix(mergers$x2109) head(mergers1) write.csv(mergers1,'x2109.csv', row.names = FALSE)

This is just for sample x2109

image

But the problem is I would have to extract 32 files independently and I have 6 different pools. Therefore, I would have to do it manually 32timesx6 and write manually the most likely two alleles. Is there a way to extract a matrix for every single sample combined with the $abundance and $sequence?

Thank you very much you would save a lot of my time if this can be done in a simple way :). I know that DADA2 is mainly for 16S amplicon data. But actually it works very good for difficult amplicon run of other target genes such as MHC.

Thank you again, Cheers, Maria//

benjjneb commented 2 years ago

@marco2985 It sounds like what you need is a for loop, so you can loop through the different file names (and perhaps another to loop through the pools). This can all be done with base R commands, but the dada2 package doesn't really come in.