benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0
459 stars 142 forks source link

Sequence Extraction #1321

Closed markdjohnson closed 3 years ago

markdjohnson commented 3 years ago

Hello,

I am using dada2 for a metabarcoding project and I was curious if there was a way for me to extract the sequences identified during the tracking step of the dada2 pipeline. For instance, I would like to get those sequences that are under the nonchim catagory and then run a blast on those now filtered sequences. How can I extract them?

Any help would be apprecaited! Mark Johnson

input filtered denoisedF denoisedR merged nonchim 21-1-trnL 6242 5657 5542 5555 5509 5508 21-2-trnL 8631 8104 8050 8062 8039 8012 21-3-trnL 11648 10953 10839 10851 10771 10605 21-4-trnL 9738 9120 8936 8986 8771 8731 21-5-trnL 7388 6925 6824 6866 6700 6535 21-6-trnL 7120 6721 6621 6634 6531 6269

track

benjjneb commented 3 years ago

Identifying the provenance of each denoised read is fairly straightforward, up until merging (where it is still completely possible, just a bit harder). See this issues and those linked for some guidance: https://github.com/benjjneb/dada2/issues/354

markdjohnson commented 3 years ago

Thank you so much, I really appreciate it. I'm about to go into the field for a few days so I may comment again next week if I am still having issues.

Mark

markdjohnson commented 3 years ago

Hello Ben,

So I wanted to clarify my request a little bit because I am still having a bit of trouble. So down below is a piece of the summary table from the DADA2 pipeline. I would like to extract only the 5508 sequences shown under the nonchim category. These sequences are the filtered, denoised, merged, and nonchim sequences I would like to run a blast on in genbank. Is there a way to extract these 5508 successfully filtered sequences. Thank you so much for your help, I really appreciate you taking the time to answer questions like these. input filtered denoisedF denoisedR merged nonchim Sample 1 6242 5657 5542 5555 5509 5508

Mark

benjjneb commented 3 years ago

Ah, you just want to get the sequences in a format that is usable outside R, like BLAST?

library(dada2)
sq <- getSequences(seqtab.nochim)
writeFasta(sq, "path/to/output.fa")

And you should have those sequence in fasta format.

markdjohnson commented 3 years ago

Hi Ben,

When I did the sq <- getSequences(seqtab.nochim) it gave a result that had only621 sequences. Based on the table posted below, I would have expected to get 306238 sequences from all my samples combined in the nonchim row. Additionally, when I tried to do the writeFasta it told me it couldn't find the writeFasta function. Any help you could provide would be greatly appreciated. Thank you so much for continuing to answer my questions.

      input filtered denoisedF denoisedR merged nonchim

21-1-trnL 6242 5657 5542 5555 5509 5508 21-2-trnL 8631 8104 8050 8062 8039 8012 21-3-trnL 11648 10953 10839 10851 10771 10605 21-4-trnL 9738 9120 8936 8986 8771 8731 21-5-trnL 7388 6925 6824 6866 6700 6535 21-6-trnL 7120 6721 6621 6634 6531 6269 21-7-trnL 8314 7782 7673 7707 7572 7488 21-8-trnL 8103 7572 7440 7460 7212 7059 21-9-trnL 4615 4161 4025 4034 3835 3738 9-1-trnL 6507 6101 5975 5999 5839 5654 9-2-trnL 6040 5548 5317 5354 5189 5097 9-3-trnL 12217 11507 11367 11401 11283 10999 9-4-trnL 9743 9118 9031 9050 8791 8606 9-5-trnL 9910 8525 8349 8378 8184 8026 9-6-trnL 5904 5547 5487 5497 5350 5246 9-7-trnL 4293 4070 3992 4003 3908 3790 9-8-trnL 8171 7243 7111 7138 6965 6845 9-9-trnL 6832 6457 6397 6382 6236 6164 C1-trnL 8806 7782 7684 7656 7552 7244 C10-trnL 11755 11148 11031 11079 10943 10758 C11-trnL 8960 8513 8366 8380 8225 8174 C12-trnL 10022 9531 9381 9422 9136 9115 C13-trnL 7079 6727 6569 6630 6472 6472 C14-trnL 9020 8601 8465 8482 8292 8284 C15-trnL 9131 8636 8430 8437 8023 8021 C16-trnL 8210 7684 7393 7445 7287 7287 C17-trnL 7233 6877 6780 6808 6683 6467 C18-trnL 11759 11144 10980 11036 10865 10703 C19-trnL 10700 10143 9999 10023 9858 9650 C2-trnL 10426 9654 9507 9515 9372 9156 C20-trnL 6460 6021 5923 5955 5670 5513 C21-trnL 5292 4969 4877 4896 4823 4814 C22-trnL 13226 12564 12407 12440 12096 12039 C3-trnL 11049 9580 9486 9486 9435 9141 C4-trnL 9154 8378 8275 8316 8138 8138 C5-trnL 8176 7497 7427 7439 7357 7178 C6-trnL 9360 8465 8381 8389 8280 8216 C7-trnL 10318 9418 9175 9279 9046 8791 C8-trnL 9885 9089 8993 8990 8879 8878 C9-trnL 8474 8092 7961 7960 7847 7827

benjjneb commented 3 years ago

There is a difference between "reads" and "sequences" here. `getSequences' will extract all the "sequences", i.e. the unique DNA sequences in the sequence table. That number will not be the same as the number of reads, which are the numbers that are being shown in the results posted above, because some sequences are represented by many reads.

Additionally, when I tried to do the writeFasta it told me it couldn't find the writeFasta function.

My fault there, you need to load the "ShortRead" package as well as the "dada2" package to use writeFasta in that way.

markdjohnson commented 3 years ago

Hi Ben,

It looks like I got everything to work, thank you so much for all your help and for running this amazing package!

Mark

On Thu, Apr 22, 2021 at 5:57 PM Benjamin Callahan @.***> wrote:

There is a difference between "reads" and "sequences" here. `getSequences' will extract all the "sequences", i.e. the unique DNA sequences in the sequence table. That number will not be the same as the number of reads, which are the numbers that are being shown in the results posted above, because some sequences are represented by many reads.

Additionally, when I tried to do the writeFasta it told me it couldn't find the writeFasta function.

My fault there, you need to load the "ShortRead" package as well as the "dada2" package to use writeFasta in that way.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/benjjneb/dada2/issues/1321#issuecomment-825233793, or unsubscribe https://github.com/notifications/unsubscribe-auth/ATWFZACETJXAYF4BVHUVWE3TKCSUTANCNFSM427WMMMA .

markdjohnson commented 3 years ago

Hi Ben,

I actually had one hopefully last question for you. Is there a way to see how many reads go into each unique sequence that is extracted. So for example, I have 621 unique sequences, can I see if one of them only had a single read versus one that has 10,000.

Mark

benjjneb commented 3 years ago

In R you just can look at the entry in the sequence table corresponding to that sample and ASV. E.g. for the 12th ASV, seqtab.nochim["C12-trnL", 12] will show you how many times that ASV was seen in that sample. To get the total number across all samples, you can look at the column sums, e.g. colSums(seqtab.nochim) will list the total abundance across all samples for each ASV, in order.