benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0

Formatting custom databases #581

Closed by Sebastian-Mynott 5 years ago

Sebastian-Mynott commented 5 years ago

Thank you very much for dada2! It is my new favourite method for working with MiSeq reads!

I do environmental metabarcoding so I need to create custom databases for community analysis. Would you have any tips/instructions/tutorial on how to download and format custom databases within R?

Many thanks!

benjjneb commented 5 years ago

Glad it's helpful for you!

Have you seen the formatting custom databases entry on our web site?

Does that answer your question? Or is it something not covered there?

Sebastian-Mynott commented 5 years ago

Yes, I've seen that entry. On that same page you have links to collections for SILVA and others. I'm looking for instructions on formatting data from other sources, such as NCBI and BOLD, for example.

benjjneb commented 5 years ago

Basically, whatever the source, it needs to be distilled into a FASTA file with the expected format:

>Level1;Level2;Level3;Level4;Level5;Level6;
ACCTAGAAAGTCGTAGATCGAAGTTGAAGCATCGCCCGATGATCGTCTGAAGCTGTAGCATGAGTCGATTTTCACATTCAGGGATACCATAGGATAC
>Level1;Level2;Level3;Level4;Level5;
CGCTAGAAAGTCGTAGAAGGCTCGGAGGTTTGAAGCATCGCCCGATGGGATCTCGTTGCTGTAGCATGAGTACGGACATTCAGGGATCATAGGATAC

As for how to do that exactly, it will involve some parsing and reformatting that will depend on what format the database you are trying to use is in, so hard to give a simple answer there. That part can be done with R or non-R tools, such as shell scripts or Python.
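As a rough illustration of that reformatting step, here is a minimal Python sketch. It assumes the source database has already been exported to a tab-separated table with one column per taxonomic rank plus a sequence column. The column names (`kingdom` ... `genus`, `sequence`), the helper `to_training_fasta`, and the example rows are all hypothetical and not something NCBI or BOLD is guaranteed to provide in this form. The sketch also assumes that a lineage should be cut at the first empty rank, which produces headers of varying depth like the two records shown above.

```python
import csv
import io

# Hypothetical column names for illustration; rename to match your export.
RANKS = ["kingdom", "phylum", "class", "order", "family", "genus"]
SEQ_COLUMN = "sequence"


def to_training_fasta(rows, ranks=RANKS, seq_column=SEQ_COLUMN):
    """Return FASTA text with '>Level1;Level2;...;' headers, one record per row."""
    records = []
    for row in rows:
        levels = []
        for rank in ranks:
            name = (row.get(rank) or "").strip()
            if not name:
                break  # assumption: the lineage ends at the first empty rank
            levels.append(name)
        # Assumption: gap characters and spaces are not wanted in the output.
        seq = (row.get(seq_column) or "").replace("-", "").replace(" ", "")
        seq = seq.strip().upper()
        if not levels or not seq:
            continue  # skip rows with no taxonomy or no sequence
        records.append(">" + ";".join(levels) + ";\n" + seq + "\n")
    return "".join(records)


# Tiny made-up table standing in for a parsed database export.
example_tsv = (
    "kingdom\tphylum\tclass\torder\tfamily\tgenus\tsequence\n"
    "Animalia\tArthropoda\tInsecta\tDiptera\tCulicidae\tAedes\tacct-agaaagt\n"
    "Animalia\tMollusca\tGastropoda\t\t\t\tCGCTAGAAAG\n"
)
rows = csv.DictReader(io.StringIO(example_tsv), delimiter="\t")
fasta_text = to_training_fasta(rows)
print(fasta_text, end="")
```

For a real export, replace the in-memory example with a file opened via `open(path, newline="")`, and write `fasta_text` to a `.fasta` file. Getting from the raw NCBI or BOLD download to such a table is the database-specific part and is not covered by this sketch.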