benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0

dada2 pipeline for metabarcoding #1520

Closed safiqu closed 1 month ago

safiqu commented 2 years ago

dada2 pipeline for metabarcoding (CO1)

Thanks a lot for your dada2 pipeline, this is really cool. Our research group has been using it for our human microbiome projects.

We want to use your dada2 pipeline for our metabarcoding projects. Mostly we are sequencing 658 bp long CO1 genes for species identification. We need to use a reference data set from the Barcode of Life (BOLD) data release (https://v3.boldsystems.org/index.php/datarelease).

Do you have any pipeline that I can follow?

I am facing a problem with this step.

Assign taxonomy

    tax <- assignTaxonomy(IMGM_nochim, refFasta = "/Users/sislam/Desktop/Metagenomics-R/iBOL.tsv.zip", multithread = T)
    tax <- addSpecies(IMGM_tax, "/Users/sislam/Desktop/Metagenomics-R/ iBOL.tsv.zip ")
    colnames(tax) <- c("Kingdom", "Phylum", "Class", "Order", "Family", "Genus", "Species")
    taxa.print <- tax
    rownames(taxa.print) <- NULL
    head(taxa.print)

Somehow this is not working. Do you have any suggestions that I can follow? Thanks a lot for your contribution to science.

Kind regards, Safi

benjjneb commented 2 years ago

assignTaxonomy requires the reference data set to have a specific format, that is described here: https://benjjneb.github.io/dada2/training.html#formatting-custom-databases

I am not myself familiar with the BOLD files you are working with, but the fact that it has a .tsv file extension strongly suggests that it is not in the fasta format expected by assignTaxonomy. So, I'd take a look at the format described at the link, and then see how to convert the BOLD files into that format.
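As a rough illustration of that conversion, here is a minimal base-R sketch that turns a table of taxonomy columns plus a sequence column into the `>Level1;Level2;...;` header layout shown at the linked page. The column names, the helper `to_training_fasta`, and the short sequences are all made-up placeholders, not the actual BOLD column names or real barcodes; a real file would first be unzipped and read with something like `read.delim()`, and its columns mapped by hand.

```r
# Hypothetical stand-in for a table read from a tab-separated reference file.
# Column names and sequences are assumptions for illustration only.
ref <- data.frame(
  kingdom  = c("Animalia", "Animalia"),
  phylum   = c("Arthropoda", "Arthropoda"),
  class    = c("Insecta", "Insecta"),
  order    = c("Diptera", "Lepidoptera"),
  family   = c("Culicidae", "Nymphalidae"),
  genus    = c("Aedes", "Danaus"),
  sequence = c("ACTTTATATTTTATTTTTGG", "AACTTTATATTTTATTTTCG"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: build one header line and one sequence line per row.
to_training_fasta <- function(df, ranks, seq_col = "sequence") {
  # Join the taxonomic levels with ";" and add the leading ">" and trailing ";"
  tax <- do.call(paste, c(df[ranks], sep = ";"))
  headers <- paste0(">", tax, ";")
  # rbind + as.vector interleaves: header 1, sequence 1, header 2, sequence 2, ...
  as.vector(rbind(headers, toupper(df[[seq_col]])))
}

ranks <- c("kingdom", "phylum", "class", "order", "family", "genus")
fasta_lines <- to_training_fasta(ref, ranks)

# Write to a temporary location; this path is then what refFasta would point to.
out_file <- file.path(tempdir(), "custom_ref.fasta")
writeLines(fasta_lines, out_file)
```

The resulting file, rather than the original .tsv.zip, is the kind of input that would be passed as `refFasta`. Rows with missing ranks or non-nucleotide characters in the sequence would need extra handling that this sketch leaves out.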

Sbu211 commented 2 years ago

Hi, thank you for the dada2 pipeline. It has really given me hope of easily analyzing my HTS data. I would like to ask whether assignTaxonomy would accept the GenBank database (COI database, i.e. nt.tar.gz) as input, meaning the ones in fasta format on the FTP site? In addition, is it possible to add sequences generated by conventional Sanger sequencing to this database locally, to make a customized database?

Thank you

benjjneb commented 2 years ago

@Sbu211 As mentioned above, the format required by assignTaxonomy is described here: https://benjjneb.github.io/dada2/training.html#formatting-custom-databases

You can certainly add sequences to make a custom database, as long as the combined fasta file conforms to that format.
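A minimal sketch of what combining could look like in base R, assuming both sets of records already use the `>Level1;Level2;...;` header layout. The records below are made-up placeholders, and the consistency check on the number of levels is just a sanity check one might want, not something stated in this thread as a requirement.

```r
# Placeholder records standing in for an existing reference file and for
# locally generated Sanger sequences (in practice, each from readLines()).
existing <- c(
  ">Animalia;Arthropoda;Insecta;Diptera;Culicidae;Aedes;",
  "ACTTTATATTTTATTTTTGG"
)
sanger <- c(
  ">Animalia;Arthropoda;Insecta;Lepidoptera;Nymphalidae;Danaus;",
  "AACTTTATATTTTATTTTCG"
)

# Concatenating the lines is all that combining two fasta files amounts to.
combined <- c(existing, sanger)

# Sanity checks on the header lines before writing the combined file.
headers  <- combined[startsWith(combined, ">")]
n_levels <- lengths(strsplit(sub("^>", "", headers), ";", fixed = TRUE))
stopifnot(
  all(endsWith(headers, ";")),      # every header ends with ";"
  length(unique(n_levels)) == 1     # same number of levels in every header
)

combined_file <- file.path(tempdir(), "combined_ref.fasta")
writeLines(combined, combined_file)
```

If a check fails, the offending headers can be inspected with `headers[!endsWith(headers, ";")]` before the file is used as a reference.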