flefler / CyanoSeq

CyanoSeq: A curated cyanobacterial 16S rRNA database for next-generation sequencing
Creative Commons Attribution 4.0 International

CyanoSeq V1.3

Current version: 1.3

CyanoSeq is published in the Journal of Phycology: https://doi.org/10.1111/jpy.13335

CyanoSeq is a curated database of cyanobacterial 16S rRNA sequences for taxonomic assignment of metagenomic/metabarcoding/amplicon reads. CyanoSeq is assembled from 16S rRNA sequences found within NCBI, with their taxonomies curated from cyanobacterial taxonomic literature as well as a systematic assessment of uncharacterized cyanobacterial sequences. When possible, the full-length 16S rRNA sequences are provided, allowing several 16S rRNA primer sets to be used for taxonomic assignment of metabarcoding data, as well as de novo phylogenetic reconstruction. The taxonomy of CyanoSeq is meant to reflect the current state of cyanobacterial taxonomy, with curated clades of described and undescribed taxa. A provisional rank was given to those taxa that fell outside of the sensu stricto clade in an attempt to resolve polyphyletic ranks. CyanoSeq does not aim to revise cyanobacterial taxonomy or to become a taxonomic authority; rather, it serves as a starting point to identify and name monophyletic clades that do not belong to any established taxonomic rank and require revision. CyanoSeq currently contains 4174 cyanobacterial sequences and 123 chloroplast and bacterial sequences for use in classifying reads from metabarcoding studies.

This update was done in conjunction with an update to the cyanobacterial taxonomy of ITIS, using resources such as AlgaeBase and CyanoDB as well as recent literature.

Key updates V1.3

Several changes have been made since the last version; all are noted in the change log. A few key points are listed below.

1: In addition to the SILVA 138.2 files, we now include files that use GSR-DB as the bacterial database. If you use this version, please cite the following in addition to CyanoSeq: Leidy-Alejandra G. Molano, Sara Vega-Abellaneda, Chaysavanh Manichanh. GSR-DB: a manually curated and optimized taxonomical database for 16S rRNA amplicon analysis. mSystems (2024) https://doi.org/10.1128/msystems.00950-23

2: The family Prochlorococcaceae has proven to be a headache for curation and now has its taxonomy modeled after GTDB R220 with some minor changes. See the Taxonomy_V1.3.xlsx files for more in-depth information.

3: The number of sequences has been greatly reduced, primarily in overrepresented genera (e.g., Dolichospermum, Prochlorococcus), to reduce redundancy. This should help with classification, especially for the difficult groups (e.g., ADA: Aphanizomenon/Dolichospermum/Anabaena).

4: To facilitate de novo phylogenetic reconstruction, the file NCBI_ClassifiedSeqs.tsv is provided on Zenodo. An example R script for manipulating it can be found here. A script showing how this was done can be found here.

Files

Two fastq.gz files are provided for taxonomic assignment using the "assignTaxonomy" function in DADA2 and IdTaxa classifiers. The IDTAXA files have not been tested; please let me know whether or not they work.

Scripts are now provided to create QIIME2 classifiers, thanks to Lucija Kranjer. The necessary files are provided.

CyanoSeqV1.3_dada2.fastq.gz is the cyanobacterial database, which contains 4174 cyanobacterial sequences with 123 chloroplast and bacterial sequences. This should only be used with cyanobacteria-specific primers (i.e., those described by Nübel et al., 1997).

CyanoSeqV1.3_GSRDB_dada2.fastq.gz is the cyanobacterial database merged with GSR-DB, with the cyanobacterial sequences removed and replaced with those curated here. This can be used with general bacterial primers to characterize the total bacterial community.

CyanoSeqV1.3_SILVA138.2_dada2.fastq.gz is the cyanobacterial database merged with SILVA 138.2, with the cyanobacterial sequences removed and replaced with those curated here. This can be used with general bacterial primers to characterize the total bacterial community.
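Before running a classifier, it can be useful to check that a downloaded reference file contains what you expect. The Python sketch below is not part of CyanoSeq; it is a minimal example that tallies sequences by taxon at a chosen rank. It assumes that the files hold FASTA-style records (despite the .fastq.gz extension) and that each header is a semicolon-delimited list of ranks, which is the reference format DADA2's "assignTaxonomy" expects. The function names `read_fasta`, `split_taxonomy`, and `count_by_rank` are hypothetical helpers written for this illustration.

```python
import gzip
from collections import Counter
from typing import Iterator, List, Tuple


def read_fasta(path: str) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from a plain or gzip-compressed FASTA file."""
    opener = gzip.open if path.endswith(".gz") else open
    header = None
    chunks: List[str] = []
    with opener(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # A new record starts: emit the previous one, if any.
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            else:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_taxonomy(header: str) -> List[str]:
    """Split a semicolon-delimited taxonomy header into its non-empty ranks.

    Assumption: headers look like 'Rank1;Rank2;Rank3;...'.
    """
    return [rank.strip() for rank in header.split(";") if rank.strip()]


def count_by_rank(path: str, rank_index: int = 1) -> Counter:
    """Count sequences per taxon at one rank (0 = first rank in the header)."""
    counts: Counter = Counter()
    for header, _sequence in read_fasta(path):
        ranks = split_taxonomy(header)
        label = ranks[rank_index] if len(ranks) > rank_index else "unassigned"
        counts[label] += 1
    return counts
```

For example, `count_by_rank("CyanoSeqV1.3_dada2.fastq.gz", rank_index=1)` would return a tally keyed by the second rank in each header. If the header layout differs from the assumption above, adjust `split_taxonomy` accordingly.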

FASTA and Newick (.nwk) files for each order are provided to facilitate de novo phylogenetic tree reconstruction for novel sequences and the use of tools such as epa-ng for placement of your ASVs.

Questions, comments, concerns?

Leave a request or start a discussion here