Adding UNITE ITS special database

ropolomx commented 5 years ago

Hello, along the same lines as issue #94, I was hoping that the UNITE ITS database could be included as a special database in Kraken2. The FASTA headers contain the scientific name, the GenBank accession number, a UNITE identifier, and the taxonomic lineage. This taxonomy could potentially be parsed from the headers and formatted in a similar way to the 16S databases. An example of one of the headers is:

>Symbiotaphrina_buchneri|DQ248313|SH1641879.08FU|reps|k__Fungi;p__Ascomycota;c__Xylonomycetes;o__Symbiotaphrinales;f__Symbiotaphrinaceae;g__Symbiotaphrina;s__Symbiotaphrina_buchneri
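
To make the header layout concrete, here is a minimal shell sketch that splits one such header into its five pipe-separated fields. The function name parse_unite_header and the field labels are made up for illustration; only the field order is taken from the example header above.

```shell
# Hypothetical helper (not part of Kraken2 or UNITE): print the five
# pipe-separated fields of one UNITE header, one per line.
parse_unite_header() {
  printf '%s\n' "$1" | awk -F'|' '{
    sub(/^>/, "", $1)   # drop the leading ">" if present
    printf "name=%s\naccession=%s\nsh_id=%s\ntype=%s\nlineage=%s\n", $1, $2, $3, $4, $5
  }'
}

parse_unite_header '>Symbiotaphrina_buchneri|DQ248313|SH1641879.08FU|reps|k__Fungi;p__Ascomycota;c__Xylonomycetes;o__Symbiotaphrinales;f__Symbiotaphrinaceae;g__Symbiotaphrina;s__Symbiotaphrina_buchneri'
```

The accession and the lineage are the two fields that matter for building a taxonomy; the rest is descriptive.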

ropolomx commented 5 years ago

I just realized that the included GenBank accession number can be used to link each sequence to the NCBI taxonomy. I will give this a try.

zoey-rw commented 4 years ago

@ropolomx I know this issue is over a year old, but if possible, would you mind sharing if/how you ended up doing this?

vmikk commented 4 years ago

While it is certainly possible to map sequence IDs to NCBI taxIDs, the taxonomy recorded in NCBI can be incorrect or less specific. In the UNITE sequence headers the taxonomic lineage has been re-annotated. For example, HM044632 is "Uncultured fungus" in NCBI, while in UNITE it is annotated to the genus level (Meliniomyces).

Therefore, it is better to find the corresponding taxid for the UNITE taxonomy string (not for the sequence ID). I've started doing so, but I have no idea what to do with taxa that are missing from NCBI (the names.dmp and nodes.dmp files would also have to be updated somehow).
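
A rough sketch of that idea, assuming an NCBI-style names.dmp (fields separated by tab, pipe, tab): look up each rank of the UNITE lineage by scientific name and flag the ones NCBI does not know. lineage_to_taxids is a hypothetical helper, the names.dmp below is a two-line toy file, and homonyms (the same name under different taxids) are not handled.

```shell
# Hypothetical helper: print "<rank string><TAB><taxid or MISSING>" for each
# rank of a UNITE lineage. $1 = lineage, $2 = path to an NCBI-style names.dmp.
lineage_to_taxids() {
  awk -F '\t[|]\t' -v lineage="$1" '
    # Remember the taxid of every scientific name in names.dmp.
    $4 ~ /^scientific name/ { taxid[$2] = $1 }
    END {
      n = split(lineage, ranks, ";")
      for (i = 1; i <= n; i++) {
        name = ranks[i]
        sub(/^[a-z]__/, "", name)   # drop the k__/p__/... rank prefix
        gsub(/_/, " ", name)        # assumption: underscores stand for spaces
        print ranks[i] "\t" ((name in taxid) ? taxid[name] : "MISSING")
      }
    }
  ' "$2"
}

# Toy names.dmp with only two entries, for illustration.
tmp=$(mktemp -d)
printf '1\t|\troot\t|\t\t|\tscientific name\t|\n' > "$tmp/names.dmp"
printf '4751\t|\tFungi\t|\t\t|\tscientific name\t|\n' >> "$tmp/names.dmp"
lineage_to_taxids 'k__Fungi;g__Meliniomyces' "$tmp/names.dmp"
rm -r "$tmp"
```

Ranks reported as MISSING are the ones that would need new entries in names.dmp and nodes.dmp.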

wqssf102 commented 3 years ago

This is my method: download the UNITE database from https://files.plutof.ut.ee/public/orig/E8/83/E883EB19E3EA7B64C1F652521301239831FAFE0BFF015C9E2B4786DC0976C0FC.gz and rename it (to 'unite_8.2.fasta' in my case). Then I run:

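mkdir -p ./db
# Rewrite headers of the form ";tax=d:Fungi,p:...,c:...,s:..." into the
# tab-separated "Lineage=Root;rootrank;Fungi;domain;..." form read by
# build_rdp_taxonomy.pl: drop the species, append the rank name after each
# taxon, then turn ",p:" etc. into ";p__".
sed 's/;tax=d:Fungi/\tLineage=Root;rootrank;Fungi;domain/g' unite_8.2.fasta \
|sed 's/,s:.*//g' \
|sed 's/\(p:[a-zA-Z0-9_ -]*\)/\1;phylum/g' \
|sed 's/\(c:[a-zA-Z0-9_ -]*\)/\1;class/g' \
|sed 's/\(o:[a-zA-Z0-9_ -]*\)/\1;order/g' \
|sed 's/\(f:[a-zA-Z0-9_ -]*\)/\1;family/g' \
|sed 's/\(g:[a-zA-Z0-9_ -]*\)/\1;genus/g' \
|sed 's/,p:/;p__/g' \
|sed 's/,c:/;c__/g' \
|sed 's/,o:/;o__/g' \
|sed 's/,f:/;f__/g' \
|sed 's/,g:/;g__/g' > ./db/kunite.fasta
# Build names.dmp, nodes.dmp and seqid2taxid.map from the rewritten headers.
perl build_rdp_taxonomy.pl ./db/kunite.fasta
# Move everything into the database directory, then build.
mkdir -p ./db/library
mv ./db/kunite.fasta ./db/library/unite.fna
mkdir -p ./db/taxonomy
mv names.dmp nodes.dmp ./db/taxonomy
mv seqid2taxid.map ./db
kraken2-build --build --db ./db --threads 16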

Note that build_rdp_taxonomy.pl comes with Kraken2. If you find any defects in this method, please let me know. Thank you.
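
One check that may help before running kraken2-build: make sure every sequence ID in the library FASTA has an entry in seqid2taxid.map. This is only a sketch. It assumes the map has two tab-separated columns (sequence ID, taxid) and that the sequence ID is the first whitespace-delimited word of each header; missing_from_map is a hypothetical helper and the sample data (including the taxid) is made up.

```shell
# Hypothetical check: list sequence IDs from a FASTA file that have no entry
# in the map. $1 = FASTA file, $2 = seqid2taxid.map.
missing_from_map() {
  awk '
    FILENAME == ARGV[1] { seen[$1] = 1; next }   # first operand: the map
    /^>/ { id = substr($1, 2); if (!(id in seen)) print id }
  ' "$2" "$1"
}

# Toy data: two sequences, only one of them mapped.
tmp=$(mktemp -d)
printf '>DQ248313 x\tLineage=Root;rootrank;Fungi;domain\nACGT\n>HM100661 y\nACGT\n' > "$tmp/unite.fna"
printf 'DQ248313\t2\n' > "$tmp/seqid2taxid.map"
missing_from_map "$tmp/unite.fna" "$tmp/seqid2taxid.map"   # prints HM100661
rm -r "$tmp"
```

If this prints many IDs, the header rewriting probably did not match the input file.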

teixeirainpp commented 3 years ago

Hi! Thank you for sharing your code.

I've been trying to perform taxonomic assignments of ITS sequences using UNITE and Kraken2. I tried your solution and it seems that UNITE is properly formatted for Kraken2, but it can't classify any reads. I don't know whether it is a problem with versions or something else. Has anyone tried it recently?

Thank you

wqssf102 commented 3 years ago

First, I get the FASTA file from the "PlutoF biodiversity platform" website, and then run the code as follows:

mkdir -p ./db

perl build_rdp_taxonomy.pl ./db/kunite.fasta

mkdir -p ./db/library
mv ./db/kunite.fasta ./db/library/unite.fna
mkdir -p ./db/taxonomy
mv names.dmp nodes.dmp ./db/taxonomy
mv seqid2taxid.map ./db
kraken2-build --build --db /mnt/d/krakenunite/db83 --threads 16

(Reply sent by email, quoting the notification "Re: [DerrickWood/kraken2] Adding UNITE ITS special database (#97)" sent on Saturday, July 31, 2021, at 1:10 AM.)


teixeirainpp commented 3 years ago

Thank you, I can't make it work with your solution either.

The headers are already as:

Thelephoraceae_sp|HM100661|SH1140862.08FU|reps_singleton|k__Fungi;p__Basidiomycota;c__Agaricomycetes;o__Thelephorales;f__Thelephoraceae;g__unidentified;s__Thelephoraceae_sp

So the first set of sed commands doesn't seem to do anything.
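
A quick way to tell the two header styles apart, sketched in shell. It assumes the sed commands target headers containing ";tax=d:" (as their patterns suggest), while the file here uses the pipe-separated "k__" style. unite_header_style and the labels it prints are invented for this example, and the sample header is a toy.

```shell
# Hypothetical helper: guess the header style of a UNITE FASTA file from its
# first header line. $1 = path to the FASTA file.
unite_header_style() {
  awk '
    /^>/ {
      if (index($0, ";tax=d:"))   { print "tax-annotated" }
      else if (index($0, "|k__")) { print "pipe-separated" }
      else                        { print "unknown" }
      found = 1
      exit
    }
    END { if (!found) print "unknown" }
  ' "$1"
}

# Toy file using the header style shown above.
tmp=$(mktemp -d)
printf '>Thelephoraceae_sp|HM100661|SH1140862.08FU|reps_singleton|k__Fungi;p__Basidiomycota\nACGT\n' > "$tmp/unite.fasta"
unite_header_style "$tmp/unite.fasta"   # prints pipe-separated
rm -r "$tmp"
```

A "pipe-separated" result would explain why substitutions written for ";tax=d:Fungi" and ",p:" leave the file unchanged.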

An example entry from seqid2taxid is: Aa_paleacea|KX421909|SH1621287.08FU|reps|k__Viridiplantae;p__Anthophyta;c__Monocotyledonae;o__Asparagales;f__Orchidaceae;g__Aa;s__Aa_paleacea

I tried changing the headers to have just the accession number and the taxonomic string separated by a space, as well as other variations, but it always results in those 'empty' *.dmp files.
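
The sed commands above produce headers with a tab followed by "Lineage=Root;rootrank;", so it seems reasonable to assume that build_rdp_taxonomy.pl needs that layout rather than a space-separated one. The sketch below only counts how many headers have it; count_rdp_headers is a hypothetical helper, the sample file is a toy, and the assumption about the Perl script is not verified here.

```shell
# Hypothetical check: count headers, and how many of them contain a tab
# followed by "Lineage=Root;rootrank;". $1 = path to the FASTA file.
count_rdp_headers() {
  awk '
    /^>/ { total++; if (index($0, "\tLineage=Root;rootrank;")) ok++ }
    END  { printf "%d of %d headers look RDP-style\n", ok, total }
  ' "$1"
}

# Toy file: one converted header and one header separated by a space only.
tmp=$(mktemp -d)
printf '>KX421909\tLineage=Root;rootrank;Fungi;domain\nACGT\n' > "$tmp/kunite.fasta"
printf '>HM100661 k__Fungi;p__Basidiomycota\nACGT\n' >> "$tmp/kunite.fasta"
count_rdp_headers "$tmp/kunite.fasta"   # prints 1 of 2 headers look RDP-style
rm -r "$tmp"
```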

Any idea about what I am missing?

Thank you

teixeirainpp commented 3 years ago

Hi, I followed your method but did the following:

It seems to work fine.

wqssf102 commented 3 years ago


mkdir -p ./db

# Same header rewrite as before, plus a final step that strips a trailing ";".
sed 's/;tax=d:Fungi/\tLineage=Root;rootrank;Fungi;domain/g' unite8.3.fasta \
|sed 's/,s:.*//g' \
|sed 's/\(p:[a-zA-Z0-9_ -]*\)/\1;phylum/g' \
|sed 's/\(c:[a-zA-Z0-9_ -]*\)/\1;class/g' \
|sed 's/\(o:[a-zA-Z0-9_ -]*\)/\1;order/g' \
|sed 's/\(f:[a-zA-Z0-9_ -]*\)/\1;family/g' \
|sed 's/\(g:[a-zA-Z0-9_ -]*\)/\1;genus/g' \
|sed 's/,p:/;p__/g' \
|sed 's/,c:/;c__/g' \
|sed 's/,o:/;o__/g' \
|sed 's/,f:/;f__/g' \
|sed 's/,g:/;g__/g' \
|sed 's/;$//g' > ./db/kunite.fasta

perl build_rdp_taxonomy.pl ./db/kunite.fasta

mkdir -p ./db/library
mv ./db/kunite.fasta ./db/library/unite.fna
mkdir -p ./db/taxonomy
mv names.dmp nodes.dmp ./db/taxonomy
mv seqid2taxid.map ./db

kyleabeauchamp commented 2 years ago

Anyone have any updates on this? I'd love to use the UNITE DB; I saw it used in this paper: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8565518/

kyleabeauchamp commented 2 years ago

Or are there any known downloadable UNITE-based kraken2 databases online?

tmbogus commented 2 years ago

The problem with kraken2-build seems to be the use of threads, maybe because of the size of the DB. Not specifying the number of threads did it: kraken2-build --build --db Unite_8.3_Fungi_kraken2

This Perl refactor of the header-rewriting code did the job as well:

perl -pe '
# Header becomes ">accession name; SH_id".
s/>([^\|]+)\|([^\|]+)\|([^\|]+)/>$2 $1; $3/;
# Replace the sequence-type field and the kingdom with the lineage root.
s/\|re[a-z]s_*[a-z]*\|k__Fungi/\tLineage=Root;rootrank;Fungi;domain/g;
s/\|/ /g;
# Drop the species, then append the rank name after each remaining taxon.
s/;s__.*//g;
s/p__([a-zA-Z0-9_ -]*)/$1;phylum/g;
s/c__([a-zA-Z0-9_ -]*)/$1;class/g;
s/o__([a-zA-Z0-9_ -]*)/$1;order/g;
s/f__([a-zA-Z0-9_ -]*)/$1;family/g;
s/g__([a-zA-Z0-9_ -]*)/$1;genus/g;
s/;$//g' sh_general_release_dynamic_10.05.2021.fasta > sh_general_release_dynamic_10.05.2021.kraken.fasta

davidbio commented 2 weeks ago

Hi everyone,

I was also trying to use the UNITE database with Kraken2 and I ended up modifying the Greengenes script 16S_gg_installation.sh. Please find the script attached. I successfully created 2 databases with files from the shown categories (version 10).

The attached script should be placed in the kraken2 installation folder, and you should add that folder to your PATH variable.

export PATH=$PATH:/path/to/k2-installation
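
A small sanity check for that PATH change, as a sketch: missing_tools is a hypothetical helper that prints every command it cannot find. kraken2 and kraken2-build are the executables it is asked about here.

```shell
# Hypothetical helper: print each given command that is not found on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Prints nothing if both executables are reachable.
missing_tools kraken2 kraken2-build
```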

kraken2-unite-build.zip

This should work out of the box. I would appreciate any feedback.

DerrickWood / kraken2

Adding UNITE ITS special database #97