Open ropolomx opened 5 years ago
I just realized that the included Genbank accession number can be merged with the NCBI taxonomy. I will give this a try.
@ropolomx I know this issue is over a year old, but if possible, would you mind sharing if/how you ended up doing this?
While it is certainly possible to map sequence IDs to NCBI taxIDs, the taxonomy recorded in NCBI can be incorrect. In UNITE sequence headers the taxonomic lineage has been re-annotated; e.g. HM044632 is "Uncultured fungus" in NCBI, while in UNITE it is annotated to the genus level (Meliniomyces).
Therefore, it is better to find the corresponding taxid for the UNITE taxonomy string (not for the sequence ID).
I've started doing so, but I have no idea what to do with taxa that are missing in NCBI (so the names.dmp and nodes.dmp files would also have to be updated somehow).
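One way to at least see which UNITE taxa are absent from NCBI is to compare names taken from the UNITE headers against names.dmp. A minimal sketch, assuming the pipe-delimited header layout (name|accession|SH|type|lineage) and the usual names.dmp layout with fields separated by tab, pipe, tab. The helper name missing_genera and the toy input files are made up for illustration, and only genus names are compared:

```shell
# Hypothetical helper (not part of kraken2 or UNITE): print genus names from
# UNITE headers that have no "scientific name" entry in a names.dmp file.
#   $1: NCBI-style names.dmp   $2: UNITE FASTA
missing_genera() {
  awk '
    FNR == NR {                                # first file: names.dmp
      n = split($0, f, "\t[|]\t")              # fields are separated by <TAB>|<TAB>
      sub(/\t[|]$/, "", f[n])                  # drop the trailing <TAB>|
      if (f[4] == "scientific name") known[f[2]] = 1
      next
    }
    /^>/ {                                     # second file: FASTA headers
      split($0, h, "[|]")
      m = split(h[5], taxa, ";")
      for (i = 1; i <= m; i++)
        if (substr(taxa[i], 1, 3) == "g__") {
          genus = substr(taxa[i], 4)
          if (genus != "unidentified" && !(genus in known)) print genus
        }
    }
  ' "$1" "$2" | sort -u
}

# Toy inputs: the taxid 2 below is invented, not a real NCBI taxid.
tmp=$(mktemp -d)
printf '1\t|\troot\t|\t\t|\tscientific name\t|\n' >  "$tmp/names.dmp"
printf '2\t|\tSymbiotaphrina\t|\t\t|\tscientific name\t|\n' >> "$tmp/names.dmp"
printf '%s\n' \
  '>Symbiotaphrina_buchneri|DQ248313|SH1641879.08FU|reps|k__Fungi;p__Ascomycota;c__Xylonomycetes;o__Symbiotaphrinales;f__Symbiotaphrinaceae;g__Symbiotaphrina;s__Symbiotaphrina_buchneri' \
  'ACGT' \
  '>Aa_paleacea|KX421909|SH1621287.08FU|reps|k__Viridiplantae;p__Anthophyta;c__Monocotyledonae;o__Asparagales;f__Orchidaceae;g__Aa;s__Aa_paleacea' \
  'ACGT' > "$tmp/unite.fasta"

missing_genera "$tmp/names.dmp" "$tmp/unite.fasta"   # prints: Aa
```

This only reports the gaps; how to add the missing taxa to names.dmp and nodes.dmp is still the open question.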
This is my method: download the UNITE database from https://files.plutof.ut.ee/public/orig/E8/83/E883EB19E3EA7B64C1F652521301239831FAFE0BFF015C9E2B4786DC0976C0FC.gz and rename it ('unite_8.2.fasta' in my case), then run:
mkdir -p ./db
sed 's/;tax=d:Fungi/\tLineage=Root;rootrank;Fungi;domain/g' unite_8.2.fasta \
|sed 's/,s:*.*//g' \
|sed 's/\(p\:[a-zA-Z0-9_ -]*\)/\1;phylum/g' \
|sed 's/\(c\:[a-zA-Z0-9_ -]*\)/\1;class/g' \
|sed 's/\(o\:[a-zA-Z0-9_ -]*\)/\1;order/g' \
|sed 's/\(f\:[a-zA-Z0-9_ -]*\)/\1;family/g' \
|sed 's/\(g\:[a-zA-Z0-9_ -]*\)/\1;genus/g' \
|sed 's/,p:/;p__/g' \
|sed 's/,c:/;c__/g' \
|sed 's/,o:/;o__/g' \
|sed 's/,f:/;f__/g' \
|sed 's/,g:/;g__/g' > ./db/kunite.fasta
perl build_rdp_taxonomy.pl ./db/kunite.fasta
mkdir -p ./db/library
mv ./db/kunite.fasta ./db/library/unite.fna
mkdir -p ./db/taxonomy
mv names.dmp nodes.dmp ./db/taxonomy
mv seqid2taxid.map ./db
kraken2-build --build --db ./db --threads 16
Note: 'build_rdp_taxonomy.pl' comes from kraken2. If you find any defects in this method, please let me know. Thank you.
Hi! Thank you for sharing your code.
I've been trying to perform taxonomic assignments of ITS sequences using UNITE and Kraken2. I tried your solution and it seems that UNITE is properly formatted for Kraken2; however, it can't classify any reads. I don't know whether it is a problem with versions or something else... Has anyone tried it recently?
Thank you
First, I get the FASTA file from the PlutoF biodiversity platform website, and then run the following commands:
mkdir -p ./db
sed 's/;tax=d:Fungi/\tLineage=Root;rootrank;Fungi;domain/g' unite8.3.fasta \
|sed 's/,s:*.*//g' \
|sed 's/\(p\:[a-zA-Z0-9_ -]*\)/\1;phylum/g' \
|sed 's/\(c\:[a-zA-Z0-9_ -]*\)/\1;class/g' \
|sed 's/\(o\:[a-zA-Z0-9_ -]*\)/\1;order/g' \
|sed 's/\(f\:[a-zA-Z0-9_ -]*\)/\1;family/g' \
|sed 's/\(g\:[a-zA-Z0-9_ -]*\)/\1;genus/g' \
|sed 's/,p:/;p__/g' \
|sed 's/,c:/;c__/g' \
|sed 's/,o:/;o__/g' \
|sed 's/,f:/;f__/g' \
|sed 's/,g:/;g__/g' \
|sed 's/;$//g'> ./db/kunite.fasta
perl build_rdp_taxonomy.pl ./db/kunite.fasta
mkdir -p ./db/library
mv ./db/kunite.fasta ./db/library/unite.fna
mkdir -p ./db/taxonomy
mv names.dmp nodes.dmp ./db/taxonomy
mv seqid2taxid.map ./db
kraken2-build --build --db /mnt/d/krakenunite/db83 --threads 16
Thank you, I can't make it work with your solution either.
The headers are already as:
Thelephoraceae_sp|HM100661|SH1140862.08FU|reps_singleton|k__Fungi;p__Basidiomycota;c__Agaricomycetes;o__Thelephorales;f__Thelephoraceae;g__unidentified;s__Thelephoraceae_sp
So the first set of seds doesn't seem to do anything.
When I run build_rdp_taxonomy.pl, the .dmp files are 'empty':
names.dmp
1 | root | - | scientific name |
nodes.dmp
1 | 1 | no rank | - |
An example entry in seqid2taxid.map is:
Aa_paleacea|KX421909|SH1621287.08FU|reps|k__Viridiplantae;p__Anthophyta;c__Monocotyledonae;o__Asparagales;f__Orchidaceae;g__Aa;s__Aa_paleacea
I tried changing the headers to have just the accession number and the taxonomic string separated by a space, as well as other variations, but it always results in those 'empty' .dmp files.
Any idea about what I am missing?
Thank you
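A names.dmp that contains only the root would be consistent with the script never seeing a lineage it can parse. Judging only from the sed commands earlier in this thread, each header is expected to carry a tab followed by Lineage=Root;rootrank;. A quick check before running build_rdp_taxonomy.pl, assuming that format (count_unconverted is a made-up helper name, and the demo headers are shortened):

```shell
# Hypothetical helper: count header lines that do NOT contain
# "<TAB>Lineage=Root;rootrank;" (the form the sed pipelines above produce).
count_unconverted() {
  awk '/^>/ && $0 !~ /\tLineage=Root;rootrank;/ { n++ } END { print n + 0 }' "$@"
}

# Demo: one untouched UNITE-style header and one converted header.
{
  printf '%s\n' '>Aa_paleacea|KX421909|SH1621287.08FU|reps|k__Viridiplantae;p__Anthophyta' 'ACGT'
  printf '>KX421909\tLineage=Root;rootrank;Viridiplantae;domain;Anthophyta;phylum\nACGT\n'
} | count_unconverted   # prints: 1
```

If the count equals the number of sequences in the file, none of the substitutions matched, which would point at a header format mismatch rather than at kraken2 itself.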
Hi, I followed your method but did the following:
sed 's/k__Fungi/\tLineage=Root;rootrank;Fungi;domain/g' kunite.fasta \
|sed 's/p__//g' \
|sed 's/c__/phylum;/g' \
|sed 's/o__/class;/g' \
|sed 's/f__/order;/g' \
|sed 's/g__/family;/g' \
|sed 's/s__/genus;/g' \
|sed '/^>/s/$/\;species/' \
> un.fa
It seems to work fine.
mkdir -p ./db
sed 's/;tax=d:Fungi/\tLineage=Root;rootrank;Fungi;domain/g' unite8.3.fasta \
|sed 's/,s:*.*//g' \
|sed 's/\(p\:[a-zA-Z0-9_ -]*\)/\1;phylum/g' \
|sed 's/\(c\:[a-zA-Z0-9_ -]*\)/\1;class/g' \
|sed 's/\(o\:[a-zA-Z0-9_ -]*\)/\1;order/g' \
|sed 's/\(f\:[a-zA-Z0-9_ -]*\)/\1;family/g' \
|sed 's/\(g\:[a-zA-Z0-9_ -]*\)/\1;genus/g' \
|sed 's/,p:/;p__/g' \
|sed 's/,c:/;c__/g' \
|sed 's/,o:/;o__/g' \
|sed 's/,f:/;f__/g' \
|sed 's/,g:/;g__/g' \
|sed 's/;$//g'> ./db/kunite.fasta
perl build_rdp_taxonomy.pl ./db/kunite.fasta
mkdir -p ./db/library
mv ./db/kunite.fasta ./db/library/unite.fna
mkdir -p ./db/taxonomy
mv names.dmp nodes.dmp ./db/taxonomy
mv seqid2taxid.map ./db
Anyone have any updates on this? I'd love to use the UNITE DB; I saw it used in this paper: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8565518/
Or are there any known downloadable UNITE-based kraken2 databases online?
The issue with kraken2-build seems to be using threads.
Maybe because of the size of the DB.
Not specifying the number of threads did it:
kraken2-build --build --db Unite_8.3_Fungi_kraken2
This refactor in Perl of the header changing code did the job as well:
perl -pe '
s/>([^\|]+)\|([^\|]+)\|([^\|]+)/>$2 $1; $3/;
s/\|re[a-z]s_*[a-z]*\|k__Fungi/\tLineage=Root;rootrank;Fungi;domain/g;
s/\|/ /g;
s/;s__*.*//g;
s/p__([a-zA-Z0-9_ -]*)/$1;phylum/g;
s/c__([a-zA-Z0-9_ -]*)/$1;class/g;
s/o__([a-zA-Z0-9_ -]*)/$1;order/g;
s/f__([a-zA-Z0-9_ -]*)/$1;family/g;
s/g__([a-zA-Z0-9_ -]*)/$1;genus/g;
s/;$//g' sh_general_release_dynamic_10.05.2021.fasta > sh_general_release_dynamic_10.05.2021.kraken.fasta
Hi everyone,
I was also trying to use the UNITE database with Kraken2 and I ended up modifying the Greengenes script 16S_gg_installation.sh. Please find the script attached.
I successfully created 2 databases with files from the shown categories (version 10).
The attached script should be placed in the kraken2 installation folder, and you should add that folder to your PATH variable:
export PATH=$PATH:/path/to/k2-installation
This should work out of the box. I would appreciate any feedback.
Hello, along the same lines as issue #94, I was hoping that the UNITE ITS database could be included as a special database in Kraken2. The FASTA headers contain the scientific name, the GenBank accession number, a UNITE identifier, and the taxonomic lineage. This taxonomy could potentially be parsed from the headers and formatted in a similar way to the 16S databases. An example of one of the headers is:
>Symbiotaphrina_buchneri|DQ248313|SH1641879.08FU|reps|k__Fungi;p__Ascomycota;c__Xylonomycetes;o__Symbiotaphrinales;f__Symbiotaphrinaceae;g__Symbiotaphrina;s__Symbiotaphrina_buchneri
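As a rough illustration of parsing such a header, the sketch below rewrites it into the tab-separated Lineage=Root;rootrank;... form used by the sed commands elsewhere in this thread. It assumes five pipe-delimited fields, keeps only the accession as the sequence ID, maps k to domain as the thread does, and drops the species level. The name unite_to_rdp is hypothetical, and the output has not been checked against build_rdp_taxonomy.pl:

```shell
# Hypothetical converter: UNITE header -> ">ACCESSION<TAB>Lineage=Root;rootrank;Name;rank;..."
# Sequence lines are passed through unchanged.
unite_to_rdp() {
  awk -F'|' '
    BEGIN {
      rank["k"] = "domain"; rank["p"] = "phylum"; rank["c"] = "class"
      rank["o"] = "order";  rank["f"] = "family"; rank["g"] = "genus"
    }
    /^>/ {
      n = split($5, taxa, ";")                 # k__Fungi;p__Ascomycota;...
      out = "Lineage=Root;rootrank"
      for (i = 1; i <= n; i++) {
        prefix = substr(taxa[i], 1, 1)         # rank letter before the "__"
        name = substr(taxa[i], 4)              # text after the "__"
        if (prefix in rank) out = out ";" name ";" rank[prefix]
      }
      print ">" $2 "\t" out                    # $2 is the accession field
      next
    }
    { print }
  ' "$@"
}

printf '%s\n' \
  '>Symbiotaphrina_buchneri|DQ248313|SH1641879.08FU|reps|k__Fungi;p__Ascomycota;c__Xylonomycetes;o__Symbiotaphrinales;f__Symbiotaphrinaceae;g__Symbiotaphrina;s__Symbiotaphrina_buchneri' \
  'ACGT' | unite_to_rdp
# prints (with a real tab after the accession):
# >DQ248313<TAB>Lineage=Root;rootrank;Fungi;domain;Ascomycota;phylum;Xylonomycetes;class;Symbiotaphrinales;order;Symbiotaphrinaceae;family;Symbiotaphrina;genus
# ACGT
```

Doing the whole conversion in one awk pass avoids depending on the order of many chained substitutions, at the cost of hard-coding the field positions.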