Closed mw55309 closed 2 years ago
Dear Mick,
Sorry for a delayed response, it has been some busy days.
I went through your code and found an error. However in your example you are missing the actual create database part of flextaxd? So flextaxd does have the functionality to add kraken headers before creating the database. Just make sure that your whole dataset has unique sequence ids.
The following command will create a kraken2 database
flextaxd-create --database test.db --dbprogram kraken2 -o taxonomy --genomes_path genomes --create_db --db_name testkraken_db kraken2
The error that I found was caused by a bug where i use strip() instead of rstrip() for gzip extensions, hence your genome will be renamed to enome and not match. Using genome names not starting with small g did work.
I will try to get out a new release that solved the bug. Until then if you need to use small g(or z) at the start of your sequences please use the branch name_bug_issue51_fix
Using the example files from above I get the following
## Create database
flextaxd --database test.db --taxonomy_file taxonomy.tsv --taxonomy_type QIIME --genomeid2taxid annotation.tsv --genomes_path genomes
## dump taxonomy file
flextaxd --database test.db --dbprogram kraken2 -o taxonomy --dump
## Create kraken_db
flextaxd-create --database test.db --dbprogram kraken2 -o taxonomy --genomes_path genomes --create_db --db_name testkrakendb --dbprogram kraken2
The resulting inspect file is the following (using one ATCC genome from the respective species from NCBI as sequence in genome1.fasta and genome2.fasta resp)
0.00 0 0 U 0 unclassified
100.00 1566478 0 R 1 root
100.00 1566478 0 R1 2 cellular organisms
100.00 1566478 0 D 3 Bacteria
100.00 1566478 0 P 9 Firmicutes
100.00 1566478 311 C 10 Bacilli
57.32 897861 0 O 15 Staphylococcales
57.32 897861 0 F 16 Staphylococcaceae
57.32 897861 0 G 17 Staphylococcus
57.32 897861 897861 S 18 Staphylococcus aureus
42.66 668306 0 O 11 Lactobacillales
42.66 668306 0 F 12 Streptococcaceae
42.66 668306 0 G 13 Streptococcus
42.66 668306 668306 S 14 Streptococcus pneumoniae
Thanks for this, and sorry for the delay!
When I run your final command
## Create kraken_db
flextaxd-create --database test.db --dbprogram kraken2 -o taxonomy --genomes_path genomes --create_db --db_name testkrakendb --dbprogram kraken2
I get:
2022-04-28 11:06:02,848 CreateKrakenDatabase [ERROR] Genome names are missing
Traceback (most recent call last):
File "/home/ubuntu/miniconda3/envs/flextaxd/bin/flextaxd-create", line 10, in <module>
sys.exit(main())
File "/home/ubuntu/miniconda3/envs/flextaxd/lib/python3.10/site-packages/flextaxd/create_databases.py", line 236, in main
classifierDB.create_library_from_files()
File "/home/ubuntu/miniconda3/envs/flextaxd/lib/python3.10/site-packages/flextaxd/modules/CreateKrakenDatabase.py", line 239, in create_library_from_files
self.genome_names_split = self._split(self.genome_names,self.processes)
AttributeError: 'CreateKrakenDatabase' object has no attribute 'genome_names'
Any idea what I am doing differently to you?
I have no doubt this is some weird way I have formatted annotations.tsv and taxonomy.tsv but knowing which is right would be great!
Here is my taxonomy.tsv
genome1.fa d__Bacteria;p__Firmicutes;c__Bacilli;o__Lactobacillales;f__Streptococcaceae;g__Streptococcus;s__Streptococcus pneumoniae
genome2.fa d__Bacteria;p__Firmicutes;c__Bacilli;o__Staphylococcales;f__Staphylococcaceae;g__Staphylococcus;s__Staphylococcus aureus
Here is my annotations.tsv
genome1.fa Streptococcus pneumoniae
genome2.fa Staphylococcus aureus
I have tried with and without the .fa
extension and nothing seems to work :(
I have a file of the missing genomes:
$ more taxonomy/FlexTaxD.missing
genome1.fa
genome2.fa
And my sqllite3 dump is:
PRAGMA foreign_keys=OFF;
BEGIN TRANSACTION;
CREATE TABLE nodes (
id integer PRIMARY KEY,
name text NOT NULL
);
INSERT INTO nodes VALUES(1,'root');
INSERT INTO nodes VALUES(2,'root');
INSERT INTO nodes VALUES(3,'cellular organisms');
INSERT INTO nodes VALUES(4,'Bacteria');
INSERT INTO nodes VALUES(5,'Eukaryota');
INSERT INTO nodes VALUES(6,'Archaea');
INSERT INTO nodes VALUES(7,'Viruses');
INSERT INTO nodes VALUES(8,'Other');
INSERT INTO nodes VALUES(9,'Unclassified');
INSERT INTO nodes VALUES(10,'Firmicutes');
INSERT INTO nodes VALUES(11,'Bacilli');
INSERT INTO nodes VALUES(12,'Lactobacillales');
INSERT INTO nodes VALUES(13,'Streptococcaceae');
INSERT INTO nodes VALUES(14,'Streptococcus');
INSERT INTO nodes VALUES(15,'Streptococcus pneumoniae');
INSERT INTO nodes VALUES(16,'Staphylococcales');
INSERT INTO nodes VALUES(17,'Staphylococcaceae');
INSERT INTO nodes VALUES(18,'Staphylococcus');
INSERT INTO nodes VALUES(19,'Staphylococcus aureus');
CREATE TABLE tree (
parent integer NOT NULL,
child integer NOT NULL,
rank_i integer,
FOREIGN KEY (parent) REFERENCES nodes (id),
FOREIGN KEY (child) REFERENCES nodes (id),
FOREIGN KEY (rank_i) REFERENCES rank (rank_i),
unique (parent, child,rank_i)
);
INSERT INTO tree VALUES(1,1,1);
INSERT INTO tree VALUES(2,2,2);
INSERT INTO tree VALUES(2,3,2);
INSERT INTO tree VALUES(3,4,3);
INSERT INTO tree VALUES(3,5,3);
INSERT INTO tree VALUES(3,6,3);
INSERT INTO tree VALUES(2,7,3);
INSERT INTO tree VALUES(2,8,2);
INSERT INTO tree VALUES(2,9,2);
INSERT INTO tree VALUES(4,10,9);
INSERT INTO tree VALUES(10,11,8);
INSERT INTO tree VALUES(11,12,7);
INSERT INTO tree VALUES(12,13,6);
INSERT INTO tree VALUES(13,14,5);
INSERT INTO tree VALUES(14,15,4);
INSERT INTO tree VALUES(11,16,7);
INSERT INTO tree VALUES(16,17,6);
INSERT INTO tree VALUES(17,18,5);
INSERT INTO tree VALUES(18,19,4);
CREATE TABLE genomes (
id integer NOT NULL,
genome VARCHAR(255) NOT NULL UNIQUE,
FOREIGN KEY (id) REFERENCES nodes (id)
);
INSERT INTO genomes VALUES(15,'genome1.fa');
INSERT INTO genomes VALUES(19,'genome2.fa');
CREATE TABLE rank (
rank_i integer PRIMARY KEY,
rank VARCHAR(15)
);
INSERT INTO rank VALUES(1,'no rank');
INSERT INTO rank VALUES(2,'no rank');
INSERT INTO rank VALUES(3,'superkingdom');
INSERT INTO rank VALUES(4,'species');
INSERT INTO rank VALUES(5,'genus');
INSERT INTO rank VALUES(6,'family');
INSERT INTO rank VALUES(7,'order');
INSERT INTO rank VALUES(8,'class');
INSERT INTO rank VALUES(9,'phylum');
INSERT INTO rank VALUES(10,'domain');
COMMIT;
This bug should be resolved with the new release v0.4.3
Many thanks for creating flextaxd, I was looking in the documentation for how to build a kraken2 database from a completely custom taxonomy, and I couldn't figure it out.
Let's say for example I have two custom genomes, genome1.fasta and genome2.fasta
My taxonomy.tsv looks like this:
My annotation.tsv looks like this:
I can build the flextaxd DB and dump out kraken2 format taxonomy like this:
BUT this just dumps out kraken2 taxonomy, as far as I can tell, I still need to manually edit the genome files to have the correct taxonomy ID in their header, correct?
Am I missing something? Is this functionality already in flextaxd?
Thanks Mick