FOI-Bioinformatics / flextaxd

FlexTaxD (Flexible Taxonomy Databases) - Create, add, merge different taxonomy sources (QIIME, GTDB, NCBI and more) and create metagenomic databases (kraken2, ganon and more )
GNU General Public License v3.0
64 stars 7 forks source link

Create completely custom Kraken2 database #51

Closed mw55309 closed 2 years ago

mw55309 commented 2 years ago

Many thanks for creating flextaxd, I was looking in the documentation for how to build a kraken2 database from a completely custom taxonomy, and I couldn't figure it out.

Let's say for example I have two custom genomes, genome1.fasta and genome2.fasta

My taxonomy.tsv looks like this:

genome1.fasta   d__Bacteria;p__Firmicutes;c__Bacilli;o__Lactobacillales;f__Streptococcaceae;g__Streptococcus;s__Streptococcus pneumoniae
genome2.fasta   d__Bacteria;p__Firmicutes;c__Bacilli;o__Staphylococcales;f__Staphylococcaceae;g__Staphylococcus;s__Staphylococcus aureus

My annotation.tsv looks like this:

genome1.fasta   Streptococcus pneumoniae
genome2.fasta   Staphylococcus aureus

I can build the flextaxd DB and dump out kraken2 format taxonomy like this:

flextaxd --database test.db --taxonomy_file taxonomy.tsv --taxonomy_type QIIME --genomeid2taxid annotation.tsv --genomes_path genomes

flextaxd --database test.db --dbprogram kraken2 -o taxonomy --dump --genomes_path genomes

BUT this just dumps out kraken2 taxonomy, as far as I can tell, I still need to manually edit the genome files to have the correct taxonomy ID in their header, correct?

Am I missing something? Is this functionality already in flextaxd?

Thanks Mick

davve2 commented 2 years ago

Dear Mick,

Sorry for a delayed response, it has been some busy days.

I went through your code and found an error. However in your example you are missing the actual create database part of flextaxd? So flextaxd does have the functionality to add kraken headers before creating the database. Just make sure that your whole dataset has unique sequence ids.

The following command will create a kraken2 database flextaxd-create --database test.db --dbprogram kraken2 -o taxonomy --genomes_path genomes --create_db --db_name testkraken_db kraken2

The error that I found was caused by a bug where i use strip() instead of rstrip() for gzip extensions, hence your genome will be renamed to enome and not match. Using genome names not starting with small g did work.

I will try to get out a new release that solved the bug. Until then if you need to use small g(or z) at the start of your sequences please use the branch name_bug_issue51_fix

davve2 commented 2 years ago

Using the example files from above I get the following

## Create database
flextaxd --database test.db --taxonomy_file taxonomy.tsv --taxonomy_type QIIME --genomeid2taxid annotation.tsv --genomes_path genomes

## dump taxonomy file
flextaxd --database test.db --dbprogram kraken2 -o taxonomy --dump

## Create kraken_db
flextaxd-create --database test.db --dbprogram kraken2 -o taxonomy --genomes_path genomes --create_db --db_name testkrakendb --dbprogram kraken2

The resulting inspect file is the following (using one ATCC genome from the respective species from NCBI as sequence in genome1.fasta and genome2.fasta resp)

 0.00  0       0       U       0       unclassified
100.00  1566478 0       R       1       root
100.00  1566478 0       R1      2         cellular organisms
100.00  1566478 0       D       3           Bacteria
100.00  1566478 0       P       9             Firmicutes
100.00  1566478 311     C       10              Bacilli
 57.32  897861  0       O       15                Staphylococcales
 57.32  897861  0       F       16                  Staphylococcaceae
 57.32  897861  0       G       17                    Staphylococcus
 57.32  897861  897861  S       18                      Staphylococcus aureus
 42.66  668306  0       O       11                Lactobacillales
 42.66  668306  0       F       12                  Streptococcaceae
 42.66  668306  0       G       13                    Streptococcus
 42.66  668306  668306  S       14                      Streptococcus pneumoniae
mw55309 commented 2 years ago

Thanks for this, and sorry for the delay!

When I run your final command

## Create kraken_db
flextaxd-create --database test.db --dbprogram kraken2 -o taxonomy --genomes_path genomes --create_db --db_name testkrakendb --dbprogram kraken2

I get:

2022-04-28 11:06:02,848 CreateKrakenDatabase [ERROR]  Genome names are missing
Traceback (most recent call last):
  File "/home/ubuntu/miniconda3/envs/flextaxd/bin/flextaxd-create", line 10, in <module>
    sys.exit(main())
  File "/home/ubuntu/miniconda3/envs/flextaxd/lib/python3.10/site-packages/flextaxd/create_databases.py", line 236, in main
    classifierDB.create_library_from_files()
  File "/home/ubuntu/miniconda3/envs/flextaxd/lib/python3.10/site-packages/flextaxd/modules/CreateKrakenDatabase.py", line 239, in create_library_from_files
    self.genome_names_split = self._split(self.genome_names,self.processes)
AttributeError: 'CreateKrakenDatabase' object has no attribute 'genome_names'

Any idea what I am doing differently to you?

mw55309 commented 2 years ago

I have no doubt this is some weird way I have formatted annotations.tsv and taxonomy.tsv but knowing which is right would be great!

Here is my taxonomy.tsv

genome1.fa d__Bacteria;p__Firmicutes;c__Bacilli;o__Lactobacillales;f__Streptococcaceae;g__Streptococcus;s__Streptococcus pneumoniae
genome2.fa d__Bacteria;p__Firmicutes;c__Bacilli;o__Staphylococcales;f__Staphylococcaceae;g__Staphylococcus;s__Staphylococcus aureus

Here is my annotations.tsv

genome1.fa   Streptococcus pneumoniae
genome2.fa   Staphylococcus aureus

I have tried with and without the .fa extension and nothing seems to work :(

I have a file of the missing genomes:

$ more taxonomy/FlexTaxD.missing
genome1.fa
genome2.fa

And my sqllite3 dump is:

PRAGMA foreign_keys=OFF;
BEGIN TRANSACTION;
CREATE TABLE nodes (
                                            id integer PRIMARY KEY,
                                            name text NOT NULL
                                        );
INSERT INTO nodes VALUES(1,'root');
INSERT INTO nodes VALUES(2,'root');
INSERT INTO nodes VALUES(3,'cellular organisms');
INSERT INTO nodes VALUES(4,'Bacteria');
INSERT INTO nodes VALUES(5,'Eukaryota');
INSERT INTO nodes VALUES(6,'Archaea');
INSERT INTO nodes VALUES(7,'Viruses');
INSERT INTO nodes VALUES(8,'Other');
INSERT INTO nodes VALUES(9,'Unclassified');
INSERT INTO nodes VALUES(10,'Firmicutes');
INSERT INTO nodes VALUES(11,'Bacilli');
INSERT INTO nodes VALUES(12,'Lactobacillales');
INSERT INTO nodes VALUES(13,'Streptococcaceae');
INSERT INTO nodes VALUES(14,'Streptococcus');
INSERT INTO nodes VALUES(15,'Streptococcus pneumoniae');
INSERT INTO nodes VALUES(16,'Staphylococcales');
INSERT INTO nodes VALUES(17,'Staphylococcaceae');
INSERT INTO nodes VALUES(18,'Staphylococcus');
INSERT INTO nodes VALUES(19,'Staphylococcus aureus');
CREATE TABLE tree (
                                        parent integer NOT NULL,
                                        child integer NOT NULL,
                                        rank_i integer,
                                        FOREIGN KEY (parent) REFERENCES nodes (id),
                                        FOREIGN KEY (child) REFERENCES nodes (id),
                                        FOREIGN KEY (rank_i) REFERENCES rank (rank_i),
                                        unique (parent, child,rank_i)
                                    );
INSERT INTO tree VALUES(1,1,1);
INSERT INTO tree VALUES(2,2,2);
INSERT INTO tree VALUES(2,3,2);
INSERT INTO tree VALUES(3,4,3);
INSERT INTO tree VALUES(3,5,3);
INSERT INTO tree VALUES(3,6,3);
INSERT INTO tree VALUES(2,7,3);
INSERT INTO tree VALUES(2,8,2);
INSERT INTO tree VALUES(2,9,2);
INSERT INTO tree VALUES(4,10,9);
INSERT INTO tree VALUES(10,11,8);
INSERT INTO tree VALUES(11,12,7);
INSERT INTO tree VALUES(12,13,6);
INSERT INTO tree VALUES(13,14,5);
INSERT INTO tree VALUES(14,15,4);
INSERT INTO tree VALUES(11,16,7);
INSERT INTO tree VALUES(16,17,6);
INSERT INTO tree VALUES(17,18,5);
INSERT INTO tree VALUES(18,19,4);
CREATE TABLE genomes (
                                            id integer NOT NULL,
                                            genome VARCHAR(255) NOT NULL UNIQUE,
                                            FOREIGN KEY (id) REFERENCES nodes (id)
                                        );
INSERT INTO genomes VALUES(15,'genome1.fa');
INSERT INTO genomes VALUES(19,'genome2.fa');
CREATE TABLE rank (
                                        rank_i integer PRIMARY KEY,
                                        rank VARCHAR(15)
                                    );
INSERT INTO rank VALUES(1,'no rank');
INSERT INTO rank VALUES(2,'no rank');
INSERT INTO rank VALUES(3,'superkingdom');
INSERT INTO rank VALUES(4,'species');
INSERT INTO rank VALUES(5,'genus');
INSERT INTO rank VALUES(6,'family');
INSERT INTO rank VALUES(7,'order');
INSERT INTO rank VALUES(8,'class');
INSERT INTO rank VALUES(9,'phylum');
INSERT INTO rank VALUES(10,'domain');
COMMIT;
davve2 commented 2 years ago

This bug should be resolved with the new release v0.4.3