Flextaxd creates empty kraken db

SilasK commented 2 years ago

I try to build a custom kreken db flextaxd creates an empty kreken db and does not even fail.

I have latest flextaxd 0.4.2 and kraken 2.1.2

Could you help me to solve this issue.

2021-11-24 08:24:45,172 create_databases [INFO ]  Loading module: CreateKrakenDatabase
2021-11-24 08:24:45,698 create_databases [INFO ]  Get genomes from input directory!
2021-11-24 08:24:45,700 DatabaseConnection [DEBUG]  Connecting to uhgg/flextaxd/flextaxd.ftd
2021-11-24 08:24:45,708 DatabaseConnection [INFO ]  uhgg/flextaxd/flextaxd.ftd opened successfully.
2021-11-24 08:24:45,709 DatabaseConnection [DEBUG]  cursor created.
2021-11-24 08:24:45,710 DatabaseConnection [DEBUG]  Load DatabaseFunctions
2021-11-24 08:24:45,726 DatabaseConnection [DEBUG]  SELECT id,name FROM nodes
2021-11-24 08:24:45,739 CreateKrakenDatabase [INFO ]  /srv/beegfs/scratch/users/k/kiesers/Kraken/uhgg
2021-11-24 08:24:45,740 create_databases [INFO ]  --- process finished in 0 minutes 0.8386135101318359 seconds---

2021-11-24 08:24:45,742 CreateKrakenDatabase [INFO ]  Create library directory
2021-11-24 08:24:45,744 CreateKrakenDatabase [INFO ]  Processing files; create kraken seq.map
2021-11-24 08:27:27,430 CreateKrakenDatabase [INFO ]  Number of genomes succesfully added to the kraken2 database: 4616
2021-11-24 08:27:27,438 create_databases [INFO ]  Genome folder preprocessing completed!
2021-11-24 08:27:27,439 create_databases [INFO ]  --- process finished in 2 minutes 42.5374813079834 seconds---

2021-11-24 08:27:27,440 create_databases [INFO ]  Create database
2021-11-24 08:27:27,442 CreateKrakenDatabase [INFO ]  mkdir -p /srv/beegfs/scratch/users/k/kiesers/Kraken/uhgg/taxonomy
2021-11-24 08:27:27,456 CreateKrakenDatabase [INFO ]  cp uhgg/flextaxd/*.dmp /srv/beegfs/scratch/users/k/kiesers/Kraken/uhgg/taxonomy
2021-11-24 08:27:27,727 CreateKrakenDatabase [INFO ]  cp uhgg/flextaxd/*.map /srv/beegfs/scratch/users/k/kiesers/Kraken/uhgg
2021-11-24 08:27:27,728 CreateKrakenDatabase [INFO ]  kraken2-build --build --db /srv/beegfs/scratch/users/k/kiesers/Kraken/uhgg  --threads 8
Creating sequence ID to taxonomy ID map (step 1)...
Sequence ID to taxonomy ID map complete. [0.489s]
Estimating required capacity (step 2)...
Estimated hash table requirement: 16802807220 bytes
Capacity estimation complete. [55.311s]
Building database files (step 3)...
Taxonomy parsed and converted.
CHT created with 13 bits reserved for taxid.
Completed processing of 0 sequences, 0 bp
Writing data to disk...  complete.
Database files completed. [1m6.379s]
Database construction complete. [Total: 2m2.357s]
2021-11-24 08:29:33,807 CreateKrakenDatabase [INFO ]  Create inspect file!
2021-11-24 08:29:51,538 CreateKrakenDatabase [INFO ]  kraken2 database created
2021-11-24 08:29:51,539 create_databases [INFO ]  --- Time summary  5 minutes 6.63733434677124 seconds---
2021-11-24 08:29:51,539 create_databases [DEBUG]  1637738991.53929

SilasK commented 2 years ago

all the files nodes.dmp library.fna seem ok

@davve2 could, please you help me with this.

SilasK commented 2 years ago

I reverted back to version

custom_taxonomy_databases: version 0.3.5 Kraken version 2.1.1

and now it seems to work. I'm happy to give infos if it can help to solve it.

SilasK commented 2 years ago

By the way. also the above log comes from the stderr. I think it would be good to have the kreken-build log also in the flextaxd logfile not only in stderr.

davve2 commented 2 years ago

Thank you SilasK,

Good suggestion to include the kraken-build log, I will add that to the next version of flextaxd.

I will look into this error, did you change anything in the source files between the two runs or were all files, taxonomy, library etc the same? If not I will try to produce files locally from the two versions and see if I can reproduce the error (and the completed database).

Can you replicate the error using the --test parameter? This will help a lot during debug (It uses only a handfull of genomes to run through the pipeline within a few minutes).

SilasK commented 2 years ago

I did both rerunning flextaxd-create from the already existing library and mapping (I don't know what files the script updates and which one not).

But it run into the same error. Only if create the flextaxd database with the older version the kraken db is build correctyl.

By the way here is my code https://github.com/SilasK/Kraken/blob/master/workflow/build.smk

I use a snakemake and usually start from a green genes formated file.

davidsundell commented 2 years ago

Dear @SilasK

I´m still working on this issue, I may have found a bug related to the import of greengenes official file. Sometimes emtpy nodes (g;s) will lead to an annotation of a node "" of multiple genomes, I´ve added a solution locally and will push an update, but this is unrelated to your problem. Preferably I want to understand and add a solution to this issue as well before I push the update. However, I cannot replicate the issue that you have, with kraken not use the files on disk.

How is your genome structure looking? At the moment (unfortunately, it is on my own list of updates) the program cannot take one single large file of genomes. This is originally due to the structure of the NCBI genomeid2taxid file that doesn´t give you an identification to the genome name.

I have worked with greengenes and have it working locally using the following structure on my input genomes (splitting the original fasta file into files with "taxid.fasta.gz"

genomes
├── 1000000.fasta.gz
├── 1000001.fasta.gz
├── 1000002.fasta.gz
├── 1000003.fasta.gz
├── 1000004.fasta.gz
├── 1000005.fasta.gz
├── 1000006.fasta.gz
├── 1000007.fasta.gz
├── 1000008.fasta.gz
├── 1000009.fasta.gz

Please let me know if this is of any help and otherwise perhaps you can supply some example data that I can work with to replicate the problem.

Kind regards, David

mw55309 commented 2 years ago

Just to say, I got a similar result as @SilasK , but from a different route.

I basically created the custom taxonomy with flextaxd, then manually edited the genome fastas to have the relevant "kraken:taxid" in the header

Creating sequence ID to taxonomy ID map (step 1)...
Sequence ID to taxonomy ID map complete. [0.362s]
Estimating required capacity (step 2)...
Estimated hash table requirement: 37427200 bytes
Capacity estimation complete. [2.154s]
Building database files (step 3)...
Taxonomy parsed and converted.
CHT created with 6 bits reserved for taxid.
Completed processing of 0 sequences, 0 bp
Writing data to disk...  complete.
Database files completed. [1.754s]
Database construction complete. [Total: 5.115s]

The kraken db "looks" OK:

drwxrwxrwx 1 ubuntu ubuntu     4096 Apr 28 16:31 library
drwxrwxrwx 1 ubuntu ubuntu     4096 Apr 28 16:31 taxonomy
-rwxrwxrwx 1 ubuntu ubuntu    39643 Apr 28 16:31 seqid2taxid.map
-rwxrwxrwx 1 ubuntu ubuntu     2506 Apr 28 16:31 taxo.k2d
-rwxrwxrwx 1 ubuntu ubuntu 37427232 Apr 28 16:31 hash.k2d
-rwxrwxrwx 1 ubuntu ubuntu       64 Apr 28 16:31 opts.k2d

seqid2taxid.map is populated:

CHK98__C4784_L=3121=|kraken:taxid|65    65
CHK98__C8377_L=39070=|kraken:taxid|65   65
CHK98__C19089_L=5205=|kraken:taxid|65   65
CHK98__C24437_L=26986=|kraken:taxid|65  65
CHK98__C30485_L=14718=|kraken:taxid|65  65
CHK98__C33697_L=12368=|kraken:taxid|65  65
etc

Yet the build clearly didn't work:

Completed processing of 0 sequences, 0 bp

I am using

custom_taxonomy_databases: version 0.4.2
Kraken version 2.1.1

Has anything changed in the way flextaxd dumps the taxonomy to file?

mw55309 commented 2 years ago

Downgraded to

custom_taxonomy_databases: version 0.3.5

and got a successful build

Creating sequence ID to taxonomy ID map (step 1)...
Sequence ID to taxonomy ID map complete. [0.179s]
Estimating required capacity (step 2)...
Estimated hash table requirement: 37427200 bytes
Capacity estimation complete. [1.249s]
Building database files (step 3)...
Taxonomy parsed and converted.
CHT created with 6 bits reserved for taxid.
Completed processing of 911 sequences, 20708776 bp
Writing data to disk...  complete.
Database files completed. [5.798s]
Database construction complete. [Total: 7.487s]

Crucailly this bit:

Completed processing of 911 sequences, 20708776 bp

So what changed between the dump of nodes.dmp and names.dmp between the two versions?

What I can see is that the newer .dmp files have one extra line:

#newer
  1765 innovad_simple_kraken2/taxonomy/names.dmp
  1765 innovad_simple_kraken2/taxonomy/nodes.dmp

#older 
  1764 innovad_simple_kraken2_old/taxonomy/names.dmp
  1764 innovad_simple_kraken2_old/taxonomy/nodes.dmp

mw55309 commented 2 years ago

There are two root nodes in the newer names.dmp:

1       |       root    |               |       scientific name |
2       |       root    |               |       scientific name |
3       |       cellular organisms      |               |       scientific name |
4       |       Bacteria        |               |       scientific name |
5       |       Eukaryota       |               |       scientific name |
6       |       Archaea |               |       scientific name |
7       |       Viruses |               |       scientific name |
8       |       Other   |               |       scientific name |
9       |       Unclassified    |               |       scientific name |
10      |       Methanobacteriota       |               |       scientific name |

Only one in the older

1       |       root    |               |       scientific name |
2       |       cellular organisms      |               |       scientific name |
3       |       Bacteria        |               |       scientific name |
4       |       Eukaryota       |               |       scientific name |
5       |       Archaea |               |       scientific name |
6       |       Viruses |               |       scientific name |
7       |       Other   |               |       scientific name |
8       |       Unclassified    |               |       scientific name |
9       |       Methanobacteriota       |               |       scientific name |
10      |       Methanobacteria |               |       scientific name |

mw55309 commented 2 years ago

Looking at nodes.dmp for the new version (where names.dmp has two root nodes) it looks like the first root node has nothing hanging off it:

1       |       1       |       no rank |               |               |
2       |       2       |       no rank |               |               |
3       |       2       |       no rank |               |               |
4       |       3       |       superkingdom    |               |               |
5       |       3       |       superkingdom    |               |               |
6       |       3       |       superkingdom    |               |               |
7       |       2       |       superkingdom    |               |               |
8       |       2       |       no rank |               |               |
9       |       2       |       no rank |               |               |
10      |       6       |       phylum  |               |               |
11      |       10      |       class   |               |               |
12      |       11      |       order   |               |               |
13      |       12      |       family  |               |               |
14      |       13      |       genus   |               |               |
15      |       14      |       species |               |               |

So I would suggest this is the first place to look for the bug

davve2 commented 2 years ago

Dear Mick,

Thanks for the information, I located how the bug happens and have implemented a fix. I have a few additional updates coming very soon which will include a bugfix for this issue. I hope to get it updated today or during next week.

Best, David

mw55309 commented 2 years ago

Thanks @davve2, keep up the excellent work and thanks for flextaxd!

SilasK commented 2 years ago

Did you manage to fix the bug?

davve2 commented 2 years ago

Did you manage to fix the bug?

Yes this should now be resolved with the latest release (v0.4.3), it was created by two minor bugs, one adding two roots on join, another leading to incorrect taxonomy levels. However, it is also important to remember to export the taxonomy using --dbprogram kraken2. Otherwise the final database will not retain the information (Kraken trims at minimum one column from the right, the default export format for flextaxd contain the node information in the last column).

FOI-Bioinformatics / flextaxd

Flextaxd creates empty kraken db #48