Closed SilasK closed 2 years ago
All the files (nodes.dmp, library.fna) seem OK.
@davve2 could you please help me with this?
I reverted back to version
custom_taxonomy_databases: version 0.3.5 Kraken version 2.1.1
and now it seems to work. I'm happy to provide more information if it helps to solve this.
By the way, the above log comes from stderr. I think it would be good to also have the kraken-build log in the flextaxd logfile, not only in stderr.
Thank you SilasK,
Good suggestion to include the kraken-build log, I will add that to the next version of flextaxd.
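One way to get the kraken-build output into a logfile is to read the child process's stderr line by line and pass each line to a logger. This is only a minimal sketch of the idea, not how flextaxd does it; `run_and_log` is a hypothetical helper name.

```python
import logging
import subprocess


def run_and_log(cmd, logger):
    """Run cmd, copy every stderr line into logger, and return the exit code.

    Hypothetical helper for illustration; not part of flextaxd.
    """
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.PIPE,
        text=True,
    )
    # Reading line by line keeps the log in step with a long-running build.
    for line in proc.stderr:
        logger.info(line.rstrip("\n"))
    return proc.wait()
```

The logger can then be given a `logging.FileHandler` pointing at the same logfile the rest of the pipeline writes to, so the build steps end up next to the other messages.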
I will look into this error. Did you change anything in the source files between the two runs, or were all files (taxonomy, library etc.) the same? If not, I will try to produce files locally from the two versions and see if I can reproduce the error (and the completed database).
Can you replicate the error using the --test parameter? This will help a lot during debugging (it uses only a handful of genomes to run through the pipeline within a few minutes).
I tried rerunning flextaxd-create from the already existing library and mapping (I don't know which files the script updates and which it doesn't).
But it ran into the same error. Only if I create the flextaxd database with the older version is the kraken db built correctly.
By the way, here is my code: https://github.com/SilasK/Kraken/blob/master/workflow/build.smk
I use Snakemake and usually start from a Greengenes-formatted file.
Dear @SilasK
I'm still working on this issue. I may have found a bug related to the import of the official Greengenes file: sometimes empty nodes (g;s) lead to multiple genomes being annotated to a node named "". I've added a solution locally and will push an update, but this is unrelated to your problem. Preferably I want to understand and fix your issue as well before I push the update. However, I cannot replicate the issue that you have, with kraken not using the files on disk.
How is your genome directory structured? At the moment (unfortunately, it is on my own list of updates) the program cannot take one single large file of genomes. This is originally due to the structure of the NCBI genomeid2taxid file, which doesn't give you an identifier that links to the genome name.
I have worked with Greengenes and have it working locally using the following structure for my input genomes (splitting the original fasta file into files named "taxid.fasta.gz"):
genomes
├── 1000000.fasta.gz
├── 1000001.fasta.gz
├── 1000002.fasta.gz
├── 1000003.fasta.gz
├── 1000004.fasta.gz
├── 1000005.fasta.gz
├── 1000006.fasta.gz
├── 1000007.fasta.gz
├── 1000008.fasta.gz
├── 1000009.fasta.gz
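A layout like the one above can be produced from one large FASTA file with a short script. The sketch below assumes you already have a mapping from sequence ID (the first word of each header) to taxid; `split_fasta_by_taxid` and its arguments are hypothetical names, not part of flextaxd.

```python
import gzip
from pathlib import Path


def split_fasta_by_taxid(fasta_path, seq2taxid, out_dir):
    """Write each record of fasta_path to out_dir/<taxid>.fasta.gz.

    seq2taxid maps the first word of a FASTA header to a taxid.
    Records without a mapping are skipped. Returns the set of taxids written.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    handles = {}  # one open gzip handle per taxid
    current = None
    try:
        with open(fasta_path) as fasta:
            for line in fasta:
                if line.startswith(">"):
                    words = line[1:].split()
                    seq_id = words[0] if words else ""
                    taxid = seq2taxid.get(seq_id)
                    if taxid is None:
                        current = None
                    else:
                        if taxid not in handles:
                            path = out_dir / f"{taxid}.fasta.gz"
                            handles[taxid] = gzip.open(path, "wt")
                        current = handles[taxid]
                if current is not None:
                    current.write(line)
    finally:
        for handle in handles.values():
            handle.close()
    return set(handles)
```

Note that this keeps one file handle open per taxid, so a very large input may hit the operating system's open-file limit; in that case the records would need to be grouped by taxid first.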
Please let me know if this is of any help and otherwise perhaps you can supply some example data that I can work with to replicate the problem.
Kind regards, David
Just to say, I got a similar result as @SilasK , but from a different route.
I basically created the custom taxonomy with flextaxd, then manually edited the genome fastas to have the relevant "kraken:taxid" in the header
Creating sequence ID to taxonomy ID map (step 1)...
Sequence ID to taxonomy ID map complete. [0.362s]
Estimating required capacity (step 2)...
Estimated hash table requirement: 37427200 bytes
Capacity estimation complete. [2.154s]
Building database files (step 3)...
Taxonomy parsed and converted.
CHT created with 6 bits reserved for taxid.
Completed processing of 0 sequences, 0 bp
Writing data to disk... complete.
Database files completed. [1.754s]
Database construction complete. [Total: 5.115s]
The kraken db "looks" OK:
drwxrwxrwx 1 ubuntu ubuntu 4096 Apr 28 16:31 library
drwxrwxrwx 1 ubuntu ubuntu 4096 Apr 28 16:31 taxonomy
-rwxrwxrwx 1 ubuntu ubuntu 39643 Apr 28 16:31 seqid2taxid.map
-rwxrwxrwx 1 ubuntu ubuntu 2506 Apr 28 16:31 taxo.k2d
-rwxrwxrwx 1 ubuntu ubuntu 37427232 Apr 28 16:31 hash.k2d
-rwxrwxrwx 1 ubuntu ubuntu 64 Apr 28 16:31 opts.k2d
seqid2taxid.map is populated:
CHK98__C4784_L=3121=|kraken:taxid|65 65
CHK98__C8377_L=39070=|kraken:taxid|65 65
CHK98__C19089_L=5205=|kraken:taxid|65 65
CHK98__C24437_L=26986=|kraken:taxid|65 65
CHK98__C30485_L=14718=|kraken:taxid|65 65
CHK98__C33697_L=12368=|kraken:taxid|65 65
etc
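For anyone following the same route, the sequence IDs above carry the taxid as a `|kraken:taxid|<taxid>` suffix. A minimal sketch of adding that suffix to a header line instead of editing by hand (`tag_header` is a hypothetical helper, and it assumes the ID and description are separated by a space):

```python
def tag_header(header, taxid):
    """Return a FASTA header with '|kraken:taxid|<taxid>' appended to the sequence ID.

    Headers that already carry a kraken:taxid tag are returned unchanged.
    The trailing newline, if any, is stripped.
    """
    body = header.rstrip("\n")
    if not body.startswith(">"):
        raise ValueError("not a FASTA header line")
    seq_id, _, description = body.partition(" ")
    if "|kraken:taxid|" in seq_id:
        return body
    tagged = f"{seq_id}|kraken:taxid|{taxid}"
    return f"{tagged} {description}" if description else tagged
```

For example, `tag_header(">CHK98__C4784_L=3121=", 65)` gives `>CHK98__C4784_L=3121=|kraken:taxid|65`, which matches the IDs in the map above.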
Yet the build clearly didn't work:
Completed processing of 0 sequences, 0 bp
I am using
custom_taxonomy_databases: version 0.4.2
Kraken version 2.1.1
Has anything changed in the way flextaxd dumps the taxonomy to file?
Downgraded to
custom_taxonomy_databases: version 0.3.5
and got a successful build
Creating sequence ID to taxonomy ID map (step 1)...
Sequence ID to taxonomy ID map complete. [0.179s]
Estimating required capacity (step 2)...
Estimated hash table requirement: 37427200 bytes
Capacity estimation complete. [1.249s]
Building database files (step 3)...
Taxonomy parsed and converted.
CHT created with 6 bits reserved for taxid.
Completed processing of 911 sequences, 20708776 bp
Writing data to disk... complete.
Database files completed. [5.798s]
Database construction complete. [Total: 7.487s]
Crucially, this bit:
Completed processing of 911 sequences, 20708776 bp
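Since the build reports success even when it processes nothing, a wrapper can check this line itself. The sketch below only parses the log text shown in this thread; the function names are hypothetical and it assumes the "Completed processing of N sequences, M bp" wording stays the same.

```python
import re

# Matches the summary line printed in step 3 of the build log.
_PROCESSED = re.compile(r"Completed processing of (\d+) sequences, (\d+) bp")


def sequences_processed(log_text):
    """Return (sequences, bp) from the last summary line, or None if absent."""
    matches = _PROCESSED.findall(log_text)
    if not matches:
        return None
    seqs, bp = matches[-1]
    return int(seqs), int(bp)


def check_build_log(log_text):
    """Raise RuntimeError if the build log shows that no sequences were added."""
    result = sequences_processed(log_text)
    if result is None or result[0] == 0:
        raise RuntimeError("kraken2-build processed no sequences; database is empty")
    return result
```

In a Snakemake rule this could run on the captured stderr right after the build, so an empty database fails the rule instead of passing silently.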
So what changed in the dump of nodes.dmp and names.dmp between the two versions?
What I can see is that the newer .dmp files have one extra line:
#newer
1765 innovad_simple_kraken2/taxonomy/names.dmp
1765 innovad_simple_kraken2/taxonomy/nodes.dmp
#older
1764 innovad_simple_kraken2_old/taxonomy/names.dmp
1764 innovad_simple_kraken2_old/taxonomy/nodes.dmp
There are two root nodes in the newer names.dmp:
1 | root | | scientific name |
2 | root | | scientific name |
3 | cellular organisms | | scientific name |
4 | Bacteria | | scientific name |
5 | Eukaryota | | scientific name |
6 | Archaea | | scientific name |
7 | Viruses | | scientific name |
8 | Other | | scientific name |
9 | Unclassified | | scientific name |
10 | Methanobacteriota | | scientific name |
Only one in the older:
1 | root | | scientific name |
2 | cellular organisms | | scientific name |
3 | Bacteria | | scientific name |
4 | Eukaryota | | scientific name |
5 | Archaea | | scientific name |
6 | Viruses | | scientific name |
7 | Other | | scientific name |
8 | Unclassified | | scientific name |
9 | Methanobacteriota | | scientific name |
10 | Methanobacteria | | scientific name |
Looking at nodes.dmp for the new version (where names.dmp has two root nodes) it looks like the first root node has nothing hanging off it:
1 | 1 | no rank | | |
2 | 2 | no rank | | |
3 | 2 | no rank | | |
4 | 3 | superkingdom | | |
5 | 3 | superkingdom | | |
6 | 3 | superkingdom | | |
7 | 2 | superkingdom | | |
8 | 2 | no rank | | |
9 | 2 | no rank | | |
10 | 6 | phylum | | |
11 | 10 | class | | |
12 | 11 | order | | |
13 | 12 | family | | |
14 | 13 | genus | | |
15 | 14 | species | | |
So I would suggest this is the first place to look for the bug
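The same check can be scripted. A minimal sketch, assuming the `taxid | parent | rank | ...` layout shown above and treating any node that is its own parent as a root (`find_roots` is a hypothetical helper name):

```python
from collections import Counter


def parse_nodes(lines):
    """Yield (taxid, parent_taxid) pairs from nodes.dmp-style lines."""
    for line in lines:
        fields = [field.strip() for field in line.split("|")]
        if len(fields) < 2 or not fields[0]:
            continue  # skip blank or malformed lines
        yield int(fields[0]), int(fields[1])


def find_roots(lines):
    """Return (roots, childless_roots).

    A root is a node whose parent is itself; a childless root has
    no other node pointing at it.
    """
    pairs = list(parse_nodes(lines))
    roots = [taxid for taxid, parent in pairs if taxid == parent]
    children = Counter(parent for taxid, parent in pairs if taxid != parent)
    childless = [root for root in roots if children[root] == 0]
    return roots, childless
```

Run on the first lines of the newer nodes.dmp above, this returns `([1, 2], [1])`: two roots, of which node 1 has nothing attached.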
Dear Mick,
Thanks for the information, I located how the bug happens and have implemented a fix. I have a few additional updates coming very soon which will include a bugfix for this issue. I hope to get it updated today or during next week.
Best, David
Thanks @davve2, keep up the excellent work and thanks for flextaxd!
Did you manage to fix the bug?
Yes, this should now be resolved with the latest release (v0.4.3). It was caused by two minor bugs: one added two roots on join, the other led to incorrect taxonomy levels. However, it is also important to remember to export the taxonomy using --dbprogram kraken2; otherwise the final database will not retain the information (Kraken trims at least one column from the right, and the default export format for flextaxd contains the node information in the last column).
I am trying to build a custom Kraken db. flextaxd creates an empty Kraken db and does not even fail.
I have the latest flextaxd 0.4.2 and kraken 2.1.2.
Could you help me to solve this issue?