Building Kraken2 compatible GTDB only database

HitMonk commented 3 years ago

Hello, Thank you so much for this tool! I have been using it most of today to merge the taxonomies of archaea and bacteria from GTDB and protozoa, viruses, fungi and chlorophyta genomes from NCBI into one database. However, I seem to have issues: 1) merge these files: Since i have no archaea and bacterial sequences from NCBI Command: flextaxd -db db_file_FTD/NCBI_GTDB_merge.db -md db_file_FTD/ar122_gtdb.db --parent Archaea --replace --verbose --logs NCBI_GTDB_merge_log/ar122_rep.log Error: only the final line modules.database.DatabaseConnection.NameError: 'Name not found in the database! Archaea'

Does this mean that i have to download all the genomes for archaea and bacteria for NCBI too before i merge it with GTDB?

2) I also have another issue when creating kraken2 compatible 16s database from GTDB sequences. I only have 2 files for this data: a) A single fasta file with 16s sequences from Archaea and Bacterial genomes as shown below: `>Archaea;Crenarchaeota;Korarchaeia;Korarchaeales;Korarchaeaceae;Korarchaeum;Korarchaeum_cryptofilum(RS_GCF_000019605.1) GGTTGATCCTGCCGGAGGGAACCCCTATCGGGCTCGCA ……..

Archaea;Crenarchaeota;Thermoprotei;Desulfurococcales;Desulfurococcaceae;Staphylothermus;Staphylothermus_hellenicus(RS_GCF_000092465.1) AGGTGATCCAGCCGCAGGTTCCCCTACGGCTACCTTGTT …….`

b) Taxonomy file, format as shown below: Archaea;Crenarchaeota;Korarchaeia;Korarchaeales;Korarchaeaceae;Korarchaeum;Korarchaeum_cryptofilum(RS_GCF_000019605.1) Archaea;Crenarchaeota;Thermoprotei;Desulfurococcales;Desulfurococcaceae;Staphylothermus;Staphylothermus_hellenicus(RS_GCF_000092465.1) Archaea;Halobacterota;Methanosarcinia;Methanosarcinales;Methanosarcinaceae;Methanosarcina;Methanosarcina_mazei(RS_GCF_000970205.1) Archaea;Euryarchaeota;Methanobacteria;Methanobacteriales;Methanobacteriaceae;Methanobacterium;Methanobacterium_formicicum_A(RS_GCF_000302455.1) Archaea;Crenarchaeota;Thermoprotei;Thermofilales;Thermofilaceae;Thermofilum_A;Thermofilum_A_uzonense(RS_GCF_000993805.1) Archaea;Halobacterota;Halobacteria;Halobacteriales;Haloferacaceae;Halonotius;(GB_GCA_000416025.1) Archaea;Crenarchaeota;Nitrososphaeria;Nitrososphaerales;Nitrosopumilaceae;Nitrosopumilus;(GB_GCA_000484975.1) Archaea;Crenarchaeota;Thermoprotei;Desulfurococcales;Ignicoccaceae;Ignicoccus_A;Ignicoccus_A_hospitalis(UBA8868) Archaea;Halobacterota;Methanomicrobia;Methanomicrobiales;Methanocullaceae;Methanoculleus;Methanoculleus_sp2(UBA8928) I was wondering if it was possible to use these two files to generate a kraken2 compatible 16s database?

davve2 commented 3 years ago

Hello HitMonk, thank you!

Let´s see if we can find a solution without any changes to the code first.

For your first issue: There are two solutions, either

Create a dummy database with root - Bacteria and root- Archaea, or perhaps more simple
Follow the walkthrough to create an NCBi database and clean the NCBI database using --taxonomy_type NCBI(names.dmp and nodes.dmp) see Walkthough section 2.1 and 2.3. Specifying taxonomy type on clean will force the database to keep the top three leves of nodes (which includes Archaea and Bacteria).

However if you do want to have genomes annotated to any other node in the NCBI database (that you wish not to replace) the easiest solution is to add them when building your NCBI database. How to do this is also described in the walkthrough 2.1 (You can limit the number of annotations as the program will only annotate genomes in your input folder, not all sequences in the NCBI nucl_gb.accession2taxid).

For the second issue: we will have to modify your files slightly, there is a possibility to import your tree directly (removing the genomes at the end) and using --taxonomy_type CanSNPer.

Move your annotations to a separate file genomeid2taxid.txt (see Formats for details. Short example
```
genome_id\tnode_name
GCF_000019605.1\tKorarchaeum_cryptofilum  
```
Then remove the genome ids from your tree file and load your tree using --taxonomy_type CanSNPer. Add your genome annotation using --genomeid2taxid geomeid2taxid.txt

flextaxd --database archaea_custom.db --taxonomy_file <path_to_tree> --taxonomy_type CanSNPer --genomeid2taxid <id_file>

After this you can follow Section 3 in the https://github.com/FOI-Bioinformatics/flextaxd/wiki/Walkthrough---merge-NCBI-with-GTDB on how to merge your databases.

HitMonk commented 3 years ago

Thank you so much for the detailed explanation. From what i can tell, of all the tools that have come out so far Flextaxd is one of the more easier ways to solve this taxonomy discrepancy. Ill work on your tips over the next few days and get back to you.

HitMonk commented 3 years ago

Hello, apologies for reopening this issue again, but i still havent managed to build kraken2 compatible 16s GTDB database. I wanted to revisit this since it looks like both Flextaxd and the GTDB have been now updated. the format of my files have also changed.

I still have only two files and i have formatted them according to the previous discussion: 1) A taxonomy file example: Archaea;Halobacteriota;Methanosarcinia;Methanosarcinales;Methanosarcinaceae;Methanosarcina;Methanosarcina mazei Archaea;Thermoproteota;Thermoproteia;Sulfolobales;Sulfolobaceae;Sulfolobus;Sulfolobus acidocaldarius Archaea;Halobacteriota;Bog-38;Bog-38;Bog-38;Bog-38;Bog-38 sp003162175 Archaea;Thermoproteota;Thermoproteia;Sulfolobales;Sulfolobaceae;Saccharolobus;Saccharolobus islandicus Archaea;Thermoplasmatota;Thermoplasmata;Thermoplasmatales;Thermoplasmataceae;Cuniculiplasma;Cuniculiplasma divulgatum 2) A fasta file with all the 16s sequences that i would like to use to build the database `>GB_GCA_000010565_1 Pelotomaculum thermopropionicum GAGAGTTTGATCCTGGCTCAGGACGAACGCTGGCGGCGTGCCTAACACATGCAAGTCGAGCGGTTCGGCACTCAGCCACTA...........

GB_GCA_000018565_1 Herpetosiphon aurantiacus AGGAGGTGATCCAGCCACACCTTCCGGTACGGCTACCTTGTTACGACTTCGTCCCAGTTACGAGCCCCACCCTCGGCCGCT ......`

3)Genomeid2taxid file: genome_id\tnode_name GB_GCA_000010565_1\tPelotomaculum thermopropionicum

Here is what i have done so far: 1) Create flextaxd database from both the files that i have flextaxd --database 16GTDB.db --taxonomy_file tree_file.txt --taxonomy_type CanSNPer --genomeid2taxid g2tax.txt

but i get an error at this stage: `Traceback (most recent call last): File "/exports/watson/Prateek/apps/conda/envs/flextaxd/lib/python3.6/site-packages/flextaxd/modules/ReadTaxonomyCanSNPer.py", line 95, in parse_taxonomy parent_i = self.taxonomy[pname] ## Check if parent of child node exists KeyError: 'Escherichia'

During handling of the above exception, another exception occurred:

Traceback (most recent call last): File "/exports/watson/Prateek/apps/conda/envs/flextaxd/lib/python3.6/site-packages/flextaxd/modules/ReadTaxonomyCanSNPer.py", line 72, in add_SNP parent_i = self.taxonomy[nodes[i-1]] ## check if current nodes parent exists KeyError: 'Enterobacteriaceae' `

Do you have any suggestions on how i can resolve this? also, going forward, the I see that after i create the flextaxd database, i will use that to create the kraken2 database. However, in that step, the command asks for a path with all the genomes in it. I only have 1 fasta file with all the 16s sequences. would this be a problem?

davve2 commented 3 years ago

Hello HitMonk, no worries at all. FlexTaxD is still under development and I´m happy to get feedback on problems that can occur outside the area I´m using FlexTaxD for myself. So feel free to raise any issues that you have!

The error occur during the build sequence of the taxonomy tree. This will be the first thing that happens.

I tried with the example tree you supplied and I have no problems building that tree.
From what I can see in the error message it seems most likely that your tree structure is inconsistent?
[x] Is it possible to post the lines in your tree file with Escherichia and/or Enterobacteriaceae nodes.

The genome_id must match the name of your fasta file on disk (not the header) maybe this is what you did but you didn´t specify the name of the file in the example and since the error occured earlier I´m not sure this will be an error.
- However, the names of your files in the example annotation file that you gave will have to be GB_GCA_000010565_1.fna/fasta/fa in the --genome_path directory when running flextaxd-build.

Second reading your first message again I wonder if it is enough to merge the existing GTDB databases with the official NCBI database? If so I would suggest that you follow the procedure from start in the walkthrough. If you want to keep the only the protozoa, viruses, fungi and chlorophyta simply have only those genomes in the input genome_path directory (or choose to only download the relevant groups) when building the NCBI database.

HitMonk commented 3 years ago

Here are the lines when i pull out only the Escherichia names (flextaxd) prateek@t620:/exports/watson/Prateek/dbs/16s_gtdb$ grep "Escherichia" tree_file.txt Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia_flexneri Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia_coli Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia_dysenteriae Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia_coli_D Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia_albertii Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia_marmotae Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia_coli_C Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia_sp005843885 Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia_sp000208585 Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia_fergusonii Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia_sp001660175 Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia_sp004211955 Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia_sp002965065

Also i got my files from here, if you would like the original files. I also tried the original taxonomy file (GTDB_arc_bact_taxo.txt) with the QIIME tag instead of the CanSNPer tag with the command: flextaxd --database 16GTDB.db --taxonomy_file GTDB_arc_bact_taxo.txt --taxonomy_type QIIME --genomeid2taxid g2tax.txt

Since the taxonomy files are in the QIIME format i thought it might be alright to try it. But i receive the same error that i get with the CanSNPer tag, without the error for Escherichia and/or Enterobacteriaceae nodes. I missed posting this earlier, but both runs have the same ending error message after the error about the nodes.

flextaxd --database 16GTDB.db --taxonomy_file GTDB_arc_bact_taxo.txt --taxonomy_type QIIME --genomeid2taxid g2tax.txt Traceback (most recent call last): File "/exports/watson/Prateek/apps/miniconda3/envs/flextaxd/bin/flextaxd", line 10, in <module> sys.exit(main()) File "/exports/watson/Prateek/apps/miniconda3/envs/flextaxd/lib/python3.6/site-packages/flextaxd/custom_taxonomy_databases.py", line 252, in main read_obj.parse_taxonomy() ## Parse taxonomy file File "/exports/watson/Prateek/apps/miniconda3/envs/flextaxd/lib/python3.6/site-packages/flextaxd/modules/ReadTaxonomyQIIME.py", line 61, in parse_taxonomy self.qiime_to_tree() File "/exports/watson/Prateek/apps/miniconda3/envs/flextaxd/lib/python3.6/site-packages/flextaxd/modules/ReadTaxonomyQIIME.py", line 130, in qiime_to_tree taxonomy_i = self.parse_tree(taxonomy) File "/exports/watson/Prateek/apps/miniconda3/envs/flextaxd/lib/python3.6/site-packages/flextaxd/modules/ReadTaxonomyQIIME.py", line 66, in parse_tree level,description = self.parse_description(tree,current_i) File "/exports/watson/Prateek/apps/miniconda3/envs/flextaxd/lib/python3.6/site-packages/flextaxd/modules/ReadTaxonomyQIIME.py", line 94, in parse_description level, description = current_level.split("__") ValueError: not enough values to unpack (expected 2, got 1)

Also, i would like to use only 16s sequences without any genomes in the database. I also wont be adding in the protozoa, viruses and fungi as genomes. I did that separately when i was building the genome database. I would like to build this as a 16s standalone database, similar to the Greengenes, RDP and silva databases.

This also reminded me that the database that i built with FlextaxD and the genomes of viruses, protozoa and fungi worked well. but it gave me an issue when i tried to run the bracken program on it. Bracken can build the database, however it gives an a no reads found in the report file error. I dont receive this error when i run the same bracken command but with a database built with struo. Struo has been updated too, ill try to see if the error persists and set up a post with all the details in that.

davve2 commented 3 years ago

Hi, thanks for the link, this helps

The files from that resource are not on standard QIIME, nor CanSNPer format which is why the load is failing.

I also understand the error now running your example. The main problem is that it is a discontinued tree with no common root! For the database to load you will need to add a root which is common to all nodes for example

root;Archaea;Thermoproteota;Thermoproteia;Sulfolobales;Sulfolobaceae;Sulfolobus;Sulfolobus acidocaldarius
root;Archaea;Halobacteriota;Bog-38;Bog-38;Bog-38;Bog-38;Bog-38 sp003162175
root;Archaea;Thermoproteota;Thermoproteia;Sulfolobales;Sulfolobaceae;Saccharolobus;Saccharolobus islandicus
root;Archaea;Thermoplasmatota;Thermoplasmata;Thermoplasmatales;Thermoplasmataceae;Cuniculiplasma;Cuniculiplasma divulgatum
root;Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia_flexneri
root;Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia_coli
root;Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia_dysenteriae
root;Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia_coli_D
root;Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia_albertii
root;Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia_marmotae
root;Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia_coli_C
root;Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia_sp005843885
root;Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia_sp000208585
root;Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia_fergusonii
root;Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia_sp001660175
root;Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia_sp004211955
root;Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia_sp002965065

flextaxd -db test.db --visualise_node Gammaproteobacteria --vis_type newick_vis
                                           __________ Escherichia_sp004211955
                                          |
                                          |__________ Escherichia_coli_C
                                          |
                                          |__________ Escherichia_coli
                                          |
                                          |__________ Escherichia_sp002965065
                                          |
                                          |__________ Escherichia_sp005843885
                                          |
                                          |__________ Escherichia_dysenteriae
                                          |
_ _________ _________ __________ _________|__________ Escherichia_sp000208585
                                          |
                                          |__________ Escherichia_coli_D
                                          |
                                          |__________ Escherichia_sp001660175
                                          |
                                          |__________ Escherichia_fergusonii
                                          |
                                          |__________ Escherichia_albertii
                                          |
                                          |__________ Escherichia_flexneri
                                          |
                                          |__________ Escherichia_marmotae

flextaxd -db test.db --visualise_node Archaea --vis_type newick_vis
                      __________ _________ _________ __________ Bog-38
            _________|
           |         |__________ _________ _________ __________ Methanosarcina
           |
_ _________|                                         __________ Saccharolobus
           |_________ __________ _________ _________|
           |                                        |__________ Sulfolobus
           |
           |_________ __________ _________ _________ __________ Cuniculiplasma

It looks to me though that they simply have the GTDB database but transformed into DADA2 format? If so the easiest solution is to use the GTDB files directly (https://data.ace.uq.edu.au/public/gtdb/data/releases/release95/95.0/ar122_taxonomy_r95.tsv and https://data.ace.uq.edu.au/public/gtdb/data/releases/release95/95.0/bac120_taxonomy_r95.tsv, to which the QIIME load module in flextaxd has been adapted to?

However, If there is unique data (or changes in the annotations) that is only available in DADA2 format we will neet to do some changes.

One option is to add a DADA2 load module, update faulty annotations from GTDB using a genomeid2taxid file (and the same name as GTDB (minus GB/RF this is stripped from the name) and/or an option to build the GTDB database without the annotations and then add the annotations (with custom annotation) later.

For the kraken2 database build part we will need an update (which I have been working on) that can use one file (instead of one per genome_id) this is however not yet ready for upload. The other option is to split the source file into one file per GCF (with the GCF numer as file name). And then just keep the sequence headers in the file, flextaxd will map the header in each file to that annotation.

A few questions

Are there custom annotations from this site that you need to add to the GTDB taxonomy tree?
Are there incorrect annotations in the GTDB annotations? (otw it doesn´t hurt that they are there)
What is your source for the protozoa, viruses, fungi and chlorophyta genome background?
What version of flextaxd are you using? (flextaxd --version)

HitMonk commented 3 years ago

Hello! Thank you for this. So according to this, the tree file needs to be modified to have "root" term before Archaea and bacteria and that should solve the node error issue? Also, I can split the 16s GTDB file into individual sequences, that shouldnt be an issue.

However, I dont think i can use the whole genomes to build the 16s database. People have tried it before and it seems to not work as well. Here is a link that i came across. So, i need to build a 16s kraken2 database with only 16s genes. This is the DADA2 formatted 16s GTDB single fasta file. I also dont think there are any changes in the annotation compared to the GTDB database.

I have reached out to the kraken2 authors and they said that currently they havent made it compatible with 16s GTDB sequences, and only Greengenes, Silva and RDP are available as compatible 16s databases.

Are there custom annotations from this site that you need to add to the GTDB taxonomy tree? No i dont think there are any custom annotations. As i mentioned, i also wont be using the entire GTDB genome database, only the DADA2 formatted 16s sequences. the only difference in annotations is that there are fewer species but otherwise, they should be very similar.

Are there incorrect annotations in the GTDB annotations? (otw it doesn´t hurt that they are there) Im not sure about this. By incorrect annotation, do you mean new species that arent in the whole genome GTDB annotations? Im sure there are no new species, just fewer species.

What is your source for the protozoa, viruses, fungi and chlorophyta genome background? For the whole genome database, i used GTDB for archeal and bacterial genome data and NCBI genomes for Viruses, protozoa, fungi and chlorophyta. Just to be a little more specific at the expense of being redundant;

I initially built the whole genome database with all the genomes (GTDB+NCBI) -- this works really well with kraken2. However, it gives an error that no reads are found when i try to run bracken on the kraken2 report file -- i will recreate this with new version of conda and a fresh install of kraken2 and bracken.
The other database that i have tried and repeatedly failed to build is the 16s database. This will make use of only the 16s DADA2 formatted sequences which in turn are pulled from the GTDB whole genomes. I have a single fasta file (which i will split into individual fasta files today) and a tree file (to which i will add the term "root" before archaea and bacteria). Ill return with more updates shortly.

What version of flextaxd are you using? (flextaxd --version) custom_taxonomy_databases: version 0.3.5, I also made a clean install this week and updated my conda to 4.9.2. all the posts from the last two days are with the updated conda and newly installed flextad software.

Finally, thank you for being so patient with this! I am not sure where else to turn to. So far, no one has built 16s compatible databases for kraken2 using these sequences, and as I mentioned the makers of kraken2 also arent really focused currently on making it compatible. These are QIIME2 compatible, but i would rather use Kraken2 than QIIME2. Or atleast have the option to use kraken2.

davve2 commented 3 years ago

Are there custom annotations from this site that you need to add to the GTDB taxonomy tree? That is great so you will be able to use the tree directly from GTDB

Are there incorrect annotations in the GTDB annotations? (otw it doesn´t hurt that they are there) That is also good, there is no problems having "additional" nods (or annotations to them for that matter) in the tree since the kraken2 database will only consider nodes with supplied sequence.

FlexTaxD (in default mode) will only use the local sequences supplied in the genome input directory --genomes_path where it matches the annotation "genome_id" from your genomeid2taxid file with -> "genome_id(.fna,fasta,fa)" (where genome_id can be full_name_of_file (for example my_custom_genome_file.fasta) with this match it will create the kraken seq2id.map for all sequences (starting with >) in each file. This means that if you supply a file named GCF_000XXX.fna as the genome name, but it only contains the 16S sequence the database will be built from that sequence only.

the task of the genomeid2taxid file is to add the link between a file with sequences to which node in the kraken2 database it should go.	genome_id	taxid
my_custom_genome_file	Escherichia_albertii

It is for example possible to combine your background with whole genome sequences and limit the Bacteria/Archaea part as 16S sequences only.

What is your source for the protozoa, viruses, fungi and chlorophyta genome background?

This is a slightly different issue? But I think this may be related to this issue https://github.com/FOI-Bioinformatics/flextaxd/issues/16#issuecomment-717608801. I implemented an option to export "--dump --dbprogram bracken" which will add a - in the annotation file. According to the user raising this issue it may solve your bracken problem! If you try this, please let me know if this is still a valid solution!
I will also require 16S only databases in the future, so I will spend some time adding additional functions required to complete this task

What version of flextaxd are you using? (flextaxd --version) Perfect it is just good to know it is the same version so I can expect the same errors to show up locally 👍

No worries and as I mentioned above I will have to create some 16S databases in the future so I´m happy to implement this functionallity into FlexTaxD :)

HitMonk commented 3 years ago

Ok, so i built the files as discussed. I have also attached a file detailing all the commands that i used. I think itll help ensure that we are looking at the same file structures. Now i can build the Flextaxd database without any errors! (Hurray)

However, im not sure if the database is built correctly, because i get this message when i validate it:

(flextaxd) prateek@t620:/exports/watson/Prateek/dbs/16s_gtdb$ flextaxd --database 16sGTDB --validate

2021-02-19 13:47:05,473 custom_taxonomy_databases [INFO ]  FlexTaxD logging initiated!
2021-02-19 13:47:05,637 DatabaseConnection [INFO ]  custom.db opened successfully.
2021-02-19 13:47:05,637 DatabaseConnection [INFO ]  Get all database nodes
2021-02-19 13:47:05,642 DatabaseConnection [INFO ]  Get all database edges
2021-02-19 13:47:05,644 DatabaseConnection [INFO ]  Get all children from root node
2021-02-19 13:47:05,646 DatabaseConnection [INFO ]  Get tree edges from children
2021-02-19 13:47:05,648 DatabaseConnection [INFO ]  Get nodes from tree edges
2021-02-19 13:47:05,648 DatabaseConnection [INFO ]  Validate parents
2021-02-19 13:47:05,650 DatabaseConnection [INFO ]  Tree statistics
                                        Nodes: 0
                                        Links: 0
                                        Tree: n(0), l(0)
                                        LinkNodes: 0
                                        Parent_ok: True

2021-02-19 13:47:05,650 DatabaseConnection [INFO ]  Validation OK!
Validation comleted!

It says 0 nodes and links, is that correct? Shouldnt the nodes and links represent what we have in the tree?

I can see the tree when i built it though:

(flextaxd) prateek@t620:/exports/watson/Prateek/dbs/16s_gtdb$ flextaxd -db 16sGTDB.db --visualise_node Archaea --vis_type newick_vis 

                              ______ ______ QMVU01
                             |
                ______ ______|______ ______ B14-G2
               |             |
               |             |       ______ Methanodesulfokores
               |             |______|
               |                    |______ Korarchaeum
               |
               |                     ______ B3-G16
               |       ______ ______|

Finally, im at a loss on how i go about building the Kraken2 database. It gives me a usage error as follow:

(flextaxd) prateek@t620:/exports/watson/Prateek/dbs/16s_gtdb$ flextaxd --database 16sGTDB.db -o taxonomy/ --genomes_path genomes/ --dbprogram kraken2 --create_db --db_name 16s_GTDB_k2 --skip_download
usage: flextaxd [-h] [-db] [-o] [--dump] [--dump_mini] [--force] [--validate]
                [--stats] [-tf] [-tt] [--taxid_base] [-mf] [-md] [-gt] [-gp]
                [-p] [--replace] [--clean_database] [--dbprogram]
                [--dump_prefix] [--dump_sep] [--dump_descriptions]
                [--visualise_node] [--vis_type] [--vis_depth] [--logs]
                [--verbose] [--debug] [--supress] [--quiet]
flextaxd: error: unrecognized arguments: --create_db --db_name 16s_GTDB_k2 --skip_download

Also here are all the steps that i followed and files that i used.

commands.txt

davve2 commented 3 years ago

Hej, it looks like progress!

For the kraken building part you need to use flextaxd-create (separate program)

--skip_download is removed in version 0.3.5 and by default no downloads are made.

flextaxd --database 16sGTDB --validate Is it possible that the "0" nodes issue is because you did not add the .db in the command pasted in the above comment? Because from the visualisation it looks like 16sGTDB.db does contain nodes?

And thanks for the commands file!

I would suggest adding "uniq" to your tree file

cat tree_2.txt |sort |uniq > tree_2_uniq.txt
wc -l tree_2_uniq.txt
31910 tree_2_uniq.txt
wc -l tree_2.txt
194600 tree_2.txt

HitMonk commented 3 years ago

And that did it! I seem to have a working kraken2 16s GTDB database. It feels surreal and now, looking back, extremely easy to do. Thank you!

(flextaxd) prateek@t620:/exports/watson/Prateek/dbs/16s_gtdb$ kraken2-inspect -db 16s_GTDB_k2/
# Database options: nucleotide db, k = 35, l = 31
# Spaced mask = 11111111111111111111111111111111110011001100110011001100110011
# Toggle mask = 1110001101111110001010001100010000100111000110110101101000101101
# Total taxonomy nodes: 33067
# Table size: 0
# Table capacity: 2592182
# Min clear hash value = 0

I do have a few last questions: You were absolutely right, the validation failed because i missed the .db. I had 2 files with the same name, just one with the .db extension and one without. however, the validation is still giving me a little trouble:

(flextaxd) prateek@t620:/exports/watson/Prateek/dbs/16s_gtdb$ flextaxd --database 16sGTDB.db --validate
2021-02-19 14:44:01,813 custom_taxonomy_databases [INFO ]  FlexTaxD logging initiated!
2021-02-19 14:44:01,826 DatabaseConnection [INFO ]  16sGTDB.db opened successfully.
2021-02-19 14:44:01,827 DatabaseConnection [INFO ]  Get all database nodes
2021-02-19 14:44:01,938 DatabaseConnection [INFO ]  Get all database edges
2021-02-19 14:44:02,077 DatabaseConnection [INFO ]  Get all children from root node
2021-02-19 14:44:02,165 DatabaseConnection [INFO ]  Get tree edges from children
2021-02-19 14:44:02,174 DatabaseConnection [INFO ]  Get nodes from tree edges
2021-02-19 14:44:02,200 DatabaseConnection [INFO ]  Validate parents
2021-02-19 14:44:02,314 DatabaseConnection [INFO ]  Tree statistics
                                        Nodes: 44843
                                        Links: 44843
                                        Tree: n(1), l(1)
                                        LinkNodes: 44843
                                        Parent_ok: True

2021-02-19 14:44:02,321 DatabaseConnection [INFO ]  44842
Traceback (most recent call last):
  File "/exports/watson/Prateek/apps/miniconda3/envs/flextaxd/bin/flextaxd", line 10, in <module>
    sys.exit(main())
  File "/exports/watson/Prateek/apps/miniconda3/envs/flextaxd/lib/python3.6/site-packages/flextaxd/custom_taxonomy_databases.py", line 198, in main
    db.validate_tree()
  File "/exports/watson/Prateek/apps/miniconda3/envs/flextaxd/lib/python3.6/site-packages/flextaxd/modules/database/DatabaseConnection.py", line 244, in validate_tree
    raise TreeError("The number of nodes and the number nodes under root does not match!")
modules.database.DatabaseConnection.TreeError: 'The number of nodes and the number nodes under root does not match!'

Finally, you also mentioned that i might have to change some flags if i need to build the bracken database. Do i just do something like this:

flextaxd --database 16sGTDB.db --dbprogram bracken-o taxonomy --dump

flextaxd --database 16sGTDB.db -o taxonomy/ --genomes_path genomes/ --dbprogram bracken --create_db --db_name 16s_GTDB_k2 --skip_download

If yes, how do i change the read length flags since bracken creates different databases based on different read lengths?

davve2 commented 3 years ago

Hej that sounds great! Check the kraken2-inspect result to see if you got a valid tree!

I´ve been looking into your commands however and i think there are a few problems (maybe only with the flextaxd database) I will have to look at these next week but it is great to have your data and the commands so I can follow what errors may occur!

The main reason is that the CanSNPer module in this case misses the root connection. (This is why validation fails) ´

For bracken I think it is only the names.dmp and nodes.dmp export that bracken needs when building that database. For kraken it does not seem to matter if you choose kraken2 or bracken export!

HitMonk commented 3 years ago

Oh! it looks like i might have celebrated a little too early. The kraken2-inspect doesnt give me a tree, it just stops after giving me the statistics. I also took a look at the taxonomy file after the dump command. the node.dmp file shows everything as "no rank"

(flextaxd) prateek@t620:/exports/watson/Prateek/dbs/16s_gtdb/taxonomy$ more nodes.dmp | head
1       |       1       |       no rank |       -       |
2       |       2       |       family  |       -       |
3       |       2       |       no rank |       -       |
4       |       3       |       no rank |       -       |
5       |       4       |       no rank |       -       |
6       |       5       |       no rank |       -       |
7       |       6       |       no rank |       -       |
8       |       7       |       no rank |       -       |
9       |       8       |       no rank |       -       |
10      |       5       |       no rank |       -       |

Ill have the whole genome databases ready by next week and hope to recreate any errors that crop up. Hope you have a great weekend!

davve2 commented 3 years ago

Thanks have a good weekend,

I will have some tests going during the weekend but I will be looking more deeply into it next week!

Yes that is true, the rank will not be correctly assigned from the CanSNPer import function (As CanSNPer mostly is used for levels below species), for qiime it is using g,s I will create a DB from your example and post the pipeline here during next week. I think I will be using the gtdb database directly

davve2 commented 3 years ago

Hej @HitMonk I´m still looking into this issue,

Right now there is a bit of a problem regarding some nodes as for example

GB_GCA_003697105.1      d__Bacteria;p__4572-55;c__4572-55;o__J002;f__J002;g__J002;s__J002 sp003697105
GB_GCA_002084765.1      d__Bacteria;p__4572-55;c__4572-55;o__4572-55;f__4572-55;g__4572-55;s__4572-55

has non unique parents. For the kraken database it does not really matter, but when merging databases and for other programs I need to implement a decent way of keeping the number of replicated parents. In your input file these are not either differentiated with p, o which makes it even more complicated.

I´m looking at a few potential solutions but it will take some more time than I initially planned.

davve2 commented 3 years ago

https://github.com/FOI-Bioinformatics/flextaxd/issues/37 https://github.com/FOI-Bioinformatics/flextaxd/issues/38

davve2 commented 3 years ago

Ok I seem to have found a solution for your problem (at least for the kraken2 database build), you will need to install the latest release 0.3.6

This is what I did

flextaxd -db rRNA_custom_tree -tf taxonomy_unique.txt -tt CanSNPer -o taxonomy --dump --dbprogram kraken2
flextaxd -db rRNA_custom_tree --genomeid2taxid g2tax.txt
flextaxd-create -db rRNA_custom_tree --create_db --db_name krakendb --genomes_path fasta_files --dbprogram kraken2 -o taxonomy --verbose

I´ve only bugtested using --test, but that at least build a valid database. I´m running the complete database to be sure but it usually works if the test function pass.

flextaxd-create -db rRNA_custom_tree.db --db_name krakendb --create_db --verbose --genomes_path data/fasta_files/ --processes 40 -o taxonomy
kraken2-build --build --db test_rRNA_database/krakendb  --threads 40
Creating sequence ID to taxonomy ID map (step 1)...
Sequence ID to taxonomy ID map complete. [0.422s]
Estimating required capacity (step 2)...
Estimated hash table requirement: 10415540 bytes
Capacity estimation complete. [0.995s]
Building database files (step 3)...
Taxonomy parsed and converted.
CHT created with 16 bits reserved for taxid.
Completed processing of 23091 sequences, 31017889 bp
Writing data to disk...  complete.
Database files completed. [10.321s]
Database construction complete. [Total: 11.815s]
2021-02-26 15:34:47,404 CreateKrakenDatabase [INFO ]  Create inspect file!
Database disk usage: 49M
After cleaning, database uses 13M
2021-02-26 15:34:49,647 CreateKrakenDatabase [INFO ]  kraken2 database created
2021-02-26 15:34:49,647 create_databases [INFO ]  --- Time summary  14 minutes 30.32746696472168 seconds---

natchaphon602 commented 2 years ago

Is it possible to build kraken2 database with latest release of GTDB?

davve2 commented 2 years ago

Dear @natchaphon602,

yes it is possible to build a kraken2 database on the latest release of GTDB. Follow the guidlines here

https://github.com/FOI-Bioinformatics/flextaxd/wiki/Walkthrough--merge-NCBI-and-GTDB

Follow the walkthrough. If you are not interested in anything outside GTDB then skip the first step including the use of the NCBI database.

/David

natchaphon602 commented 2 years ago

You mean skip anything include NCBI database?

davve2 commented 2 years ago

Correct, GTDB only doesn´t require any of the NCBI steps.

Create the flextaxd database in step 2.2 using GTDB. Then export the taxonomy (dbprogram kraken2). Continue with the steps using the flextaxd-build script. (Download genomes either through ncbi-genome-download manually) or using the download command in the wiki.

Let me know if you have any other questions!

/David

natchaphon602 commented 2 years ago

After 2.2 step, I got errors in later step. I'm still not figure it out how to fix them.

davve2 commented 2 years ago

If you could post the command that you are using and the errors here, I can help you to identify the problem.

/David

natchaphon602 commented 2 years ago

flextaxd -db databases/NCBI_GTDB_merge.db --clean_database --verbose --log NCBI_GTDB_merge_log (--taxonomy_type NCBI) -bash: syntax error near unexpected token `('

flextaxd -db databases/NCBI_GTDB_merge.db -md databases/ar122_gtdb.db --parent Archaea --replace --verbose --logs NCBI_GTDB_merge_log 2022-03-30 10:35:32,505 custom_taxonomy_databases [INFO ] FlexTaxD logging initiated! 2022-03-30 10:35:32,505 custom_taxonomy_databases [INFO ] Loading module: ModifyTree 2022-03-30 10:35:32,508 ModifyTree [INFO ] Modify Tree 2022-03-30 10:35:32,509 DatabaseConnection [INFO ] databases/NCBI_GTDB_merge.db opened successfully. 2022-03-30 10:35:32,509 ModifyTree [INFO ] Taxid base updated! 2022-03-30 10:35:32,510 DatabaseConnection [INFO ] databases/ar122_gtdb.db opened successfully. Traceback (most recent call last): File "/Users/natchaphonrajudom/anaconda3/envs/flextaxd/lib/python3.10/site-packages/flextaxd/modules/database/DatabaseConnection.py", line 796, in get_id res = self.query(QUERY).fetchone()[0] TypeError: 'NoneType' object is not subscriptable

During handling of the above exception, another exception occurred:

Traceback (most recent call last): File "/Users/natchaphonrajudom/anaconda3/envs/flextaxd/bin/flextaxd", line 10, in sys.exit(main()) File "/Users/natchaphonrajudom/anaconda3/envs/flextaxd/lib/python3.10/site-packages/flextaxd/custom_taxonomy_databases.py", line 294, in main modify_obj = modify_module(database=args.database, mod_file=args.mod_file, mod_database= args.mod_database,parent=args.parent,replace=args.replace,taxid_base=args.taxid_base) File "/Users/natchaphonrajudom/anaconda3/envs/flextaxd/lib/python3.10/site-packages/flextaxd/modules/ModifyTree.py", line 92, in init self.modsource = self.parse_modification(self.moddb,"database") File "/Users/natchaphonrajudom/anaconda3/envs/flextaxd/lib/python3.10/site-packages/flextaxd/modules/ModifyTree.py", line 255, in parse_modification self.parent_link = self.taxonomydb.get_parent(self.taxonomydb.get_id(self.parent)) File "/Users/natchaphonrajudom/anaconda3/envs/flextaxd/lib/python3.10/site-packages/flextaxd/modules/database/DatabaseConnection.py", line 799, in get_id raise NameError("Name not found in the database! {name}".format(name=name)) modules.database.DatabaseConnection.NameError: 'Name not found in the database! Archaea'

flextaxd -db databases/NCBI_GTDB_merge.db -md databases/bac120_gtdb.db --parent Bacteria --replace --verbose --logs NCBI_GTDB_merge_log 2022-03-30 10:34:49,936 custom_taxonomy_databases [INFO ] FlexTaxD logging initiated! 2022-03-30 10:34:49,936 custom_taxonomy_databases [INFO ] Loading module: ModifyTree 2022-03-30 10:34:49,939 ModifyTree [INFO ] Modify Tree 2022-03-30 10:34:49,940 DatabaseConnection [INFO ] databases/NCBI_GTDB_merge.db opened successfully. 2022-03-30 10:34:49,940 ModifyTree [INFO ] Taxid base updated! 2022-03-30 10:34:49,942 DatabaseConnection [INFO ] databases/bac120_gtdb.db opened successfully. Traceback (most recent call last): File "/Users/natchaphonrajudom/anaconda3/envs/flextaxd/lib/python3.10/site-packages/flextaxd/modules/database/DatabaseConnection.py", line 796, in get_id res = self.query(QUERY).fetchone()[0] TypeError: 'NoneType' object is not subscriptable

During handling of the above exception, another exception occurred:

Traceback (most recent call last): File "/Users/natchaphonrajudom/anaconda3/envs/flextaxd/bin/flextaxd", line 10, in sys.exit(main()) File "/Users/natchaphonrajudom/anaconda3/envs/flextaxd/lib/python3.10/site-packages/flextaxd/custom_taxonomy_databases.py", line 294, in main modify_obj = modify_module(database=args.database, mod_file=args.mod_file, mod_database= args.mod_database,parent=args.parent,replace=args.replace,taxid_base=args.taxid_base) File "/Users/natchaphonrajudom/anaconda3/envs/flextaxd/lib/python3.10/site-packages/flextaxd/modules/ModifyTree.py", line 92, in init self.modsource = self.parse_modification(self.moddb,"database") File "/Users/natchaphonrajudom/anaconda3/envs/flextaxd/lib/python3.10/site-packages/flextaxd/modules/ModifyTree.py", line 255, in parse_modification self.parent_link = self.taxonomydb.get_parent(self.taxonomydb.get_id(self.parent)) File "/Users/natchaphonrajudom/anaconda3/envs/flextaxd/lib/python3.10/site-packages/flextaxd/modules/database/DatabaseConnection.py", line 799, in get_id raise NameError("Name not found in the database! {name}".format(name=name)) modules.database.DatabaseConnection.NameError: 'Name not found in the database! Bacteria'

after 2.2 I try those command and these are results

natchaphon602 commented 2 years ago

I try 2.2, then 3, 4 and 5.1 but not success

natchaphon602 commented 2 years ago

Do you understand my problem? sorry, I never construct kraken database before. @davve2

davve2 commented 2 years ago

Hej,

It seems like you have pasted the commands directly into the terminal.

For example flextaxd -db databases/NCBI_GTDB_merge.db --clean_database --verbose --log NCBI_GTDB_merge_log (--taxonomy_type NCBI) -bash: syntax error near unexpected token `('

The command must be without parentesis
flextaxd -db databases/NCBI_GTDB_merge.db --clean_database --verbose --log NCBI_GTDB_merge_log --taxonomy_type NCBI

Since this will effect the downstream commands I think we need to restart the process from here. /David

natchaphon602 commented 2 years ago

I've already fixed about parenthesis, but downstream analysis I got an error in 3. and also 4.

davve2 commented 2 years ago

Hi, sorry for the delay, I was away last week.

Perhaps you could give me each step that is performed and show the result of the verbose output (use --verbose). If possible supply example source files.

Also it would be useful to have the exact environment from conda (export env) then it would be more easy to look for bugs.

/David

natchaphon602 commented 2 years ago

Hi, sorry, I have a question.

How to create dummy databases for fix "modules.database.DatabaseConnection.NameError: 'Name not found in the database! Archaea'"? Or should I fix with step 2.1 and 2.3 in the walkthrough? @davve2

natchaphon602 commented 2 years ago

Hi @davve2

I have a question from my working, I have an issue that if I clean the db (2.3) , I cannot merge both archaea and bacteria. But if I skip clean I can merge only archaea. Do you have something like this before?

davve2 commented 2 years ago

Dear @natchaphon602

I recently rebuild my databases here and also noticed this bug. I think I solved it, I´m currently bugtesting my latest version and hope to soon be able to release it. Try the branch v0.4.3 which has a solution for that problem. Observe that if the two nodes Archaea and Bacteria are empty when cleaning the database the parameter -tt NCBI has to be passed with the clean command.

/David

natchaphon602 commented 2 years ago

So you mean that I have to use earlier version of flextaxd to create my databases?

davve2 commented 2 years ago

No a newer version, it is currently only a branch. If you stand in the tab on github you can see "branches" to the right of "master". Select branch v0.4.3. I hope to get some time next week to bugtest that version and create a new release. But right now I made some changes that could cause unexpected problems (hopefully not) so I want to make sure at least standard procedures are working without problems.

FOI-Bioinformatics / flextaxd

Building Kraken2 compatible GTDB only database #33