FOI-Bioinformatics / flextaxd

FlexTaxD (Flexible Taxonomy Databases) - Create, add, merge different taxonomy sources (QIIME, GTDB, NCBI and more) and create metagenomic databases (kraken2, ganon and more)
GNU General Public License v3.0

Issues with cleaning and merging NCBI+GTDB database #44

Closed HitMonk closed 3 years ago

HitMonk commented 3 years ago

Hello, I was updating my GTDB database to release 202 and am hitting a few snags. The errors start appearing at section 2.3, Clean the database. They don't show up in the logs, which I have attached to this post (FlexTaxD-Jul-14-2021.log). I have also attached a separate file with all the errors that do crop up (flextaxD_Error.txt). The taxonomy file generated when exporting to NCBI format (section 4) is just an empty file.

Any suggestions on working through this would be greatly appreciated!

davve2 commented 3 years ago

Hej,

I will have a thorough look at this tomorrow. I could not replicate the issue when running through the walkthrough on my system. However, I was using a small set of genomes for the test.

Which version of flextaxd are you using?

davve2 commented 3 years ago

Good morning,

I created a large database and could replicate the problem. I have made some changes to how a link in the database is interpreted since version 4.1, and this seems to have created an issue with removing nodes from the NCBI database.

However, there is a solution: adding the --taxonomy_type NCBI parameter during the clean not only keeps some nodes at the top of the tree, it also uses a less naive algorithm that makes sure to walk from each node all the way up to the root. This version of the clean takes a little longer, but it does the job. The main problem is the many extra levels in the NCBI taxonomy tree for Eukaryotes. With default settings it won't walk far enough up the tree to reach the root node.
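To illustrate the difference described above, here is a minimal Python sketch. It is not FlexTaxD's actual code. The function names, the toy tree, and the `max_steps` cap are all hypothetical; the cap is only an assumed stand-in for a walk that stops before it reaches the root, and the lineage is simplified compared with the real NCBI taxonomy.

```python
def walk_to_root(node, parent_of, max_steps=None):
    """Follow parent links from node towards the root.

    parent_of maps each node to its parent; the root maps to itself.
    max_steps is a hypothetical cap used to mimic a walk that gives up early.
    """
    path = [node]
    steps = 0
    while node in parent_of and parent_of[node] != node:
        if max_steps is not None and steps >= max_steps:
            break  # stopped before reaching the root
        node = parent_of[node]
        path.append(node)
        steps += 1
    return path


def nodes_to_keep(annotated, parent_of, max_steps=None):
    """Collect every node that lies on a path from an annotated node upwards."""
    keep = set()
    for node in annotated:
        keep.update(walk_to_root(node, parent_of, max_steps))
    return keep


# Toy tree: a shallow bacterial branch and a deeper eukaryotic branch.
parent_of = {
    "root": "root",
    "Bacteria": "root",
    "Escherichia": "Bacteria",
    "Eukaryota": "root",
    "Opisthokonta": "Eukaryota",
    "Fungi": "Opisthokonta",
    "Dikarya": "Fungi",
    "Ascomycota": "Dikarya",
    "Saccharomyces": "Ascomycota",
}
annotated = ["Escherichia", "Saccharomyces"]

full = nodes_to_keep(annotated, parent_of)                 # walks all the way up
capped = nodes_to_keep(annotated, parent_of, max_steps=3)  # gives up after 3 links

# Ancestors that a clean based on the capped walk would wrongly discard.
lost = sorted(full - capped)
print(lost)
```

In this toy tree the shallow bacterial branch reaches the root either way, while the capped walk on the deeper eukaryotic branch stops at "Fungi", so the ancestors above it would be treated as unused.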

I added this parameter to the example in the workflow to ensure this problem does not occur during the walkthrough.

In the example above, if you did not create the backup database (before the cleaning step), you will have to rebuild the database from scratch. If a copy was made, retrieve the copy and rerun --clean_database with the additional --taxonomy_type NCBI parameter.
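As a rough illustration of the backup-and-restore step, here is a small Python sketch. It assumes the database is a single file on disk; the helper names and the ".bak" suffix are hypothetical and not part of FlexTaxD.

```python
import shutil
from pathlib import Path


def backup_database(db_path, suffix=".bak"):
    """Copy the database file next to itself before a destructive step."""
    db_path = Path(db_path)
    backup_path = db_path.with_name(db_path.name + suffix)
    shutil.copy2(db_path, backup_path)
    return backup_path


def restore_database(db_path, suffix=".bak"):
    """Overwrite the database file with its backup copy, if one exists."""
    db_path = Path(db_path)
    backup_path = db_path.with_name(db_path.name + suffix)
    if not backup_path.exists():
        raise FileNotFoundError(f"no backup found at {backup_path}")
    shutil.copy2(backup_path, db_path)
    return db_path
```

After restoring the copy, the clean can be rerun with --clean_database and --taxonomy_type NCBI as described above.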

HitMonk commented 3 years ago

Hello @davve2, I did have a pretty big dataset which included fungi and chlorophyta. I think that might have caused the issue. I ran the commands again and they seem to work just fine. Thank you for your support!