Matteopaluh / KEMET

KEGG Module Evaluation Tool

add_taxonomy_from_gtdb-tk.py - help! #23

Open aksha19n opened 1 month ago

aksha19n commented 1 month ago

I am trying to run this script, but it keeps returning this message: "The genomes.instruction file has been updated with 0 genome(s) taxonomy indications, using '.fasta' extension". Could you please tell me if there is anything I can do to fix it?

Matteopaluh commented 1 month ago

Hello!

To reply properly I'd need a little more information, such as:

Best, Matteo

aksha19n commented 1 month ago

Hi Matteo,

I installed KEMET on a UNIX system through conda and ran the script add_taxonomy_from_gtdb-tk.py. I ran my genomes through the classify microbes with GTDB-Tk-v2 3.2 workflow available on KBase. The output files from that were used to run the gtdb to ncbi majority vote script, which gave me a .tsv file containing the ID number, GTDB classification and NCBI classification. I made sure that the sample/ID names are the same in the .tsv file and the genomes.instruction file before running the add taxonomy script.

Hope this helps

Thank you!

Matteopaluh commented 1 month ago

Thanks for the extra details!

I've only tested the script on input obtained with the gtdb-tk command line, so a difference could arise from that. The same goes for the gtdb-to-ncbi script, which depends on a specific version of the GTDB database. As it stands, the add_taxonomy_from_gtdb-tk.py script was working with the 2022 "GTDB R07-RS207" release, as well as the 2022 NCBI taxonomy.

I can't rule out that major changes in taxonomy have happened since then (I remember some changes regarding Firmicutes to Bacillota, maybe?). This would require fixing the correspondence from NCBI to KEGG BRITE taxonomy.

Otherwise, my suspicion would be the file extension of your genome/MAG files (whether it is .fasta, .fa, or .fna), as it is required by the script in question and specified through the -f argument when running it.
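A quick way to test the extension hypothesis is to compare the genome IDs in the taxonomy table with the file names on disk. The following is only a minimal sketch, not the logic used by add_taxonomy_from_gtdb-tk.py; the helper name `split_by_extension` and the example IDs are hypothetical.

```python
from pathlib import Path


def split_by_extension(genome_ids, filenames, ext=".fasta"):
    """Return (matched, unmatched) genome IDs for one file extension.

    A genome ID counts as matched when a file named <ID><ext> exists
    in `filenames`. This is an illustrative check, not KEMET's own code.
    """
    if not ext:
        raise ValueError("ext must be a non-empty string such as '.fasta'")
    # Collect the base names of files that end with the requested extension.
    stems = {
        Path(name).name[: -len(ext)]
        for name in filenames
        if name.endswith(ext)
    }
    matched = [g for g in genome_ids if g in stems]
    unmatched = [g for g in genome_ids if g not in stems]
    return matched, unmatched


# Hypothetical example: one genome uses .fasta, the other .fna.
ids = ["MAG_001", "MAG_002"]
files = ["MAG_001.fasta", "MAG_002.fna"]
matched, unmatched = split_by_extension(ids, files, ".fasta")
print("matched:", matched)
print("unmatched:", unmatched)
```

If every ID ends up in the unmatched list, the extension passed with -f probably differs from the one the files actually carry.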

Best regards, Matteo

aksha19n commented 1 month ago

Hi Matteo,

Thank you! The file extensions and names match in the genomes.instruction file and the output file from GTDB. I downloaded the metadata files for r207, ran the gtdb to ncbi script, and used its output file to run the add_taxonomy script, and it worked. However, when I ran the kemet.py code I ran into an error:

File "kemet.py", line 781, in taxonomy_filter
    for line in v[i_start+1:]:
UnboundLocalError: local variable 'i_start' referenced before assignment

Could you kindly guide me with this error?

Matteopaluh commented 1 month ago

> Hi Matteo, Thank you! The file extensions and names match in the genomes.instruction file and the output file from GTDB. I downloaded the metadata files for r207 and ran the gtdb to ncbi script and used the output file from that to run the add_taxonomy and it worked.

Nice to know! Could you specify what you did precisely? This could serve as a temporary fix until I modify a few things 🙃

Right now I've seen that KEGG BRITE was updated to reflect the changes in the NCBI taxonomy, as expected; therefore it will take a couple of checks to bring the add_taxonomy script up to date.

> However, when i ran the kemet.py code i ran into an error
> File "kemet.py", line 781, in taxonomy_filter
>     for line in v[i_start+1:]:
> UnboundLocalError: local variable 'i_start' referenced before assignment
>
> Could you kindly guide me with this error?

Do you have the KEGG BRITE file br08601.keg in your working folder? This should be downloaded automatically when setting the working folder via the set_kemet_working-directory.py script.
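To check for the file quickly, something along these lines could be used. It is a minimal sketch that assumes br08601.keg sits directly in the working folder (adjust the path if your setup stores it in a subfolder); the function name `brite_file_status` is hypothetical and not part of KEMET.

```python
from pathlib import Path


def brite_file_status(workdir, filename="br08601.keg"):
    """Report whether the BRITE file is missing, empty, or present."""
    path = Path(workdir) / filename
    if not path.is_file():
        return "missing"
    # A zero-byte file usually points to an interrupted download.
    if path.stat().st_size == 0:
        return "empty"
    return "present"


# "." stands for the current directory; replace it with your working folder.
print(brite_file_status("."))
```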

If it is missing, running that script should put it there. If it is already present, I'll need to check whether the file is still formatted the way it was in 2022.
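For context on the traceback: in Python, an UnboundLocalError of this kind typically appears when a variable is assigned only inside a branch that never runs, for instance when a search over a file's lines finds no matching entry. The following is a minimal sketch of that failure mode and of a more explicit alternative. It is not the actual code from kemet.py; the function names and the sample lines are hypothetical and do not reproduce the real br08601.keg layout.

```python
def slice_after_marker_fragile(lines, marker):
    # i_start is assigned only if some line contains the marker.
    for i, line in enumerate(lines):
        if marker in line:
            i_start = i
            break
    # With no match, i_start was never bound: UnboundLocalError is raised here.
    return lines[i_start + 1:]


def slice_after_marker_safe(lines, marker):
    # Initialise the index so that "not found" can be detected explicitly.
    i_start = None
    for i, line in enumerate(lines):
        if marker in line:
            i_start = i
            break
    if i_start is None:
        raise ValueError(
            f"marker {marker!r} not found; the input file may be missing, "
            "empty, or formatted differently than expected"
        )
    return lines[i_start + 1:]


# Hypothetical sample lines using the newer phylum name.
sample = ["A Bacteria", "B  Bacillota", "C   Bacillus"]

print(slice_after_marker_safe(sample, "Bacillota"))

try:
    slice_after_marker_fragile(sample, "Firmicutes")
except UnboundLocalError as err:
    print("fragile version failed:", type(err).__name__)
```

The second variant does not fix a mismatch between taxonomy names, but it turns the failure into a message that points at the input file rather than at an internal variable.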

Best regards, Matteo