FOI-Bioinformatics / nanometa_live

A streamlined workflow and GUI for real-time species identification and pathogen characterization via nanopore sequencing data. Engineered for precision, speed, and user-friendliness, with offline functionality post-initialization.
GNU General Public License v3.0

Enhancements to `file_utils.py` and `transform_utils.py` to implement mode gtdb-file #51

Closed druvus closed 11 months ago

druvus commented 11 months ago

Summary:

This PR extends `file_utils.py` and `transform_utils.py` to handle GTDB metadata. It streamlines downloading, reading, and processing the GTDB file, particularly when this mode is used together with other modes in `nanometa_prepare.py`.

Changes:

  1. Download GTDB Metadata: Added `download_gtdb_metadata()` to download the GTDB metadata files, and corrected the directory paths.

  2. Read and Process GTDB Metadata: Added `read_and_process_gtdb_metadata()`, which reads the GTDB metadata file, filters it against the Kraken2 taxonomy, and returns a DataFrame. It also logs the number of rows before and after filtering.

  3. Adding Tax IDs: Added `add_taxid_to_results()` in `transform_utils.py`, which maps species names to tax IDs and adds them to the DataFrame as a new column.

  4. Main Script Update: Added calls to these new functions in `nanometa_prepare.py` to integrate them into the existing workflow.
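
For readers unfamiliar with the pattern, the filtering and tax ID steps (items 2 and 3) can be sketched roughly as below. This is a minimal illustration using pandas, not the code from this PR: the helper names `filter_gtdb_by_kraken` and `add_taxid_column`, the `species` and `taxid` column names, and the sample rows are all assumptions made for the example.

```python
import logging

import pandas as pd

logger = logging.getLogger(__name__)


def filter_gtdb_by_kraken(gtdb_df, kraken_species, species_col="species"):
    """Keep only rows whose species is present in the Kraken2 taxonomy.

    Hypothetical helper; `species_col` is an assumed column name.
    """
    rows_before = len(gtdb_df)
    filtered = gtdb_df[gtdb_df[species_col].isin(kraken_species)].copy()
    # Log the row counts so a surprisingly small result is easy to spot.
    logger.info(
        "GTDB rows before filtering: %d, after: %d", rows_before, len(filtered)
    )
    return filtered


def add_taxid_column(df, species_to_taxid, species_col="species"):
    """Return a copy of `df` with an added 'taxid' column.

    Hypothetical helper; species missing from the mapping get <NA>.
    """
    out = df.copy()
    # The nullable Int64 dtype keeps IDs as integers even when some are missing.
    out["taxid"] = out[species_col].map(species_to_taxid).astype("Int64")
    return out


# Illustrative data only; real GTDB metadata has many more columns.
gtdb = pd.DataFrame(
    {
        "accession": ["A1", "A2", "A3"],
        "species": ["Escherichia coli", "Bacillus subtilis", "Unlisted species"],
    }
)
kraken_species = {"Escherichia coli", "Bacillus subtilis"}
species_to_taxid = {"Escherichia coli": 562, "Bacillus subtilis": 1423}

result = add_taxid_column(
    filter_gtdb_by_kraken(gtdb, kraken_species), species_to_taxid
)
```

Both helpers return a new DataFrame instead of modifying their input, so the unfiltered metadata stays available to other modes that may need it.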