FOI-Bioinformatics / nanometa_live

A streamlined workflow and GUI for real-time species identification and pathogen characterization via nanopore sequencing data. Engineered for precision, speed, and user-friendliness, with offline functionality post-initialization.
GNU General Public License v3.0

Enhance Data Preparation Workflow by Adding Genome Files Existence Check #53

Closed druvus closed 1 year ago

druvus commented 1 year ago

Description:

This PR introduces a new function, check_genome_files_existence, that verifies whether genome files are present in the workdir/data-files/genomes directory before species data is fetched from GTDB. The check avoids redundant fetching and processing: only species whose genome files are missing are fetched. The changes are primarily in file_utils.py and nanometa_prepare.py.

Key Changes:

  1. New Functionality:

    • Added check_genome_files_existence function in file_utils.py. This function checks for the existence of genome files in the specified directory and returns a list of species for which genome files are missing.
  2. Workflow Enhancement:

    • Integrated the check_genome_files_existence function within the main function of nanometa_prepare.py. The check is performed before the GTDB data fetching process.
    • If any genome files are missing, the species list is narrowed to the species they belong to, so the subsequent GTDB data fetching and processing steps run only for those species.
  3. Code Refactoring:

    • Refactored the main function to include the check for missing genome files, and adjusted the subsequent GTDB data fetching and genome downloading sections accordingly.
    • Improved code readability by organizing the sections and adding explanatory comments.
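
The check described above could look roughly like the sketch below. It is not the code from this PR. The function name and the workdir/data-files/genomes location come from the description, but the signature, the file-naming scheme (species name with spaces replaced by underscores), the accepted suffixes, and the species_to_fetch helper are all assumptions made for illustration.

```python
from pathlib import Path
from typing import Iterable, List, Union

# Assumed suffixes; the actual naming of genome files is not stated in this PR.
GENOME_SUFFIXES = (".fasta", ".fa", ".fna")


def check_genome_files_existence(
    species_list: Iterable[str], genomes_dir: Union[str, Path]
) -> List[str]:
    """Return the species that have no genome file in genomes_dir.

    Assumption: a genome file is named after the species, with spaces
    replaced by underscores, plus one of GENOME_SUFFIXES.
    """
    genomes_dir = Path(genomes_dir)
    missing = []
    for species in species_list:
        stem = species.strip().replace(" ", "_")
        found = any(
            (genomes_dir / f"{stem}{suffix}").is_file()
            for suffix in GENOME_SUFFIXES
        )
        if not found:
            missing.append(species)
    return missing


def species_to_fetch(species_list: Iterable[str], workdir: Union[str, Path]) -> List[str]:
    """Hypothetical helper: narrow the species list before the GTDB fetch."""
    genomes_dir = Path(workdir) / "data-files" / "genomes"
    return check_genome_files_existence(species_list, genomes_dir)
```

In this sketch, a main function would call species_to_fetch before the GTDB step and pass only the returned species on. If the genomes directory does not exist yet, every species is reported as missing, so a first run fetches everything.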

These changes make the data preparation workflow more efficient and easier to read, in line with the project's coding principles. The new check_genome_files_existence function is reusable and can be called from other parts of the project.