A streamlined workflow and GUI for real-time species identification and pathogen characterization via nanopore sequencing data. Engineered for precision, speed, and user-friendliness, with offline functionality post-initialization.
GNU General Public License v3.0
Enhance Data Preparation Workflow by Adding Genome Files Existence Check #53
This PR adds a function, check_genome_files_existence, that verifies whether genome files are already present in the workdir/data-files/genomes directory before species data is fetched from GTDB. The check avoids redundant fetching and processing: only genome files that are missing are fetched. The changes are mainly in file_utils.py and nanometa_prepare.py.
Key Changes:
New Functionality:
Added check_genome_files_existence function in file_utils.py. This function checks for the existence of genome files in the specified directory and returns a list of species for which genome files are missing.
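The behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the parameter names, the file naming scheme (species name with spaces replaced by underscores), and the accepted extensions are all assumptions, since the description does not specify them.

```python
from pathlib import Path

# Assumed set of genome file extensions; the PR description does not list them.
GENOME_EXTENSIONS = (".fasta", ".fa", ".fna", ".fasta.gz", ".fa.gz", ".fna.gz")


def check_genome_files_existence(species_list, genomes_dir):
    """Return the species in species_list that have no genome file in genomes_dir.

    Assumption: a genome file is named after its species, with spaces
    replaced by underscores, followed by one of GENOME_EXTENSIONS.
    """
    genomes_dir = Path(genomes_dir)
    missing = []
    for species in species_list:
        stem = species.strip().replace(" ", "_")
        # A species counts as present if any candidate file exists.
        found = any(
            (genomes_dir / f"{stem}{ext}").is_file() for ext in GENOME_EXTENSIONS
        )
        if not found:
            missing.append(species)
    return missing
```

With this sketch, a directory that does not exist yet simply reports every species as missing, so a first run would fetch everything.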
Workflow Enhancement:
Integrated the check_genome_files_existence function within the main function of nanometa_prepare.py. The check is performed before the GTDB data fetching process.
If there are missing genome files, the species list is updated to include only those species. This ensures that the subsequent GTDB data fetching and processing steps are performed only for the missing genome files, optimizing the workflow.
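One way the integration might look is sketched below. The function name prepare_genomes and the check_missing and fetch_from_gtdb callables are hypothetical stand-ins for the real steps in nanometa_prepare.py, which are not shown in this description. Skipping the fetch entirely when nothing is missing is also an assumption.

```python
def prepare_genomes(species_list, genomes_dir, check_missing, fetch_from_gtdb):
    """Run the existence check, then fetch only the species that are missing.

    check_missing and fetch_from_gtdb are hypothetical callables standing in
    for the existence check and the GTDB fetching step.
    Returns the list of species that were passed on for fetching.
    """
    missing = check_missing(species_list, genomes_dir)
    if not missing:
        # Assumed behaviour: nothing is missing, so the fetch is skipped.
        return []
    # Narrow the species list to the missing entries before fetching.
    species_list = missing
    fetch_from_gtdb(species_list)
    return species_list
```

Passing the two steps in as arguments keeps the sketch self-contained and makes the narrowing logic easy to exercise without network access.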
Code Refactoring:
Refactored the main function to include the check for missing genome files, and adjusted the subsequent GTDB data fetching and genome downloading sections accordingly.
Improved code readability by organizing the sections and adding explanatory comments.
These changes make the data preparation workflow more efficient and easier to read, in line with the project's coding principles. The new check_genome_files_existence function is self-contained and can be reused in other parts of the project.