FOI-Bioinformatics / nanometa_live

A streamlined workflow and GUI for real-time species identification and pathogen characterization via nanopore sequencing data. Engineered for precision, speed, and user-friendliness, with offline functionality post-initialization.
GNU General Public License v3.0

Refactor BLAST Database Building and Checking #47

Closed by druvus 1 year ago

druvus commented 1 year ago

Description:

This PR refactors the BLAST database management code to improve its efficiency and maintainability. The main goals are to build only the BLAST databases that are missing and to extend logging for better traceability.

Changes:
  1. build_blast_databases Function Update:

    • The function now accepts an additional argument missing_databases, which is a list of Tax IDs for which BLAST databases need to be built.
    • Added logic to skip building when the BLAST databases already exist or when missing_databases is empty.
    • Enhanced logging to include Tax IDs.
  2. New check_blast_dbs_exist Function:

    • This function checks for the existence of BLAST databases based on a given dictionary that maps species names to their Tax IDs.
    • Returns a list of missing databases by Tax ID.
    • Enhanced logging to include species names and Tax IDs.
  3. Main Script (nanometa_prepare.py) Update:

    • The main script now first checks for missing BLAST databases using check_blast_dbs_exist and then passes this information to build_blast_databases.
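
The check-then-build flow described above could look roughly like the sketch below. It is not the code from this PR: the signatures, the directory layout (one `<tax_id>.fasta` genome per Tax ID), the use of a `<tax_id>.nhr` file as the existence marker, and the `run_makeblastdb` and `builder` names are all assumptions made for illustration. Only the function names check_blast_dbs_exist and build_blast_databases and the missing_databases argument come from the description.

```python
import logging
import subprocess
from pathlib import Path
from typing import Callable, Dict, List

logger = logging.getLogger(__name__)


def check_blast_dbs_exist(species_to_taxid: Dict[str, str], blast_dir: Path) -> List[str]:
    """Return the Tax IDs whose BLAST database is missing from blast_dir."""
    missing = []
    for species, tax_id in species_to_taxid.items():
        # Assumption: one nucleotide database per Tax ID, detected via its .nhr file.
        if (blast_dir / f"{tax_id}.nhr").exists():
            logger.info("BLAST database found for %s (Tax ID %s)", species, tax_id)
        else:
            logger.warning("BLAST database missing for %s (Tax ID %s)", species, tax_id)
            missing.append(tax_id)
    return missing


def run_makeblastdb(fasta: Path, db_prefix: Path) -> None:
    """Hypothetical helper: build one nucleotide database with BLAST+ makeblastdb."""
    subprocess.run(
        ["makeblastdb", "-in", str(fasta), "-dbtype", "nucl", "-out", str(db_prefix)],
        check=True,
    )


def build_blast_databases(
    missing_databases: List[str],
    genome_dir: Path,
    blast_dir: Path,
    builder: Callable[[Path, Path], None] = run_makeblastdb,
) -> List[str]:
    """Build databases only for the Tax IDs in missing_databases; return those built."""
    if not missing_databases:
        logger.info("All BLAST databases are present; nothing to build.")
        return []
    blast_dir.mkdir(parents=True, exist_ok=True)
    built = []
    for tax_id in missing_databases:
        logger.info("Building BLAST database for Tax ID %s", tax_id)
        # Assumption: the genome for each Tax ID is stored as <tax_id>.fasta.
        builder(genome_dir / f"{tax_id}.fasta", blast_dir / tax_id)
        built.append(tax_id)
    return built
```

In a script such as nanometa_prepare.py, the two calls would then be chained: the list returned by check_blast_dbs_exist is passed as missing_databases to build_blast_databases. The injectable `builder` is only there so the sketch can be exercised without BLAST+ installed.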

With these changes, only the missing BLAST databases are built, which saves computational resources. The enhanced logging also helps with debugging and traceability.