cultivarium / GenomeSPOT

Predict oxygen, temperature, salinity, and pH preferences of bacteria and archaea from a genome
https://cultivarium.org/
MIT License

GenomeSPOT: Genome-based Salinity, pH, Oxygen Tolerance, and Temperature for Bacteria and Archaea

Uses statistical models trained on data for phylogenetically diverse bacterial and archaeal isolates to predict:

Condition     Units
Oxygen        "Tolerant" or "Not tolerant" (classification and probability)
Temperature   Optimum, minimum, and maximum in degrees Celsius
Salinity      Optimum, minimum, and maximum in % w/v NaCl
pH            Optimum, minimum, and maximum in pH units

Reference:

Predicting microbial growth conditions from amino acid composition. Tyler P. Barnum, Alexander Crits-Christoph, Michael Molla, Paul Carini, Henry H. Lee, Nili Ostrov. bioRxiv 2024.03.22.586313; doi: https://doi.org/10.1101/2024.03.22.586313

Quick start

1. Install package

Clone the repo, create a virtual environment, then install the package and its requirements:

git clone https://github.com/cultivarium/GenomeSPOT.git
cd GenomeSPOT
pip install .
pip install -r requirements.txt

Requirements: the Python packages listed in requirements.txt (installed above).

Recommended: prodigal, for generating protein FASTA files from genome FASTA files (see the hint in step 2).

2. Run prediction

Runtime: ~5-10 seconds per genome

python -m genome_spot.genome_spot --models models \
    --contigs tests/test_data/GCA_000172155.1_ASM17215v1_genomic.fna.gz \
    --proteins tests/test_data/GCA_000172155.1_ASM17215v1_protein.faa.gz \
    --output GCA_000172155.1

Hint: if you only have a genome and need a protein FASTA, use prodigal.

gunzip genome.fna.gz # decompress the genome FASTA
prodigal -i genome.fna -a protein.faa # get proteins
gzip genome.fna

3. Interpret output

Each prediction (e.g. optimum temperature) has a value, an estimated error, an is_novel flag, a warning (if any), and units.

Here is the output for the test genome:

                         value     error is_novel       warning        units
temperature_optimum  22.953768  6.482357    False          None            C
temperature_max      31.301471  6.199418    False          None            C
temperature_min       5.645504  6.329401    False          None            C
ph_optimum            7.070681  0.909382    False          None           pH
ph_max                8.993682  1.306915    False          None           pH
ph_min                5.449215  0.923962    False          None           pH
salinity_optimum      0.200371  1.935106    False          None   % w/v NaCl
salinity_max          3.119676  2.361246    False          None   % w/v NaCl
salinity_min                 0  1.182744    False  min_exceeded   % w/v NaCl
oxygen                tolerant  0.974255    False          None  probability
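
As a minimal sketch (not part of the tool), one way to work with a single prediction is to treat the reported error as a rough plus-or-minus range around the value; the numbers below are taken from the temperature_optimum row of the example output above, and the field names follow the table columns:

# Interpret one prediction: report value +/- error and surface any caveats.
prediction = {
    "value": 22.95,
    "error": 6.48,
    "is_novel": False,
    "warning": None,
    "units": "C",
}

low = prediction["value"] - prediction["error"]
high = prediction["value"] + prediction["error"]
print(f"temperature_optimum: {low:.1f}-{high:.1f} {prediction['units']}")
if prediction["warning"] or prediction["is_novel"]:
    print("Treat this prediction with extra caution.")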

Tutorial

tutorial.ipynb provides an interactive demonstration of the modules in this repo.

Key considerations

Predictions can be inaccurate, so consider the reported error, warning, and is_novel fields when interpreting results.

Our models were built using phylogenetic balancing and partitioning to improve accuracy (see the model training steps below).

How to use with multiple genomes

If you have made predictions for many genomes and want the results in a single table, a helper script joins predictions for multiple genomes into a single TSV (all.predictions.tsv). The first and second rows of the table indicate, respectively, the growth condition being predicted (e.g. temperature_optimum) and the type of data in the column (e.g. error). Each remaining row contains the data for one genome.

python3 -m genome_spot.join_outputs --dir data/predictions --write-to-tsv
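
A minimal sketch for loading the joined table in Python: the two header rows described above map naturally onto a pandas MultiIndex; the assumption that the first column holds the genome identifier may need adjusting.

import pandas as pd

# Rows 1-2 of all.predictions.tsv are the condition and the data type,
# so they become a two-level column index; assume column 0 is the genome ID.
df = pd.read_csv("all.predictions.tsv", sep="\t", header=[0, 1], index_col=0)

# Select one column, e.g. the predicted optimum temperature for every genome
print(df[("temperature_optimum", "value")].head())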

If you have tens of thousands of genomes and no convenient resources for parallelizing computing tasks, a simple option is to define a shell function pwait N that waits until fewer than N background jobs are running. The example below runs 10 jobs at once.

# Define pwait
function pwait() {
    while [ $(jobs -p | wc -l) -ge $1 ]; do
        sleep 1
    done
}

# Run in parallel
INDIR='data/features'
OUTDIR='data/predictions'
for FEATURES_JSON in $(ls $INDIR); do
    PREFIX=$(echo $FEATURES_JSON | cut -d. -f1)
    echo $FEATURES_JSON $PREFIX
    python -m genome_spot.genome_spot --models models --genome-features $INDIR/$FEATURES_JSON --output $OUTDIR/$PREFIX > temp.txt &
    pwait 10
done
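
If you prefer Python, the same pattern can be sketched with a thread pool that runs up to 10 predictions at once; paths mirror the shell example, and the assumption that feature files end in .json may need adjusting.

import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

INDIR = Path("data/features")
OUTDIR = Path("data/predictions")

def predict(features_json: Path) -> None:
    # Derive the output prefix from the file name and run one prediction
    prefix = features_json.name.split(".")[0]
    subprocess.run(
        [
            "python", "-m", "genome_spot.genome_spot",
            "--models", "models",
            "--genome-features", str(features_json),
            "--output", str(OUTDIR / prefix),
        ],
        check=True,
    )

# Run up to 10 genome_spot processes concurrently
with ThreadPoolExecutor(max_workers=10) as pool:
    list(pool.map(predict, sorted(INDIR.glob("*.json"))))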

How to repeat model training and evaluation

If you are interested in replicating this work, you can use the provided modules and scientific notebooks. Note that the total workflow may involve several days of runtime. Python version 3.8.16 was used for the publication.

1. Download data for training

Data is downloaded from two resources: trait data from BacDive and genome assemblies from NCBI GenBank (via ncbi-genome-download). GTDB metadata is also downloaded to provide taxonomy.

Runtime: 4-8 hours.

# Create directory structure
mkdir data
mkdir data/training_data
mkdir data/genomes
mkdir data/references

# Download BacDive data
BACDIVE_USERNAME=my_username
BACDIVE_PASSWORD=my_password
MAX_BACDIVE_ID=171000 # UPDATE THIS OVER TIME!!!
python3 -m genome_spot.model_training.download_trait_data -u $BACDIVE_USERNAME -p $BACDIVE_PASSWORD \
    --max $MAX_BACDIVE_ID \
    -b data/training_data/bacdive_data.json \
    -o data/training_data/trait_data.tsv \
    -a genbank_accessions.txt

# Download genomes using
# list created by above function
ncbi-genome-download -s genbank -F 'fasta,protein-fasta' -o data/genomes -A genbank_accessions.txt 'bacteria,archaea'

# Download GTDB metadata to provide taxonomy
# needed for modeling correctly
wget https://data.gtdb.ecogenomic.org/releases/latest/ar53_metadata.tsv.gz
wget https://data.gtdb.ecogenomic.org/releases/latest/bac120_metadata.tsv.gz
mv *.tsv.gz data/references/.
gunzip data/references/*.tsv.gz

2. Generate training dataframe

Measure features from genomes and join them with the target variables to be predicted (i.e. trait data from BacDive) to create the training dataset.

Runtime: roughly two hours for ~10-20k genomes.

mkdir ./data/training_data/genome_features/
python3 -m genome_spot.model_training.make_training_dataset -p 7 \
    --genomes-directory ./data/genomes/ \
    -sfna .fna.gz -sfaa .faa.gz \
    --features-directory ./data/training_data/genome_features/ \
    --downloaded-traits ./data/training_data/trait_data.tsv \
    --tsv-output ./data/training_data/training_data.tsv

3. Create train, test, and cross-validation sets

The training dataset should be balanced phylogenetically and partitioned into different sets of taxa for testing and training.

The script in step 2 performs an automated curation step: only genomes for which the use_<condition> flag is True are used further. No additional automated curation occurs, so you may want to manually remove suspect values, as we did for the publication.

The function make_holdout_sets performs phylogenetic balancing and partitioning for test and cross-validation sets:

  1. Genomes are balanced and partitioned at the family level into training and test sets for each condition being predicted. Genome accessions are recorded in files like <path_to_holdouts>/train_set_<condition>.txt.
  2. Genomes in the training set are further divided for cross-validation. In each fold of a cross-validation, a different set of genomes is held out of model training and used to score the model. The default script performs 5-fold cross-validation for each taxonomic rank (phylum, class, order, family, genus, and species). Genome accessions are stored in <path_to_holdouts>/<condition>_cv_sets.json, keyed by rank, with each value a list of (training_indices, validation_indices) tuples; see the sketch after the command below.

Runtime: 1-2 minutes.

# Create train, test, and cross-validation sets
python3 -m genome_spot.model_training.make_holdout_sets --training_data_filename  data/training_data/training_data.tsv --path_to_holdouts data/holdouts/ --overwrite
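
Once generated, the cross-validation sets can be inspected with a short script like this (a sketch; the condition name temperature_optimum and the holdouts path are assumptions, adjust to your files):

import json

# Load the cross-validation sets written by make_holdout_sets
with open("data/holdouts/temperature_optimum_cv_sets.json") as fh:
    cv_sets = json.load(fh)

# Keys are taxonomic ranks; each value is a list of
# (training_indices, validation_indices) pairs, one per fold
for rank, folds in cv_sets.items():
    for i, (train_idx, val_idx) in enumerate(folds):
        print(f"{rank} fold {i}: {len(train_idx)} train / {len(val_idx)} validation")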

4. Train a variety of models and select the best model

In the preprint, we describe testing different sets of features (e.g. only amino acid frequencies) with different types of models (e.g. least squares, ridge, and lasso linear regressions). To reproduce that workflow, you can run the script run_model_selection.py. The sets of features and estimators are hard-coded into the script, so you cannot try new sets of features or models without modifying the code.

Runtime: 10-20 minutes.

python3 -m genome_spot.model_training.run_model_selection --training_data_filename data/training_data/training_data.tsv --path_to_holdouts data/holdouts --outdir data/model_selection/

The script saves models and statistics to the output directory. The statistics can be assessed and used to select a model with the notebook evaluate_models.ipynb.

Model selection should result in a set of instructions specifying the features and estimator to be used for each condition. The features are specified as a list of feature names, and the estimator is specified by the path to a model saved by the above script. These instructions should be saved in a file in the directory where the models will be placed, like the existing models/instructions.json file. Importantly, the 'pipeline_filename' specified in instructions.json must be a path that is interpretable when running the train_models script below.
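
For illustration only, an entry could be assembled as sketched below; only the pipeline_filename key is named in this README, so the "features" key, the feature names, and the overall layout are assumptions. Check the existing models/instructions.json for the real schema.

import json

# Hypothetical instructions entry for one condition. The feature names are
# placeholders; pipeline_filename points at a model saved by
# run_model_selection (path taken from the example in step 5).
instructions = {
    "temperature_optimum": {
        "features": ["aa_A", "aa_R", "aa_N"],  # placeholder feature names
        "pipeline_filename": "./data/model_selection/temperature_optimum_features1_pipeline17-lasso.joblib",
    }
}

print(json.dumps(instructions, indent=2))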

5. Produce the final model using all available data

With a selected model specified by models/instructions.json, you can produce the final version of each model. To be as representative as possible, these models are trained on data from both the training and test sets (both of which are phylogenetically balanced). For example, if the pipeline filename is './data/model_selection/temperature_optimum_features1_pipeline17-lasso.joblib', then train_models must be run from a directory in which data is an immediate subdirectory.

Runtime: 1-2 minutes

python3 -m genome_spot.model_training.train_models --training_data_filename data/training_data/training_data.tsv --path_to_models models --path_to_holdouts data/holdouts

6. Perform analyses described in preprint

Scientific notebooks provided in notebooks assist with reproducing analyses in the preprint.

Some analyses involved genomes from the Genome Taxonomy Database (GTDB) and the JGI Genomic Catalog of Earth's Microbiomes (GEM), which can be obtained from their respective repositories.

Unit Tests

To run unit tests:

pytest -v -s tests/