GenomeSPOT uses statistical models trained on data from phylogenetically diverse bacterial and archaeal isolates to predict:
| Condition | Units |
|---|---|
| Oxygen | "Tolerant" or "Not tolerant" (classification and probability) |
| Temperature | Optimum, minimum, and maximum in Celsius |
| Salinity | Optimum, minimum, and maximum in % w/v NaCl |
| pH | Optimum, minimum, and maximum in pH units |
Reference:
Barnum, T. P., Crits-Christoph, A., Molla, M., Carini, P., Lee, H. H., & Ostrov, N. Predicting microbial growth conditions from amino acid composition. bioRxiv 2024.03.22.586313. doi: https://doi.org/10.1101/2024.03.22.586313
Clone the repo, create a virtual environment, then install the package and its requirements:
git clone https://github.com/cultivarium/GenomeSPOT.git
cd GenomeSPOT
pip install .
pip install -r requirements.txt
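If you want the virtual environment step spelled out, here is a minimal sketch using Python's built-in venv module (run before the pip install commands; the environment name is illustrative):
python -m venv genomespot-env  # create an isolated environment
source genomespot-env/bin/activate  # activate it before installing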
Requirements:
Recommended:
Runtime: ~5-10 seconds per genome
python -m genome_spot.genome_spot --models models \
--contigs tests/test_data/GCA_000172155.1_ASM17215v1_genomic.fna.gz \
--proteins tests/test_data/GCA_000172155.1_ASM17215v1_protein.faa.gz \
--output GCA_000172155.1
Hint: if you only have a genome and need a protein FASTA, use prodigal.
gunzip genome.fna.gz                   # decompress (requires gzip)
prodigal -i genome.fna -a protein.faa  # predict proteins
gzip genome.fna                        # recompress the genome
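Alternatively, because Prodigal reads FASTA from stdin when -i is omitted, you can stream the decompression and leave the original file untouched (a sketch; the gene coordinates written to stdout are discarded):
gunzip -c genome.fna.gz | prodigal -a protein.faa > /dev/null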
Each prediction (e.g. optimum temperature) has a value, an estimated error, a novelty flag (is_novel), a warning field, and units.
Here is the output for the test genome:
                     value      error     is_novel  warning       units
temperature_optimum  22.953768  6.482357  False     None          C
temperature_max      31.301471  6.199418  False     None          C
temperature_min      5.645504   6.329401  False     None          C
ph_optimum           7.070681   0.909382  False     None          pH
ph_max               8.993682   1.306915  False     None          pH
ph_min               5.449215   0.923962  False     None          pH
salinity_optimum     0.200371   1.935106  False     None          % w/v NaCl
salinity_max         3.119676   2.361246  False     None          % w/v NaCl
salinity_min         0          1.182744  False     min_exceeded  % w/v NaCl
oxygen               tolerant   0.974255  False     None          probability
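Assuming predictions are written to a TSV named after the --output prefix (e.g. GCA_000172155.1.predictions.tsv; check your output directory for the exact name), the table can be inspected on the command line:
column -t -s $'\t' GCA_000172155.1.predictions.tsv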
tutorial.ipynb provides an interactive demonstration of the modules in this repo. Briefly: features are computed from a genome by the class Genome, which uses the classes Protein and DNA.
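To step through the tutorial interactively (assuming Jupyter is installed in your environment):
jupyter notebook tutorial.ipynb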
Predictions can be inaccurate:
Our models were built using phylogenetic balancing and partitioning to improve accuracy:
If you have made predictions for many genomes and want the results in a single table, a helper script joins the per-genome predictions into a single TSV (all.predictions.tsv). The first and second rows of the table indicate, respectively, the growth condition being predicted (e.g. temperature_optimum) and the type of data in the column (e.g. error). All other rows contain data for one genome each.
python3 -m genome_spot.join_outputs --dir data/predictions --write-to-tsv
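To sanity-check the join, you can view the two header rows plus the first genome (assuming all.predictions.tsv is written into the directory passed to --dir):
head -3 data/predictions/all.predictions.tsv | column -t -s $'\t'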
If you have tens of thousands of genomes and no convenient resources for parallelizing computing tasks, a simple option is to use a shell function pwait N to allow at most N processes at once. The example below runs 10 jobs at a time.
# Define pwait: block until fewer than $1 background jobs are running
function pwait() {
while [ $(jobs -p | wc -l) -ge $1 ]; do
sleep 1
done
}
# Run in parallel
INDIR='data/features'
OUTDIR='data/predictions'
for FEATURES_JSON in `ls $INDIR`; do
    PREFIX=$(echo $FEATURES_JSON | cut -d. -f1)  # accession before the first '.'
    echo $FEATURES_JSON $PREFIX
    python -m genome_spot.genome_spot --models models --genome-features $INDIR/$FEATURES_JSON --output $OUTDIR/$PREFIX > temp.txt &
    pwait 10
done
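If GNU parallel is available, the same throttling can be achieved without the pwait helper (a sketch, not the authors' workflow; {} is replaced by each filename):
ls data/features | parallel -j 10 'PREFIX=$(echo {} | cut -d. -f1); python -m genome_spot.genome_spot --models models --genome-features data/features/{} --output data/predictions/$PREFIX'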
If you are interested in replicating this work, you can use the provided modules and scientific notebooks. Note that the total workflow may involve several days of runtime. Python version 3.8.16 was used for the publication.
Data is downloaded from two resources: trait data from BacDive and genome sequences from NCBI GenBank. GTDB metadata is also downloaded to provide taxonomy.
Runtime: 4-8 hours.
# Create directory structure
mkdir data
mkdir data/training_data
mkdir data/genomes
mkdir data/references
# Download BacDive data
BACDIVE_USERNAME=my_username
BACDIVE_PASSWORD=my_password
MAX_BACDIVE_ID=171000 # UPDATE THIS OVER TIME!!!
python3 -m genome_spot.model_training.download_trait_data -u $BACDIVE_USERNAME -p $BACDIVE_PASSWORD \
--max $MAX_BACDIVE_ID \
-b data/training_data/bacdive_data.json \
-o data/training_data/trait_data.tsv \
-a genbank_accessions.txt
# Download genomes using the accession list created by the step above
ncbi-genome-download -s genbank -F 'fasta,protein-fasta' -o data/genomes -A genbank_accessions.txt 'bacteria,archaea'
# Download GTDB metadata to provide taxonomy
# needed for modeling correctly
wget https://data.gtdb.ecogenomic.org/releases/latest/ar53_metadata.tsv.gz
wget https://data.gtdb.ecogenomic.org/releases/latest/bac120_metadata.tsv.gz
mv *.tsv.gz data/references/.
gunzip data/references/*.tsv.gz
Measure features from genomes and join them with the target variables to be predicted (i.e. the trait data from BacDive) to create the training dataset.
Runtime: roughly two hours for ~10-20k genomes.
mkdir ./data/training_data/genome_features/
python3 -m genome_spot.model_training.make_training_dataset -p 7 \
--genomes-directory ./data/genomes/ \
-sfna .fna.gz -sfaa .faa.gz \
--features-directory ./data/training_data/genome_features/ \
--downloaded-traits ./data/training_data/trait_data.tsv \
--tsv-output ./data/training_data/training_data.tsv
The training dataset should be balanced phylogenetically and partitioned into different sets of taxa for testing and training.
The above script performs an automated curation step: only genomes for which the use_<condition> flag is True are used further. No additional curation is performed automatically, so you may want to remove suspect values yourself, as we did for the publication.
The function make_holdout_sets performs phylogenetic balancing and partitioning for test and cross-validation sets, writing:
- <path_to_holdouts>/train_set_<condition>.txt
- <path_to_holdouts>/<condition>_cv_sets.json, keyed by rank, where each value is a list of (training_indices, validation_indices) tuples

Runtime: 1-2 minutes.
# Create train, test, and cross-validation sets
python3 -m genome_spot.model_training.make_holdout_sets --training_data_filename data/training_data/training_data.tsv --path_to_holdouts data/holdouts/ --overwrite
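To check what was written, substitute a condition into the filename patterns above, e.g. temperature_optimum (a sketch; it prints the size of the training set and the number of cross-validation splits per rank):
wc -l data/holdouts/train_set_temperature_optimum.txt  # assuming one genome per line
python -c "import json; cv = json.load(open('data/holdouts/temperature_optimum_cv_sets.json')); print({rank: len(splits) for rank, splits in cv.items()})"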
In the preprint, we describe testing different sets of features (e.g. only amino acid frequencies) with different types of models (e.g. least squares, ridge, and lasso linear regressions). To reproduce that workflow, you can run the script run_model_selection.py. The sets of features and estimators are hard-coded into the script, so you will need to modify the code to try new sets of features or models.
Runtime: 10-20 minutes.
python3 -m genome_spot.model_training.run_model_selection --training_data_filename data/training_data/training_data.tsv --path_to_holdouts data/holdouts --outdir data/model_selection/
The script saves models and statistics to the output directory. The statistics can be assessed and used to select a model using the notebook evaluate_models.ipynb.
Model selection should result in a set of instructions specifying the features and estimator to be used for each condition. The features are given as a list of feature names, and the estimator is given as a path to a model saved by the above script. These instructions should be saved in a file in the directory where the models will be placed, like the existing models/instructions.json file. Importantly, the 'pipeline_filename' specified in instructions.json must be a path that is interpretable when running the train_models script below.
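For orientation only, an entry in instructions.json might look like the following sketch. The 'pipeline_filename' key is named above and the example path appears below; the per-condition structure is inferred and the feature names are elided, so consult the real models/instructions.json for the actual schema:
{
  "temperature_optimum": {
    "features": ["..."],
    "pipeline_filename": "./data/model_selection/temperature_optimum_features1_pipeline17-lasso.joblib"
  }
}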
With a selected model specified by models/instructions.json, you can produce the final version of each model. To be as representative as possible, these models are trained on data from both the training and test sets (both of which are phylogenetically balanced). For example, if the pipeline filename is './data/model_selection/temperature_optimum_features1_pipeline17-lasso.joblib', then train_models must be run from a directory in which data is an immediate subdirectory.
Runtime: 1-2 minutes.
python3 -m genome_spot.model_training.train_models --training_data_filename data/training_data/training_data.tsv --path_to_models models --path_to_holdouts data/holdouts
Scientific notebooks provided in notebooks assist with reproducing the analyses in the preprint. Some analyses involved genomes from the Genome Taxonomy Database (GTDB) and the JGI Genomic Catalog of Earth’s Microbiomes (GEM), which can be obtained from these repositories:
To run unit tests:
pytest -v -s tests/