GenoML / genoml2

GenoML (genoml2) is an open source Python package. It is an automated machine learning (autoML) platform for genomics data
Apache License 2.0

GenoML

Updated 07 November 2024: Latest Release on pip! v1.0.1

How to Get Started with GenoML

Introduction

GenoML (Genomics + Machine Learning) is an automated machine learning (autoML) package for genomics data. In general, use Linux or macOS with Python >3.5 for best results. This repository and pip package are under active development!

This README is a brief look into how to structure arguments and what arguments are available at each phase for the GenoML CLI.

If you are using GenoML for your own work, please cite the following papers:

Installing + Downloading Example Data

git clone https://github.com/GenoML/genoml2.git

pip install genoml2

OR

pip install genoml2 --upgrade

svn export https://github.com/GenoML/genoml2.git/trunk/examples

Note: When you pip install this package, the examples/ folder is also downloaded! However, if you still want to download the directory separately and SVN is not installed, you can install it with Homebrew (if you have Homebrew) using brew install svn

CHANGELOG

Table of Contents

0. (OPTIONAL) How to Set Up a Virtual Environment via Conda

1. Munging with GenoML

2. Training with GenoML

3. Tuning with GenoML

4. Testing/Validating with GenoML

5. Experimental Features

0. [OPTIONAL] How to Set Up a Virtual Environment via Conda

You can create a virtual environment to run GenoML, if you prefer. If you already have the Anaconda Distribution, this is fairly simple.

To create and activate a virtual environment:

# To create a virtual environment
conda create -n GenoML python=3.7

# To activate a virtual environment
conda activate GenoML

# To install requirements via pip 
pip install -r requirements.txt
    # If issues installing xgboost from requirements - (3 options)
        # use Homebrew to install gcc
            # xcode-select --install
            # brew install gcc@7
        # conda install -c conda-forge xgboost 
        # pip install xgboost==0.90
    # If issues installing umap 
        # pip install umap-learn 

## MISC
# To deactivate the virtual environment
# conda deactivate

# To delete your virtual environment
# conda env remove -n GenoML

To install GenoML on the user's path in a virtual environment, you can do the following:

# Install the package at this path
pip install .

# MISC
    # To save out the environment requirements to a .txt file
# pip freeze > requirements.txt

    # Removing a conda virtualenv
# conda remove --name GenoML --all 

1. Munging with GenoML

Munging with GenoML will, at minimum, do the following:

Required arguments for GenoML munging are --prefix and --pheno

Be sure to have your files formatted the same as the examples, key points being:

Note: The following examples are for discrete data, but if you replace discrete with continuous in the following commands, you can preprocess your continuous data!

If you would like to munge just with genotypes (in PLINK binary format), the simplest command is the following:

# Running GenoML munging on discrete data using PLINK genotype binary files and a phenotype file 

genoml discrete supervised munge \
--prefix outputs \
--geno examples/discrete/training \
--pheno examples/discrete/training_pheno.csv

If you would like to control the pruning stringency in genotypes:

# Running GenoML munging on discrete data using PLINK genotype binary files and a phenotype file, with a custom pruning r2 cutoff

genoml discrete supervised munge \
--prefix outputs \
--geno examples/discrete/training \
--r2_cutoff 0.3 \
--pheno examples/discrete/training_pheno.csv
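
To give a feel for what an r2 cutoff does, here is a minimal sketch of greedy pairwise pruning in Python. It is an illustration only and not GenoML's or PLINK's actual routine (PLINK works in sliding windows); the function name greedy_r2_prune and the toy genotypes are made up for this example.

```python
import numpy as np

def greedy_r2_prune(genotypes: np.ndarray, r2_cutoff: float) -> list:
    """Return indices of the columns kept after greedy pairwise r^2 pruning.

    A column is kept only if its squared Pearson correlation with every
    previously kept column is below r2_cutoff. Assumes no constant columns.
    """
    r2 = np.corrcoef(genotypes, rowvar=False) ** 2
    kept = []
    for j in range(genotypes.shape[1]):
        if all(r2[j, k] < r2_cutoff for k in kept):
            kept.append(j)
    return kept

# Toy genotypes coded 0/1/2 (hypothetical data, not the GenoML examples).
rng = np.random.default_rng(0)
snp_a = rng.integers(0, 3, size=500).astype(float)
snp_b = rng.integers(0, 3, size=500).astype(float)
snp_c = snp_a.copy()
snp_c[:10] = rng.integers(0, 3, size=10)  # near-duplicate of snp_a

genotypes = np.column_stack([snp_a, snp_b, snp_c])
kept = greedy_r2_prune(genotypes, r2_cutoff=0.3)
```

A lower cutoff removes more SNPs, because weaker correlations are already enough to drop a variant.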

You can skip pruning your SNPs at this stage by setting the --skip_prune flag to "yes" (default is "no"):

# Running GenoML munging on discrete data using PLINK genotype binary files and a phenotype file, skipping SNP pruning

genoml discrete supervised munge \
--prefix outputs \
--geno examples/discrete/training \
--skip_prune yes \
--pheno examples/discrete/training_pheno.csv

You can choose to impute on mean or median by modifying the --impute flag, like so (default is median):

# Running GenoML munging on discrete data using PLINK genotype binary files and a phenotype file and specifying impute

genoml discrete supervised munge \
--prefix outputs \
--geno examples/discrete/training \
--pheno examples/discrete/training_pheno.csv \
--impute mean
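
As a rough illustration of the difference between the two options, the sketch below fills missing values column by column with pandas. It assumes an all-numeric table; the helper impute_missing and the toy columns are hypothetical and are not part of GenoML.

```python
import numpy as np
import pandas as pd

def impute_missing(df: pd.DataFrame, strategy: str = "median") -> pd.DataFrame:
    """Fill missing values in each numeric column with that column's mean or median."""
    if strategy not in ("mean", "median"):
        raise ValueError("strategy must be 'mean' or 'median'")
    fill_values = df.mean() if strategy == "mean" else df.median()
    return df.fillna(fill_values)

# Toy feature table (hypothetical values); NaN marks a missing entry.
toy = pd.DataFrame({
    "snp_a": [0.0, 1.0, np.nan, 2.0, 2.0],
    "age": [60.0, np.nan, 70.0, 80.0, 90.0],
})
imputed_median = impute_missing(toy, "median")
imputed_mean = impute_missing(toy, "mean")
```

The median is less sensitive to outliers, while the mean uses every observed value, so the two can give different fills for skewed columns such as snp_a here.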

If you suspect collinear variables, and think this will be a problem for training the model moving forward, you can use variance inflation factor (VIF) filtering:

# Running GenoML munging on discrete data using PLINK genotype binary files and a phenotype file while using VIF to remove multicollinearity 

genoml discrete supervised munge \
--prefix outputs \
--geno examples/discrete/training \
--pheno examples/discrete/training_pheno.csv \
--vif 5 \
--iter 1
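
For readers new to VIF, here is a minimal NumPy sketch of the textbook definition, VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing feature j on all other features. The drop-the-worst-feature loop is one common scheme and is an assumption for illustration; it is not necessarily how GenoML applies --vif and --iter.

```python
import numpy as np

def variance_inflation_factors(X: np.ndarray) -> np.ndarray:
    """Return VIF_j = 1 / (1 - R_j^2) for each column j. Assumes no constant columns."""
    n_samples, n_features = X.shape
    vifs = np.empty(n_features)
    for j in range(n_features):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n_samples), others])  # add intercept
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        residuals = y - design @ coef
        r_squared = 1.0 - (residuals @ residuals) / ((y - y.mean()) ** 2).sum()
        vifs[j] = np.inf if r_squared >= 1.0 else 1.0 / (1.0 - r_squared)
    return vifs

def drop_high_vif(X: np.ndarray, names: list, threshold: float) -> list:
    """Drop the feature with the largest VIF until none exceeds the threshold."""
    names = list(names)
    while X.shape[1] > 1:
        vifs = variance_inflation_factors(X)
        worst = int(np.argmax(vifs))
        if vifs[worst] <= threshold:
            break
        X = np.delete(X, worst, axis=1)
        del names[worst]
    return names

# Toy data (hypothetical): feature "c" is almost a copy of feature "a".
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
c = a + 0.05 * rng.normal(size=200)
X = np.column_stack([a, b, c])

vifs = variance_inflation_factors(X)
remaining = drop_high_vif(X, ["a", "b", "c"], threshold=5.0)
```

Dropping one feature at a time keeps one member of each collinear pair, whereas removing every feature above the threshold at once would discard both.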

Well, what if you had GWAS summary statistics handy, and would like to just use the same SNPs outlined in that file? You can do so by running the following:

# Running GenoML munging on discrete data using PLINK genotype binary files, a phenotype file, and a GWAS summary statistics file 

genoml discrete supervised munge \
--prefix outputs \
--geno examples/discrete/training \
--pheno examples/discrete/training_pheno.csv \
--gwas examples/discrete/example_GWAS.csv

Note: When using the --gwas flag, the PLINK binaries will be filtered to keep only the SNPs that match those in the GWAS file.

...and if you wanted to add a p-value cut-off...

# Running GenoML munging on discrete data using PLINK genotype binary files, a phenotype file, and a GWAS summary statistics file with a p-value cut-off 

genoml discrete supervised munge \
--prefix outputs \
--geno examples/discrete/training \
--pheno examples/discrete/training_pheno.csv \
--gwas examples/discrete/example_GWAS.csv \
--p 0.01
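
Conceptually, a p-value cut-off just shortens the SNP list taken from the summary statistics. The pandas sketch below shows the idea; the column names SNP and p, the strict less-than comparison, and the helper select_gwas_snps are assumptions for illustration, so check your own file's header and GenoML's behaviour before relying on them.

```python
import pandas as pd

def select_gwas_snps(gwas: pd.DataFrame, p_cutoff: float,
                     snp_col: str = "SNP", p_col: str = "p") -> list:
    """Return SNP IDs whose p-value is below p_cutoff (column names are assumed)."""
    keep = gwas[gwas[p_col] < p_cutoff]
    return keep[snp_col].tolist()

# Toy summary statistics (hypothetical values).
toy_gwas = pd.DataFrame({
    "SNP": ["rs1", "rs2", "rs3", "rs4"],
    "p": [0.001, 0.2, 0.009, 0.05],
})
selected = select_gwas_snps(toy_gwas, p_cutoff=0.01)
```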

Do you have additional data you would like to incorporate? Perhaps clinical, demographic, or transcriptomics data? If coded and all numerical, these can be added as an --addit file by doing the following:

# Running GenoML munging on discrete data using PLINK genotype binary files, a phenotype file, and an addit file

genoml discrete supervised munge \
--prefix outputs \
--geno examples/discrete/training \
--pheno examples/discrete/training_pheno.csv \
--addit examples/discrete/training_addit.csv

You also have the option of not using PLINK binary files if you would like to just preprocess (and then, later train) on a phenotype and addit file by doing the following:

# Running GenoML munging on discrete data using only a phenotype file and an addit file (no PLINK genotype binary files)

genoml discrete supervised munge \
--prefix outputs \
--pheno examples/discrete/training_pheno.csv \
--addit examples/discrete/training_addit.csv

Are you interested in selecting and ranking your features? If so, you can use the --feature_selection flag and specify like so...:

# Running GenoML munging on discrete data using PLINK genotype binary files, a phenotype file, and an addit file, and running feature selection

genoml discrete supervised munge \
--prefix outputs \
--geno examples/discrete/training \
--pheno examples/discrete/training_pheno.csv \
--addit examples/discrete/training_addit.csv \
--feature_selection 50

The --feature_selection flag uses extraTrees (classifier for discrete data; regressor for continuous data) to output a *.approx_feature_importance.txt file with the features most contributing to your model at the top.
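
To show what an extraTrees-based ranking looks like, here is a small scikit-learn sketch. It assumes the integer passed to --feature_selection plays the role of the number of trees, which is our reading and not something stated above; rank_features and the toy data are hypothetical.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

def rank_features(X, y, names, n_estimators=50, seed=0):
    """Fit extra trees and return (name, importance) pairs, most important first."""
    model = ExtraTreesClassifier(n_estimators=n_estimators, random_state=seed)
    model.fit(X, y)
    order = np.argsort(model.feature_importances_)[::-1]
    return [(names[i], float(model.feature_importances_[i])) for i in order]

# Toy data (hypothetical): only feat_0 carries signal about the phenotype.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = (X[:, 0] > 0).astype(int)
names = [f"feat_{i}" for i in range(5)]

ranking = rank_features(X, y, names)
```

For continuous phenotypes the same pattern would use ExtraTreesRegressor in place of the classifier.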

Do you have additional covariates and confounders you would like to adjust for in the munging step prior to training your model and/or would like to reduce your data? To adjust, use the --adjust_data flag with the following necessary flags:

To reduce your data prior to adjusting, use the --umap_reduce yes flag. This flag will also prompt you for whether you want to adjust and normalize your data, and for your target features and confounders. We use Uniform Manifold Approximation and Projection for Dimension Reduction (UMAP) to reduce your data to 2D, adjust it, and export a plot and an adjusted dataframe for use moving forward. This can be done by running the following:

# Running GenoML munging on discrete data using PLINK binary files, a phenotype file, using UMAP to reduce dimensions and account for features, and running feature selection

genoml discrete supervised munge \
--prefix outputs \
--geno examples/discrete/training \
--pheno examples/discrete/training_pheno.csv \
--addit examples/discrete/training_addit.csv \
--umap_reduce yes \
--adjust_data yes \
--adjust_normalize yes \
--target_features examples/discrete/to_adjust.txt \
--confounders examples/discrete/training_addit_confounder_example.csv \
--feature_selection 50 

Here, the --confounders flag takes in a dataset of features that should be accounted for. This is a .csv file with the ID column and header included and is numeric with no missing data. The ID column is mandatory. The --target_features flag takes in a .txt with a list of features (column names) you are adjusting for.
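
One common way to adjust a feature for confounders is to regress the feature on the confounders and keep the residuals, optionally z-scoring them afterwards. The NumPy sketch below shows that technique under the assumption that it resembles what --adjust_data and --adjust_normalize do; it is not taken from the GenoML source, and adjust_for_confounders and the toy variables are hypothetical.

```python
import numpy as np

def adjust_for_confounders(target: np.ndarray, confounders: np.ndarray,
                           normalize: bool = True) -> np.ndarray:
    """Return the residuals of target after linear regression on the confounders."""
    design = np.column_stack([np.ones(len(target)), confounders])  # add intercept
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    residuals = target - design @ coef
    if normalize:
        # z-score the residuals (mean 0, standard deviation 1)
        residuals = (residuals - residuals.mean()) / residuals.std()
    return residuals

# Toy data (hypothetical): a feature that is partly driven by age.
rng = np.random.default_rng(1)
age = rng.normal(65, 10, size=300)
feature = 0.5 * age + rng.normal(size=300)

adjusted = adjust_for_confounders(feature, age.reshape(-1, 1))
```

After adjustment the feature is linearly uncorrelated with the confounder, so a downstream model cannot use it as a proxy for age.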

2. Training with GenoML

Training with GenoML pits a number of different algorithms against each other and outputs the best one based on a specific metric, which can be changed using the --metric_max flag (default is AUC).

Required arguments for GenoML are the following:

The most basic command to train your model looks like the following. It looks for the dataForML file that was generated in the munging step:

# Running GenoML supervised training after munging on discrete data

genoml discrete supervised train \
--prefix outputs

If you would like to determine the best competing algorithm by something other than the AUC, you can do so by changing the --metric_max flag (options include AUC, Balanced_Accuracy, Sensitivity, and Specificity):

# Running GenoML supervised training after munging on discrete data and specifying the metric to maximize by 

genoml discrete supervised train \
--prefix outputs \
--metric_max Sensitivity

Note: The --metric_max flag is only available for discrete datasets.
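
If the four metric names are unfamiliar, the pure-Python sketch below computes each of them from true labels and predicted probabilities and then picks a winner by the chosen metric. The 0.5 decision threshold, the helper names, and the two toy "models" are assumptions for illustration and do not reflect GenoML internals.

```python
def auc(y_true, y_prob):
    """Probability that a random case scores higher than a random control."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def score_predictions(y_true, y_prob, threshold=0.5):
    """Return the four metrics for one model (threshold is an assumed default)."""
    y_pred = [int(p >= threshold) for p in y_prob]
    tp = sum(1 for y, yp in zip(y_true, y_pred) if y == 1 and yp == 1)
    tn = sum(1 for y, yp in zip(y_true, y_pred) if y == 0 and yp == 0)
    sensitivity = tp / sum(1 for y in y_true if y == 1)
    specificity = tn / sum(1 for y in y_true if y == 0)
    return {
        "AUC": auc(y_true, y_prob),
        "Balanced_Accuracy": (sensitivity + specificity) / 2,
        "Sensitivity": sensitivity,
        "Specificity": specificity,
    }

def pick_best(results, metric_max="AUC"):
    """Return the name of the model with the highest value of metric_max."""
    return max(results, key=lambda name: results[name][metric_max])

# Toy withheld-sample predictions from two hypothetical algorithms.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
results = {
    "model_a": score_predictions(y_true, [0.1, 0.2, 0.3, 0.6, 0.4, 0.7, 0.8, 0.9]),
    "model_b": score_predictions(y_true, [0.1, 0.2, 0.55, 0.6, 0.51, 0.52, 0.7, 0.9]),
}
best_by_auc = pick_best(results, "AUC")
best_by_sensitivity = pick_best(results, "Sensitivity")
```

In this toy case the two metrics disagree on the winner, which is why the choice of metric matters when missing a case is costlier than a false alarm.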

3. Tuning with GenoML

The most basic command to tune your model looks like the following. It looks for the file that was generated in the training step:

# Running GenoML supervised tuning after munging and training on discrete data

genoml discrete supervised tune \
--prefix outputs

If you would like to change the number of iterations the tuning process goes through, modify --max_tune (default is 50). To change the number of cross-validations, modify --n_cv (default is 5). The command would look like this:

# Running GenoML supervised tuning after munging and training on discrete data, modifying the number of iterations and cross-validations 

genoml discrete supervised tune \
--prefix outputs \
--max_tune 10 --n_cv 3

If you are interested in tuning on a metric other than AUC (the default), you can modify --metric_tune (options are AUC or Balanced_Accuracy) by doing the following:

# Running GenoML supervised tuning after munging and training on discrete data, modifying the metric to tune by

genoml discrete supervised tune \
--prefix outputs \
--metric_tune Balanced_Accuracy
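
The three tuning options described in this section map naturally onto a randomized hyperparameter search. The scikit-learn sketch below treats --max_tune as the number of sampled settings, --n_cv as the number of cross-validation folds, and --metric_tune as the scoring metric. That mapping, the random forest, and the parameter grid are assumptions chosen for illustration, not a description of GenoML's tuner.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Toy dataset standing in for munged data (hypothetical).
X, y = make_classification(n_samples=120, n_features=8, random_state=0)

# Hypothetical search space with 27 possible combinations.
param_distributions = {
    "n_estimators": [10, 25, 50],
    "max_depth": [2, 4, None],
    "min_samples_leaf": [1, 2, 4],
}

search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=10,                    # analogous to --max_tune 10
    cv=3,                         # analogous to --n_cv 3
    scoring="balanced_accuracy",  # analogous to --metric_tune Balanced_Accuracy
    random_state=0,
)
search.fit(X, y)

best_params = search.best_params_
best_score = search.best_score_
```

Fewer iterations and folds run faster but explore less of the search space and give noisier score estimates.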

4. Testing/Validating with GenoML

To properly test how your model performs on a dataset it has never seen before, when that dataset comes from different PLINK binaries, we have created a harmonization step that will:

  1. Keep only the same SNPs between your reference dataset and the dataset you are using for validation
  2. Force the reference alleles in the validation dataset to match your reference dataset
  3. Export a .txt file with the column names from your reference dataset to later use in the munging of your validation dataset
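
The first two points above can be sketched in a few lines of plain Python. This is a simplified illustration that assumes one reference allele per SNP and ignores strand flips; the function harmonize_variants and the toy rsIDs are hypothetical, and the real harmonization step operates on PLINK files.

```python
def harmonize_variants(ref_alleles: dict, test_variants: dict) -> dict:
    """Keep SNPs shared with the reference and orient alleles to match it.

    ref_alleles maps SNP ID -> reference allele.
    test_variants maps SNP ID -> (allele_1, allele_2) as coded in the test data.
    """
    harmonized = {}
    for snp, (a1, a2) in test_variants.items():
        if snp not in ref_alleles:
            continue  # SNP absent from the reference dataset: drop it
        ref = ref_alleles[snp]
        if a1 == ref:
            harmonized[snp] = (a1, a2)
        elif a2 == ref:
            # Swap so the reference allele comes first; with real genotype
            # data the dosages would also need recoding (2 - dosage).
            harmonized[snp] = (a2, a1)
        # Otherwise neither allele matches the reference, so the SNP is dropped.
    return harmonized

# Toy inputs (hypothetical).
ref_alleles = {"rs1": "A", "rs2": "C", "rs3": "G"}
test_variants = {
    "rs1": ("A", "G"),  # already matches
    "rs2": ("T", "C"),  # needs swapping
    "rs3": ("A", "T"),  # alleles do not match the reference
    "rs9": ("C", "T"),  # not in the reference
}
harmonized = harmonize_variants(ref_alleles, test_variants)
```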

Using GenoML for both your reference dataset and then your validation dataset, the process will look like the following:

  1. Munge and train your first dataset
    • That will be your “reference” model
  2. Use the outputs of step 1's munge for your reference model to harmonize your incoming validation dataset
  3. Run through harmonization step with your validation dataset
  4. Run through munging with your newly harmonized dataset
  5. Retrain your reference model with only the matching columns of your unseen data
    • Given the nature of ML algorithms, you cannot test a model on a set of data that does not have identical features
  6. Test your newly retrained reference model on the unseen data

Harmonizing your Validation/Test Dataset

Required arguments for harmonizing with GenoML are the following:

To harmonize your incoming validation dataset to match the SNPs and alleles to your reference dataset, the command would look like the following:

# Running GenoML harmonize

genoml harmonize \
--test_geno_prefix examples/discrete/validation \
--test_prefix outputs \
--ref_model_prefix outputs \
--training_snps_alleles outputs/Munge/variants_and_alleles.tab

This step will generate:

Now that you have harmonized your validation dataset to your reference dataset, you can now munge using a command similar to the following:

# Running GenoML munge after GenoML harmonize

genoml discrete supervised munge \
--prefix outputs \
--geno outputs/Harmonize/refSNPs_andAlleles \
--pheno examples/discrete/validation_pheno.csv \
--addit examples/discrete/validation_addit.csv \
--ref_cols_harmonize outputs/Harmonize/refColsHarmonize_toKeep.txt

All munging options discussed above are available at this step. The only difference is that you add the --ref_cols_harmonize flag with the refColsHarmonize_toKeep.txt file generated at the end of harmonizing, so that only the columns the reference dataset had are kept.

After munging and training your reference model and harmonizing and munging your unseen test data, you will retrain your reference model to include only matching features. Given the nature of ML algorithms, you cannot test a model on a set of data that does not have identical features.

To retrain your model appropriately, after munging your test data with the --ref_cols_harmonize flag, a final columns list will be generated at outputs/Munge/finalHarmonizedCols_toKeep.txt. This includes all the features that match between your unseen test data and your reference model. Use the --matching_columns flag when retraining your reference model to use the appropriate features.
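
The idea behind the matching columns list is a simple intersection of feature names. The sketch below keeps the reference column order, which is our assumption; the helper matching_columns and the column names are hypothetical examples and not the contents of any GenoML output file.

```python
def matching_columns(reference_cols: list, test_cols: list) -> list:
    """Return the columns present in both datasets, in reference order."""
    test_set = set(test_cols)
    return [col for col in reference_cols if col in test_set]

# Toy column lists (hypothetical).
reference_cols = ["ID", "PHENO", "rs1", "rs2", "rs3", "age"]
test_cols = ["ID", "PHENO", "rs3", "rs1", "sex", "age"]

shared = matching_columns(reference_cols, test_cols)
```

Retraining on only the shared columns means the model never expects a feature (rs2 here) that the unseen data cannot supply.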

When retraining of the reference model is complete, you are ready to test!

A step-by-step guide on how to achieve this is listed below:

# 0. MUNGE THE REFERENCE DATASET
genoml discrete supervised munge \
--prefix outputs \
--geno examples/discrete/training \
--pheno examples/discrete/training_pheno.csv
# Files made: 
    # outputs/Munge/dataForML.h5
    # outputs/Munge/list_features.txt
    # outputs/Munge/variants_and_alleles.tab

# 1. TRAIN THE REFERENCE DATASET
genoml discrete supervised train \
--prefix outputs
# Files made: 
    # outputs/Train/best_algorithm.txt
    # outputs/Train/trainedModel.joblib
    # outputs/Train/trainedModel_trainingSample_Predictions.csv
    # outputs/Train/trainedModel_withheldSample_Predictions.csv
    # outputs/Train/trainedModel_withheldSample_ROC.png
    # outputs/Train/trainedModel_withheldSample_probabilities.png
    # outputs/Train/training_withheldSamples_performanceMetrics.csv

# 2. HARMONIZE TEST DATASET IF USING PLINK/GENOTYPES
genoml harmonize \
--test_geno_prefix examples/discrete/validation \
--test_prefix outputs \
--ref_model_prefix outputs \
--training_snps_alleles outputs/Munge/variants_and_alleles.tab
# Files made: 
    # outputs/Harmonize/refColsHarmonize_toKeep.txt
    # outputs/Harmonize/refSNPs_andAlleles.bed
    # outputs/Harmonize/refSNPs_andAlleles.bim
    # outputs/Harmonize/refSNPs_andAlleles.fam

# 3. MUNGE THE TEST DATASET ON REFERENCE MODEL COLUMNS
genoml discrete supervised munge \
--prefix outputs \
--geno outputs/Harmonize/refSNPs_andAlleles \
--pheno examples/discrete/validation_pheno.csv \
--addit examples/discrete/validation_addit.csv \
--ref_cols_harmonize outputs/Harmonize/refColsHarmonize_toKeep.txt
# Files made: 
    # outputs/Munge/finalHarmonizedCols_toKeep.txt
    # outputs/Munge/list_features.txt
    # outputs/Munge/variants_and_alleles.tab
    # outputs/Munge/dataForML.h5

# 4. RETRAIN REFERENCE MODEL ON INTERSECTING COLUMNS BETWEEN REFERENCE AND TEST
genoml discrete supervised train \
--prefix outputs \
--matching_columns outputs/Munge/finalHarmonizedCols_toKeep.txt
# Note: This replaces the trained model you made in step 1! 
# Files made: 
    # outputs/Train/best_algorithm.txt
    # outputs/Train/trainedModel.joblib
    # outputs/Train/trainedModel_trainingSample_Predictions.csv
    # outputs/Train/trainedModel_withheldSample_Predictions.csv
    # outputs/Train/trainedModel_withheldSample_ROC.png
    # outputs/Train/trainedModel_withheldSample_probabilities.png
    # outputs/Train/training_withheldSamples_performanceMetrics.csv

# OPTIONAL: TUNING YOUR RETRAINED REFERENCE MODEL ON INTERSECTING COLUMNS BETWEEN REFERENCE AND TEST
genoml discrete supervised tune \
--prefix outputs \
--matching_columns outputs/Munge/finalHarmonizedCols_toKeep.txt

# 5. TEST RETRAINED REFERENCE MODEL OR TUNED MODEL ON UNSEEN DATA
genoml discrete supervised test \
--prefix outputs \
--test_prefix outputs \
--ref_model_prefix outputs/Train/trainedModel
    # If testing a tuned model, change path from `*/Train/trainedModel` to `*/Tune/tunedModel`
# Files made: 
    # outputs/Test/testedModel_allSample_predictions.csv
    # outputs/Test/testedModel_allSample_probabilities.png
    # outputs/Test/testedModel_allSample_ROC.png
    # outputs/Test/testedModel_allSamples_performanceMetrics.csv

Note: When munging the test dataset on the reference model columns using the --ref_cols_harmonize flag, be sure not to include the --feature_selection flag, as you have already specified the columns to keep moving forward.

5. Experimental Features

UNDER ACTIVE DEVELOPMENT

Planned experimental features include, but are not limited to: