Apache License 2.0

LGBM

This repository demonstrates how to use the IMPROVE library v0.1.0 for building a drug response prediction (DRP) model using LightGBM (LGBM), and provides examples with the benchmark cross-study analysis (CSA) dataset.

This version, tagged as v0.1.0-2024-09-27, introduces a new API designed to encourage broader adoption of IMPROVE and its curated models by the research community.

Dependencies

Installation instructions are detailed below in Step-by-step instructions.

Conda yml file conda_wo_candle.yml

ML framework:

IMPROVE dependencies:

Dataset

Benchmark data for cross-study analysis (CSA) can be downloaded from this site.

The data tree is shown below:

csa_data/raw_data/
├── splits
│   ├── CCLE_all.txt
│   ├── CCLE_split_0_test.txt
│   ├── CCLE_split_0_train.txt
│   ├── CCLE_split_0_val.txt
│   ├── CCLE_split_1_test.txt
│   ├── CCLE_split_1_train.txt
│   ├── CCLE_split_1_val.txt
│   ├── ...
│   ├── GDSCv2_split_9_test.txt
│   ├── GDSCv2_split_9_train.txt
│   └── GDSCv2_split_9_val.txt
├── x_data
│   ├── cancer_copy_number.tsv
│   ├── cancer_discretized_copy_number.tsv
│   ├── cancer_DNA_methylation.tsv
│   ├── cancer_gene_expression.tsv
│   ├── cancer_miRNA_expression.tsv
│   ├── cancer_mutation_count.tsv
│   ├── cancer_mutation_long_format.tsv
│   ├── cancer_mutation.parquet
│   ├── cancer_RPPA.tsv
│   ├── drug_ecfp4_nbits512.tsv
│   ├── drug_info.tsv
│   ├── drug_mordred_descriptor.tsv
│   └── drug_SMILES.tsv
└── y_data
    └── response.tsv
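
To illustrate how the three folders relate, here is a minimal Python sketch that picks out the response rows belonging to one split. It assumes each split file lists one zero-based row index into response.tsv per line. That layout and the helper name `load_split_rows` are assumptions made for illustration and are not part of the IMPROVE API.

```python
import csv
from pathlib import Path


def load_split_rows(raw_data_dir, split_name):
    """Return the rows of y_data/response.tsv selected by one split file.

    Assumption: the split file holds one zero-based row index per line.
    """
    raw = Path(raw_data_dir)
    # Read the row indices, skipping blank lines.
    with open(raw / "splits" / split_name) as fh:
        indices = [int(line) for line in fh if line.strip()]
    # Read the tab-separated response table; each row becomes a dict keyed by column name.
    with open(raw / "y_data" / "response.tsv", newline="") as fh:
        rows = list(csv.DictReader(fh, delimiter="\t"))
    return [rows[i] for i in indices]
```

For example, `load_split_rows("csa_data/raw_data", "CCLE_split_0_train.txt")` would return the training rows of the first CCLE split, provided the index assumption holds.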

Model scripts and parameter file

Step-by-step instructions

1. Clone the model repository and check out the branch (or tag)

git clone git@github.com:JDACS4C-IMPROVE/LGBM.git
cd LGBM
git checkout v0.1.0-2024-09-27

2. Set up the computational environment

Option 1: create conda env using yml

conda env create -f conda_wo_candle.yml

Option 2: use conda_env_py37.sh

Option 3: use these commands

CONDA_ENV_NAME=lgbm_py37
conda create -n $CONDA_ENV_NAME python=3.7 pip lightgbm=3.1.1 --yes
conda activate $CONDA_ENV_NAME
conda install conda-forge::pandas=1.3.0
conda install conda-forge::scikit-learn=1.0.2
conda install conda-forge::pyyaml=6.0
conda install conda-forge::pyarrow=9.0.0

3. Run setup_improve.sh

source setup_improve.sh

This will:

  1. Download the cross-study analysis (CSA) benchmark data into ./csa_data/.
  2. Clone the IMPROVE repo (and check out v0.1.0-2024-09-27) outside the LGBM model repo.
  3. Set up PYTHONPATH (adds the IMPROVE repo).
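
A quick way to confirm that the setup script had the intended effect is sketched below. It only checks that the data directory exists and that some PYTHONPATH entry ends in a directory called IMPROVE. The helper name `check_setup` and the assumption that the clone directory is named IMPROVE are illustrative, so adjust them to match what the script actually produces on your system.

```python
import os
from pathlib import Path


def check_setup(data_dir="./csa_data", improve_dir_name="IMPROVE", environ=None):
    """Return a list of problems found; an empty list means both checks passed.

    Assumption: the IMPROVE clone lives in a directory named `improve_dir_name`.
    """
    environ = os.environ if environ is None else environ
    problems = []
    # Check 1: the benchmark data directory is present.
    if not Path(data_dir).is_dir():
        problems.append(f"benchmark data directory not found: {data_dir}")
    # Check 2: some PYTHONPATH entry points at the IMPROVE clone.
    entries = [p for p in environ.get("PYTHONPATH", "").split(os.pathsep) if p]
    if not any(Path(p).name == improve_dir_name for p in entries):
        problems.append(f"no PYTHONPATH entry ends with {improve_dir_name}")
    return problems
```

Calling `check_setup()` from the LGBM repo root in the same shell session that sourced the script should return an empty list if both assumptions hold.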

4. Preprocess CSA benchmark data (raw data) to construct model input data (ML data)

python lgbm_preprocess_improve.py --input_dir ./csa_data/raw_data --output_dir exp_result

Preprocesses the CSA data and creates train, validation (val), and test datasets.

Generates:

exp_result
├── param_log_file.txt
├── test_data.parquet
├── test_y_data.csv
├── train_data.parquet
├── train_y_data.csv
├── val_data.parquet
├── val_y_data.csv
├── x_data_gene_expression_scaler.gz
└── x_data_mordred_scaler.gz
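
Before moving on to training, it can help to confirm that preprocessing wrote every file listed above. The sketch below is one way to do that. The file names are copied from the tree, while the helper name `missing_outputs` is hypothetical.

```python
from pathlib import Path

# File names copied from the exp_result tree shown above.
EXPECTED_FILES = [
    "param_log_file.txt",
    "test_data.parquet",
    "test_y_data.csv",
    "train_data.parquet",
    "train_y_data.csv",
    "val_data.parquet",
    "val_y_data.csv",
    "x_data_gene_expression_scaler.gz",
    "x_data_mordred_scaler.gz",
]


def missing_outputs(output_dir="exp_result"):
    """Return the expected preprocessing outputs that are absent from output_dir."""
    out = Path(output_dir)
    return [name for name in EXPECTED_FILES if not (out / name).is_file()]
```

An empty list from `missing_outputs()` means all nine files are in place.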

5. Train LightGBM model

python lgbm_train_improve.py --input_dir exp_result --output_dir exp_result

Trains a LightGBM model on the model input data, using train_data.parquet for training and val_data.parquet for early stopping.

Generates:

6. Run inference on test data with the trained LightGBM model

python lgbm_infer_improve.py --input_data_dir exp_result --input_model_dir exp_result --output_dir exp_result --calc_infer_score true

Evaluates the performance of the trained model on the test dataset.
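
Steps 4 to 6 can also be chained from a small Python driver, sketched below. The script names and flags are the ones used in the commands above. The helper names `build_pipeline` and `run_pipeline` are hypothetical, and the sketch assumes it is run from the LGBM repo root with the environment from steps 2 and 3 active.

```python
import subprocess
import sys


def build_pipeline(raw_dir="./csa_data/raw_data", out_dir="exp_result"):
    """Return the argument lists for preprocess, train, and infer, in order."""
    py = sys.executable  # reuse the interpreter of the active environment
    return [
        [py, "lgbm_preprocess_improve.py",
         "--input_dir", raw_dir, "--output_dir", out_dir],
        [py, "lgbm_train_improve.py",
         "--input_dir", out_dir, "--output_dir", out_dir],
        [py, "lgbm_infer_improve.py",
         "--input_data_dir", out_dir, "--input_model_dir", out_dir,
         "--output_dir", out_dir, "--calc_infer_score", "true"],
    ]


def run_pipeline(commands):
    """Run each command in turn; check=True raises on the first failing stage."""
    for argv in commands:
        subprocess.run(argv, check=True)
```

Calling `run_pipeline(build_pipeline())` runs the three stages in sequence and raises `subprocess.CalledProcessError` as soon as one of them exits with a non-zero status, so later stages never run on incomplete inputs.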

Generates: