
Benchmarking of Machine Learning Methods for Predicting Synthetic Lethality Interactions


Welcome to the official code repository for the paper "Benchmarking of Machine Learning Methods for Predicting Synthetic Lethality Interactions." This repository hosts the implementations of various machine learning models evaluated in our study, along with the preprocessing methods and training data necessary for synthetic lethality (SL) prediction.

About the Study

Our research conducts a thorough benchmark of recent machine learning methods, including three matrix factorization and nine deep learning models. We rigorously test model performance under diverse data splitting scenarios, positive-to-negative sample ratios, and negative sampling methods. The focus is on both classification and ranking tasks, with the aim of assessing the models' generalizability and robustness.

Workflow of the benchmarking study

(Figure: flowchart of the benchmarking process.)

Benchmarking results

The figure below depicts the performance of the machine learning models across the benchmarking scenarios.

(Figure: benchmarking results.) Panels A, B, and C show model performance under the three negative sampling methods (NSM_Rand, NSM_Exp, and NSM_Dep), where lighter colors indicate better performance. The figure is organized into five key sections.

Key Highlights

Benchmarked models

| Method | Paper Title | Article Link | Code Link |
| --- | --- | --- | --- |
| GRSMF | Predicting synthetic lethal interactions in human cancers using graph regularized self-representative matrix factorization | GRSMF | GRSMF |
| SL2MF | SL2MF: Predicting Synthetic Lethality in Human Cancers via Logistic Matrix Factorization | SL2MF | SL2MF |
| CMFW | Predicting synthetic lethal interactions using heterogeneous data sources | CMFW | CMFW |
| SLMGAE | Prediction of Synthetic Lethal Interactions in Human Cancers Using Multi-View Graph Auto-Encoder | SLMGAE | SLMGAE |
| NSF4SL | NSF4SL: negative-sample-free contrastive learning for ranking synthetic lethal partner genes in human cancers | NSF4SL | NSF4SL |
| PTGNN | Pre-training graph neural networks for link prediction in biomedical networks | PTGNN | PTGNN |
| PiLSL | PiLSL: pairwise interaction learning-based graph neural network for synthetic lethality prediction in human cancers | PiLSL | PiLSL |
| KG4SL | KG4SL: knowledge graph neural network for synthetic lethality prediction in human cancers | KG4SL | KG4SL |
| SLGNN | SLGNN: Synthetic lethality prediction in human cancers based on factor-aware knowledge graph neural network | SLGNN | SLGNN |
| DDGCN | Dual-dropout graph convolutional network for predicting synthetic lethality in human cancers | DDGCN | DDGCN |
| GCATSL | Graph contextualized attention network for predicting synthetic lethality in human cancers | GCATSL | GCATSL |
| MGE4SL | Predicting Synthetic Lethality in Human Cancers via Multi-Graph Ensemble Neural Network | MGE4SL | MGE4SL |

Other SL prediction models

Machine Learning-Based Methods

| Method | Paper Title | Article Link |
| --- | --- | --- |
| Paladugu et al. | Mining protein networks for synthetic genetic interactions | Paladugu et al. |
| Pandey et al. | An Integrative Multi-Network and Multi-Classifier Approach to Predict Genetic Interactions | Pandey et al. |
| MetaSL | In Silico Prediction of Synthetic Lethality by Meta-Analysis of Genetic Interactions, Functions, and Pathways in Yeast and Human Cancer | MetaSL |
| EXP2SL | EXP2SL: A Machine Learning Framework for Cell-Line-Specific Synthetic Lethality Prediction | EXP2SL |
| DiscoverSL | DiscoverSL: an R package for multi-omic data driven prediction of synthetic lethality in cancers | DiscoverSL |
| Li et al. | Identification of synthetic lethality based on a functional network by using machine learning algorithms | Li et al. |
| SLant | Predicting synthetic lethal interactions using conserved patterns in protein interaction networks | SLant |
| Wu et al. | Synthetic Lethal Interactions Prediction Based on Multiple Similarity Measures Fusion | Wu et al. |
| De Kegel et al. | Comprehensive prediction of robust synthetic lethality between paralog pairs in cancer cell lines | De Kegel et al. |
| PARIS | Uncovering cancer vulnerabilities by machine learning prediction of synthetic lethality | PARIS |
| SBSL | Overcoming selection bias in synthetic lethality prediction | SBSL |
| ELISL | ELISL: early–late integrated synthetic lethality prediction in cancer | ELISL |

Deep Learning-Based Methods

| Method | Paper Title | Article Link |
| --- | --- | --- |
| MAGCN | MAGCN: A Multiple Attention Graph Convolution Networks for Predicting Synthetic Lethality | MAGCN |
| MVGCN-iSL | Multi-view graph convolutional network for cancer cell-specific synthetic lethality prediction | MVGCN-iSL |

Repository Structure

This repository is organized as follows:

How to use

Environment Preparation

Establish the working environment for this study using Anaconda.

conda env create -f SL-Benchmark.yml

Additionally, several packages (PyTorch Geometric extensions) need to be installed from local wheel files.

conda activate SLBench
pip install ./torch_spline_conv-latest+cu102-cp37-cp37m-linux_x86_64.whl
pip install ./torch_sparse-latest+cu102-cp37-cp37m-linux_x86_64.whl
pip install ./torch_scatter-latest+cu102-cp37-cp37m-linux_x86_64.whl
pip install ./torch_cluster-latest+cu102-cp37-cp37m-linux_x86_64.whl
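
As an optional sanity check (a sketch only; it assumes the SLBench environment provides a CUDA 10.2 build of PyTorch matching the cp37 wheels above), you can confirm that the compiled extensions import and that the GPU is visible:

conda activate SLBench
python -c "import torch, torch_scatter, torch_sparse, torch_cluster, torch_spline_conv; print(torch.__version__, torch.cuda.is_available())"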

Data Preparation and Download Instructions

Follow these steps to download and prepare the training data:

Step 1: Download all the data parts from the Zenodo link provided in the repository.

[!TIP]

The exact download command depends on how you retrieve the files from the hosting service. For convenience, the data files have been compressed and split into parts.

[!IMPORTANT]

We provide two versions of the data: the complete version (data_large.tar.gz) and a version without the PiLSL database (data_small.tar.gz).

The decompressed size of the complete version is about 90GB, while the version without the PiLSL database decompresses to about 22GB.

(Because constructing the PiLSL database is extremely time-consuming, we recommend downloading the complete version. However, the version without the PiLSL database can still run all of the models except PiLSL.)
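
As an illustration of Step 1, the sketch below fetches the split archive parts with wget; the record URL and the part names are placeholders, so substitute the actual Zenodo link and file names listed in the repository.

# Placeholder URL: replace <RECORD_ID> with the Zenodo record provided in the repository.
BASE="https://zenodo.org/records/<RECORD_ID>/files"
wget "${BASE}/data_large.tar.gz.md5"
# Extend the list below to cover every part shown on the record page.
for part in data_large.tar.gz.part00 data_large.tar.gz.part01; do
    wget "${BASE}/${part}"
done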

Step 2: Combine the parts into a single archive.

cat data_large.tar.gz.part* > data_large.tar.gz     # Complete version; extracted size is about 100GB.
# cat data_small.tar.gz.part* > data_small.tar.gz   # Version without the PiLSL database; extracted size is about 25GB.

Step 3: Verify the integrity of the downloaded files.

md5sum -c data_large.tar.gz.md5
# md5sum -c data_small.tar.gz.md5 # Version without the PiLSL database

Step 4: Extract the dataset.

tar -xzvf data_large.tar.gz
# tar -xzvf data_small.tar.gz # Version without the PiLSL database

Run models

# Create the output directories under the results directory, then navigate to the src directory.
cd path/to/results
mkdir Rand_score_mats Exp_score_mats Dep_score_mats score_dist
cd path/to/src

# -m          SL prediction method: 'GRSMF', 'SL2MF', 'CMFW', 'SLMGAE', 'NSF4SL', 'PTGNN', 'PiLSL', 'KG4SL', 'SLGNN', 'DDGCN', 'GCATSL', or 'MGE4SL'.
# -ns         Negative sampling method: 'Rand', 'Exp', or 'Dep'.
# -ds         Data splitting scenario: 'CV1', 'CV2', or 'CV3'.
# -pn         Positive-to-negative ratio: '1', '5', '20', or '50'.
# --save_mat  Save the score matrix of model predictions (this may take up a lot of disk space).
python main.py -m SLMGAE -ns Rand -ds CV1 -pn 1 --save_mat
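
To evaluate many settings in one go, a plain shell loop over the flags above works; this is only a sketch (it fixes the positive-to-negative ratio at 1 and runs jobs sequentially), so adapt it to your scheduler as needed.

# Sweep all methods, negative sampling methods, and data splits at a 1:1 ratio.
for method in GRSMF SL2MF CMFW SLMGAE NSF4SL PTGNN PiLSL KG4SL SLGNN DDGCN GCATSL MGE4SL; do
    for ns in Rand Exp Dep; do
        for ds in CV1 CV2 CV3; do
            python main.py -m "$method" -ns "$ns" -ds "$ds" -pn 1
        done
    done
done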

[!TIP]

Since network conditions may interfere with wandb initialization, we recommend keeping wandb's log data local by running wandb offline before training. When you later need to upload these logs to the wandb server, you can sync all local runs with wandb sync --include-offline.

[!IMPORTANT]

Ensure you have at least 500GB of free disk space for the training data and model prediction results.
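
A quick way to check the free space on the filesystem that will hold the data (assuming you run it from the intended data directory):

df -h .   # the 'Avail' column should report at least 500G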

Future Directions

Beyond providing the necessary tools for SL prediction, this repository serves as a foundation for future improvements in the predictive accuracy and interpretability of ML methods in SL discovery.

We encourage the scientific community to leverage this repository to advance research on synthetic lethality and the pursuit of precision medicine in oncology.