Welcome to the official code repository for the paper "Benchmarking of Machine Learning Methods for Predicting Synthetic Lethality Interactions." This repository hosts the implementations of various machine learning models evaluated in our study, along with the preprocessing methods and training data necessary for synthetic lethality (SL) prediction.
Our research conducts a thorough benchmark of recent machine learning methods, including three matrix factorization and nine deep learning models. We test model performance under diverse data splitting scenarios, negative sample ratios, and negative sampling methods. The evaluation covers both classification and ranking tasks and aims to assess the models' generalizability and robustness.
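To make the idea of a negative sample ratio concrete, here is a minimal sketch of random negative sampling. It assumes that random sampling means drawing unlabeled gene pairs uniformly at random, and that a ratio of N means N negatives per known SL pair. The function name `sample_random_negatives` and its parameters are hypothetical and are not taken from this repository's code.

```python
import random


def sample_random_negatives(positive_pairs, genes, neg_per_pos=1, seed=0):
    """Draw unordered gene pairs that are not known SL pairs, uniformly at random.

    Hypothetical helper for illustration; the repository's sampling may differ.
    """
    rng = random.Random(seed)
    genes = sorted(set(genes))
    # Canonical (sorted) order so that (a, b) and (b, a) count as the same pair.
    known = {tuple(sorted(pair)) for pair in positive_pairs}
    target = neg_per_pos * len(known)
    max_pairs = len(genes) * (len(genes) - 1) // 2
    if target > max_pairs - len(known):
        raise ValueError("not enough unlabeled pairs to sample from")
    negatives = set()
    while len(negatives) < target:
        a, b = rng.sample(genes, 2)  # two distinct genes
        pair = tuple(sorted((a, b)))
        if pair not in known:
            negatives.add(pair)
    return sorted(negatives)


if __name__ == "__main__":
    positives = [("A", "B"), ("C", "D")]
    print(sample_random_negatives(positives, list("ABCDEF"), neg_per_pos=5))
```

Rejection sampling like this is simple but slows down when almost every possible pair is already labeled; enumerating the unlabeled pairs first would be the safer choice in that case.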
The following graph depicts the performance of the machine learning models across various scenarios:
A, B and C represent the model performance under different negative sampling methods (NSM_Rand, NSM_Exp and NSM_Dep), where lighter colors indicate better performance. The figure is structured into five key sections.
Method | Paper Title | Article Link | Code Link |
---|---|---|---|
GRSMF | Predicting synthetic lethal interactions in human cancers using graph regularized self-representative matrix factorization | GRSMF | GRSMF |
SL2MF | SL2MF: Predicting Synthetic Lethality in Human Cancers via Logistic Matrix Factorization | SL2MF | SL2MF |
CMFW | Predicting synthetic lethal interactions using heterogeneous data sources | CMFW | CMFW |
SLMGAE | Prediction of Synthetic Lethal Interactions in Human Cancers Using Multi-View Graph Auto-Encoder | SLMGAE | SLMGAE |
NSF4SL | NSF4SL: negative-sample-free contrastive learning for ranking synthetic lethal partner genes in human cancers | NSF4SL | NSF4SL |
PTGNN | Pre-training graph neural networks for link prediction in biomedical networks | PTGNN | PTGNN |
PiLSL | PiLSL: pairwise interaction learning-based graph neural network for synthetic lethality prediction in human cancers | PiLSL | PiLSL |
KG4SL | KG4SL: knowledge graph neural network for synthetic lethality prediction in human cancers | KG4SL | KG4SL |
SLGNN | SLGNN: Synthetic lethality prediction in human cancers based on factor-aware knowledge graph neural network | SLGNN | SLGNN |
DDGCN | Dual-dropout graph convolutional network for predicting synthetic lethality in human cancers | DDGCN | DDGCN |
GCATSL | Graph contextualized attention network for predicting synthetic lethality in human cancers | GCATSL | GCATSL |
MGE4SL | Predicting Synthetic Lethality in Human Cancers via Multi-Graph Ensemble Neural Network | MGE4SL | MGE4SL |
Method | Paper Title | Article Link |
---|---|---|
Paladugu et al. | Mining protein networks for synthetic genetic interactions | Paladugu et al. |
Pandey et al. | An Integrative Multi-Network and Multi-Classifier Approach to Predict Genetic Interactions | Pandey et al. |
MetaSL | In Silico Prediction of Synthetic Lethality by Meta-Analysis of Genetic Interactions, Functions, and Pathways in Yeast and Human Cancer | MetaSL |
EXP2SL | EXP2SL: A Machine Learning Framework for Cell-Line-Specific Synthetic Lethality Prediction | EXP2SL |
DiscoverSL | DiscoverSL: an R package for multi-omic data driven prediction of synthetic lethality in cancers | DiscoverSL |
Li et al. | Identification of synthetic lethality based on a functional network by using machine learning algorithms | Li et al. |
SLant | Predicting synthetic lethal interactions using conserved patterns in protein interaction networks | SLant |
Wu et al. | Synthetic Lethal Interactions Prediction Based on Multiple Similarity Measures Fusion | Wu et al. |
De Kegel et al. | Comprehensive prediction of robust synthetic lethality between paralog pairs in cancer cell lines | De Kegel et al. |
PARIS | Uncovering cancer vulnerabilities by machine learning prediction of synthetic lethality | PARIS |
SBSL | Overcoming selection bias in synthetic lethality prediction | SBSL |
ELISL | ELISL: early–late integrated synthetic lethality prediction in cancer | ELISL |
Method | Paper Title | Article Link |
---|---|---|
MAGCN | MAGCN: A Multiple Attention Graph Convolution Networks for Predicting Synthetic Lethality | MAGCN |
MVGCN-iSL | Multi-view graph convolutional network for cancer cell-specific synthetic lethality prediction | MVGCN-iSL |
This repository is organized as follows:
- `data/`: This directory is meant to contain the dataset required for training the models. Given the large size of the data files, we have compressed and uploaded them to Google Drive for users to download.
- `results/`: This directory will store the prediction results of the models. It is currently empty and will be populated with data as you run the models.
- `src/`: Main source directory.
  - `config.py`: Configuration settings for the models.
  - `main.py`: Entry point of the SL prediction models.
  - `models/`: Contains the model implementations used in the study.
    - `*.py`: Each model has its own Python file (e.g., `ddgcn.py`, `gcatsl.py`, etc.).
  - `preprocess.py`: Script for data preprocessing.
  - `summary_metrics.ipynb`: Jupyter notebook for summarizing results.
  - `train/`: Training scripts for each model.
  - `utils/`: Utility scripts that support model operations and data manipulation.
  - `wandb/`: Weights & Biases files for experiment tracking. (It is created automatically at runtime.)
  - `preprocess_exp_dep_scores.ipynb`: Notebook detailing preprocessing of experimental dependency scores.

Establish the working environment for this study using Anaconda.
conda env create -f SL-Benchmark.yml
Additionally, the following PyTorch Geometric extension packages must be installed from local wheel files. The wheels listed here target CUDA 10.2, Python 3.7, and Linux x86_64.
conda activate SLBench
pip install ./torch_spline_conv-latest+cu102-cp37-cp37m-linux_x86_64.whl
pip install ./torch_sparse-latest+cu102-cp37-cp37m-linux_x86_64.whl
pip install ./torch_scatter-latest+cu102-cp37-cp37m-linux_x86_64.whl
pip install ./torch_cluster-latest+cu102-cp37-cp37m-linux_x86_64.whl
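After installing the wheels, a quick check from inside the activated environment can confirm that the four packages are visible to Python. This is a minimal sketch: the helper name `missing_packages` is hypothetical, and the check only locates the packages without importing them, so it does not verify CUDA compatibility.

```python
from importlib.util import find_spec

# Import names of the four locally installed wheels listed above.
REQUIRED = ["torch_spline_conv", "torch_sparse", "torch_scatter", "torch_cluster"]


def missing_packages(names=REQUIRED):
    """Return the names that cannot be located in the active environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing: " + ", ".join(missing))
    else:
        print("All packages found.")
```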
Follow these steps to download and prepare the training data:
Step 1: Download all the data parts from the Zenodo link provided in the repository.
[!TIP]
The actual command will depend on how you're downloading files from Google Drive. For the convenience of downloading, the files have been compressed and split.
[!IMPORTANT]
We have prepared two versions of the data: the complete version (`data_large.tar.gz`) and a version without the PiLSL database (`data_small.tar.gz`). The decompressed size of the complete version is about 90GB, while the decompressed size of the version without the PiLSL database is about 22GB.
(Because constructing the database is extremely time-consuming, we recommend downloading the complete version. However, the version without the PiLSL database can run all models except PiLSL.)
Step 2: Combine the parts into a single archive.
cat data_large.tar.gz.part* > data_large.tar.gz # Complete version; the extracted size is about 100GB.
# cat data_small.tar.gz.part* > data_small.tar.gz # Version without the PiLSL database; the extracted size is about 25GB.
Step 3: Verify the integrity of the downloaded files.
md5sum -c data_large.tar.gz.md5
# md5sum -c data_small.tar.gz.md5 # The version without PiLSL database
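On systems without `cat` or `md5sum`, Steps 2 and 3 can be approximated in Python. The sketch below assumes that the parts sort correctly in lexicographic order and that the `.md5` file uses the usual `md5sum` layout (hash first, then the file name). The function names are hypothetical.

```python
import hashlib
import shutil
from pathlib import Path


def combine_parts(directory, archive_name):
    """Concatenate <archive_name>.part* files in lexicographic order."""
    directory = Path(directory)
    parts = sorted(directory.glob(archive_name + ".part*"))
    if not parts:
        raise FileNotFoundError(f"no parts found for {archive_name}")
    target = directory / archive_name
    with open(target, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)  # streams, so large parts are fine
    return target


def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file without loading it into memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(archive_path, md5_file):
    """Compare the archive's digest with the first field of an md5sum-style file."""
    expected = Path(md5_file).read_text().split()[0].lower()
    return md5_of(archive_path) == expected
```

For example, `verify(combine_parts(".", "data_large.tar.gz"), "data_large.tar.gz.md5")` would return `True` when the combined archive matches the recorded checksum.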
Step 4: Extract the dataset.
tar -xzvf data_large.tar.gz
# tar -xzvf data_small.tar.gz # The version without PiLSL database
# Create the output directories inside the results directory
cd path/to/results
mkdir Rand_score_mats Exp_score_mats Dep_score_mats score_dist
# Navigate to the src directory
cd path/to/src
# -m: SL prediction method, one of 'GRSMF', 'SL2MF', 'CMFW', 'SLMGAE', 'NSF4SL', 'PTGNN', 'PiLSL', 'KG4SL', 'SLGNN', 'DDGCN', 'GCATSL' and 'MGE4SL'.
# -ns: negative sampling method: 'Rand', 'Exp', or 'Dep'.
# -ds: data splitting method: 'CV1', 'CV2', or 'CV3'.
# -pn: positive to negative ratio: '1', '5', '20', or '50'.
# --save_mat: save the score matrix of the model predictions. (This may take up a lot of disk space.)
python main.py -m SLMGAE \
    -ns Rand \
    -ds CV1 \
    -pn 1 \
    --save_mat
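To run several configurations in sequence, the command above can be generated programmatically. The following is a minimal sketch that only uses the flags and values listed above. The helper names `build_commands` and `run_all` are hypothetical, and `path/to/src` is the same placeholder as in the commands above.

```python
import itertools
import subprocess

METHODS = ["GRSMF", "SL2MF", "CMFW", "SLMGAE", "NSF4SL", "PTGNN",
           "PiLSL", "KG4SL", "SLGNN", "DDGCN", "GCATSL", "MGE4SL"]
NEG_SAMPLING = ["Rand", "Exp", "Dep"]
SPLITS = ["CV1", "CV2", "CV3"]
RATIOS = ["1", "5", "20", "50"]


def build_commands(methods=METHODS, neg_sampling=NEG_SAMPLING,
                   splits=SPLITS, ratios=RATIOS, save_mat=False):
    """Return one argument list per combination of method, sampling, split and ratio."""
    commands = []
    for m, ns, ds, pn in itertools.product(methods, neg_sampling, splits, ratios):
        cmd = ["python", "main.py", "-m", m, "-ns", ns, "-ds", ds, "-pn", pn]
        if save_mat:
            cmd.append("--save_mat")
        commands.append(cmd)
    return commands


def run_all(commands, cwd="path/to/src"):
    """Run the commands one after another, stopping at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, cwd=cwd, check=True)
```

Calling `run_all(build_commands(methods=["SLMGAE"], ratios=["1"]))` would run SLMGAE under every sampling and splitting method at a ratio of 1. Restricting the lists this way keeps the number of runs manageable.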
[!TIP]
Considering that network conditions may affect the initialization of wandb, we recommend keeping wandb's log data locally by using the following command:
wandb offline
When you need to upload this data to the Wandb server, you can use the following command to upload all local logs to the cloud:
wandb sync --include-offline
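When training is launched from a Python script, offline logging can also be requested per process through the `WANDB_MODE` environment variable instead of the `wandb offline` command. This is a minimal sketch; the helper name `offline_env` is hypothetical.

```python
import os


def offline_env(base=None):
    """Return a copy of the environment with wandb switched to offline mode."""
    env = dict(os.environ if base is None else base)
    env["WANDB_MODE"] = "offline"
    return env
```

The result can be passed to a child process, for example `subprocess.run(cmd, env=offline_env())`, so that the setting applies only to that run.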
[!IMPORTANT]
Ensure you have at least 500GB of free disk space to store the training data and model prediction results.
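The free-space requirement can be checked before downloading or training. The sketch below uses the standard library only. The helper name `has_free_space` is hypothetical, and it treats 1GB as 1024^3 bytes, which is an assumption about how the figure above is meant.

```python
import shutil


def has_free_space(path=".", required_gb=500):
    """Return True if the filesystem containing `path` has enough free space."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gb * 1024 ** 3


if __name__ == "__main__":
    print("Enough space." if has_free_space() else "Less than 500GB free.")
```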
Beyond providing the necessary tools for SL prediction, this repository serves as a foundation for future improvements in the predictive accuracy and interpretability of ML methods in SL discovery.
We encourage the scientific community to leverage this repository for advancing the research in synthetic lethality and the pursuit of precision medicine in oncology.