DeepAAI
GNU Affero General Public License v3.0

A Novel Deep Learning Method for Identifying Antigen-Antibody Interactions (DeepAAI)

DeepAAI is an advanced deep learning-based tool for identifying antigen-antibody interactions.

To make DeepAAI freely available to the community, we have set up a web service that predicts antigen-antibody interactions: https://aai-test.github.io/

Architecture

Installation

pip install -r requirements.txt

Training

Before training, decompress all the *.7z files in dataset/corpus/processed_mat/. The decompressed files are the processed datasets that the models take as input.
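The decompression step can be scripted. The following is a minimal sketch, assuming the 7-Zip command-line tool is installed and available as `7z`; the helper names (`find_archives`, `build_extract_command`, `extract_all`) are illustrative and not part of the repository.

```python
import subprocess
from pathlib import Path

# Directory named in this README; adjust if the repository lives elsewhere.
PROCESSED_DIR = Path("dataset/corpus/processed_mat")


def find_archives(directory: Path) -> list[Path]:
    """Return all *.7z archives directly inside `directory`, sorted by name."""
    return sorted(directory.glob("*.7z"))


def build_extract_command(archive: Path) -> list[str]:
    """Build a `7z x` command that extracts `archive` into its own folder.

    Assumption: the 7-Zip CLI is on PATH under the name `7z`.
    """
    return ["7z", "x", str(archive), f"-o{archive.parent}", "-y"]


def extract_all(directory: Path = PROCESSED_DIR) -> None:
    """Extract every archive in `directory`, stopping at the first failure."""
    for archive in find_archives(directory):
        subprocess.run(build_extract_command(archive), check=True)
```

Calling `extract_all()` from the repository root should unpack each archive next to itself. If `7z` is not installed, any other 7z-compatible extractor can be used instead.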

Run the following script to train the antigen-antibody neutralization model with kmer features on the HIV dataset.

python model_trainer/deep_aai_kmer_embedding_cls_trainer.py --mode train

Run the following script to train the antigen-antibody IC50 prediction model with kmer features on the HIV dataset.

python model_trainer/deep_aai_kmer_embedding_reg_trainer.py --mode train

Run the following script to train the antigen-antibody neutralization model with kmer and PSSM features on the HIV dataset.

python model_trainer/deep_aai_kmer_pssm_embedding_cls_trainer.py --mode train

Run the following script to train the antigen-antibody IC50 prediction model with kmer and PSSM features on the HIV dataset.

python model_trainer/deep_aai_kmer_pssm_embedding_reg_trainer.py --mode train

Run the following script to train the antigen-antibody neutralization model on the SARS-CoV-2 dataset.

python model_trainer/deep_aai_kmer_embedding_cov_cls_trainer.py --mode train

Hyper-parameters in DeepAAI:

| Parameter | Value |
| --- | --- |
| Dropout | 0.4 |
| Adj L1 loss | 5e-4 |
| Param L2 loss | 5e-4 |
| Amino embedding size | 7 |
| Hidden size | 512 |
| Learning rate | 5e-5 |
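
For reference, the hyper-parameter values can be collected in a single configuration object. The sketch below is illustrative only: the field names are assumptions. It also assumes that "Adj L1 loss" and "Param L2 loss" are weights on an L1 penalty over adjacency entries and an L2 penalty over model parameters. The trainer scripts may define and combine these terms differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeepAAIHyperParams:
    """Values copied from the hyper-parameter list; field names are illustrative."""
    dropout: float = 0.4
    adj_l1_weight: float = 5e-4
    param_l2_weight: float = 5e-4
    amino_embedding_size: int = 7
    hidden_size: int = 512
    learning_rate: float = 5e-5


def regularized_loss(task_loss, adjacency, params, hp=DeepAAIHyperParams()):
    """Add an L1 penalty on adjacency entries and an L2 penalty on parameters.

    Assumption: the penalties are plain sums of |a| and w**2; the repository
    may normalize or weight them differently.
    """
    adj_l1 = sum(abs(a) for row in adjacency for a in row)
    param_l2 = sum(w * w for w in params)
    return task_loss + hp.adj_l1_weight * adj_l1 + hp.param_l2_weight * param_l2
```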

Run the following script to train the antigen-antibody neutralization model with the AG-Fast-Parapred baseline on the HIV dataset.

python model_trainer/baseline_ag_fast_parapred_cls_trainer.py --mode train

Run the following script to train the antigen-antibody IC50 prediction model with the AG-Fast-Parapred baseline on the HIV dataset.

python model_trainer/baseline_ag_fast_parapred_reg_trainer.py --mode train

Run the following script to train the antigen-antibody neutralization model with the Parapred baseline on the HIV dataset.

python model_trainer/baseline_parapred_cls_trainer.py --mode train

Run the following script to train the antigen-antibody IC50 prediction model with the Parapred baseline on the HIV dataset.

python model_trainer/baseline_parapred_reg_trainer.py --mode train
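
To run several of the training commands above in sequence, a small driver script can help. This is a sketch under stated assumptions: the short labels in `TRAINERS` and the helper names are made up for illustration, while the script paths and the `--mode train` flag are taken from the commands in this README.

```python
import subprocess
import sys

# Trainer scripts listed in this README, keyed by illustrative short labels.
TRAINERS = {
    "hiv_kmer_cls": "model_trainer/deep_aai_kmer_embedding_cls_trainer.py",
    "hiv_kmer_reg": "model_trainer/deep_aai_kmer_embedding_reg_trainer.py",
    "hiv_kmer_pssm_cls": "model_trainer/deep_aai_kmer_pssm_embedding_cls_trainer.py",
    "hiv_kmer_pssm_reg": "model_trainer/deep_aai_kmer_pssm_embedding_reg_trainer.py",
    "cov_kmer_cls": "model_trainer/deep_aai_kmer_embedding_cov_cls_trainer.py",
    "hiv_ag_fast_parapred_cls": "model_trainer/baseline_ag_fast_parapred_cls_trainer.py",
    "hiv_ag_fast_parapred_reg": "model_trainer/baseline_ag_fast_parapred_reg_trainer.py",
    "hiv_parapred_cls": "model_trainer/baseline_parapred_cls_trainer.py",
    "hiv_parapred_reg": "model_trainer/baseline_parapred_reg_trainer.py",
}


def build_train_command(script: str) -> list[str]:
    """Return the command line for one trainer, using the current interpreter."""
    return [sys.executable, script, "--mode", "train"]


def train(labels) -> None:
    """Run the selected trainers one after another; stop on the first failure."""
    for label in labels:
        subprocess.run(build_train_command(TRAINERS[label]), check=True)
```

For example, `train(["hiv_kmer_cls", "hiv_parapred_cls"])`, run from the repository root, would train DeepAAI and the Parapred baseline for HIV neutralization back to back.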

Preprocessing dataset

The data pre-processing module is in the processing/ folder, which has three sub-folders: hiv_cls, hiv_reg, and cov_cls. To understand the pre-processing, read processing.py and the other *.py files under processing/hiv_cls/, processing/hiv_reg/, and processing/cov_cls/.

Note that five sequences were sampled for each SARS-CoV-2 variant (SARS-CoV2_WT, SARS-CoV2_Alpha, SARS-CoV2_Beta, SARS-CoV2_Gamma, SARS-CoV2_Delta). The unseen test set includes SARS-CoV2_Omicron.
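
The per-variant sampling can be illustrated as follows. This is a hedged sketch, not the repository's sampling code: the function name, the fixed seed, and the use of uniform sampling without replacement are all assumptions made for illustration.

```python
import random

# Variant names as written in this README.
VARIANTS = [
    "SARS-CoV2_WT",
    "SARS-CoV2_Alpha",
    "SARS-CoV2_Beta",
    "SARS-CoV2_Gamma",
    "SARS-CoV2_Delta",
]


def sample_per_variant(sequences_by_variant, n=5, seed=0):
    """Draw `n` sequences per variant, without replacement.

    Assumption: sampling is uniform and seeded for reproducibility.
    Raises ValueError if a variant has fewer than `n` sequences.
    """
    rng = random.Random(seed)
    return {
        variant: rng.sample(list(sequences), n)
        for variant, sequences in sequences_by_variant.items()
    }
```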

processing/hiv_cls/, processing/hiv_reg/, and processing/cov_cls/ each contain a processing.py script. Lines 62-82 in each processing.py show how sequences are converted into kmer, one-hot, PSSM, and other features.
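
As a rough idea of what kmer and one-hot features look like for an amino-acid sequence, consider the sketch below. It does not reproduce the code in processing.py: the function names, the 20-letter alphabet ordering, the normalization of kmer counts, and the handling of unknown residues are assumptions made for illustration.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard amino acids, alphabetical


def kmer_frequencies(sequence: str, k: int = 2) -> list[float]:
    """Return normalized counts of every possible k-mer, in lexicographic order.

    k-mers containing non-standard residues are ignored.
    """
    vocab = ["".join(p) for p in product(AMINO_ACIDS, repeat=k)]
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    total = sum(counts[kmer] for kmer in vocab)
    return [counts[kmer] / total if total else 0.0 for kmer in vocab]


def one_hot(sequence: str) -> list[list[int]]:
    """Return one 20-dimensional indicator row per residue.

    Residues outside the standard alphabet map to an all-zero row.
    """
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    rows = []
    for aa in sequence:
        row = [0] * len(AMINO_ACIDS)
        if aa in index:
            row[index[aa]] = 1
        rows.append(row)
    return rows
```

With `k=2` the kmer vector has 400 entries (20 × 20) regardless of sequence length, whereas the one-hot encoding has one row per residue.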

The PSSMs need to be obtained from POSSUM (Wang, J. et al. POSSUM: a bioinformatics toolkit for generating numerical sequence feature descriptors based on PSSM profiles. Bioinformatics 33, 2756–2758, 2017) and placed in the pssm folder. We use the UniRef50 database to generate the PSSMs.

Run the following script to process the HIV dataset for classification.

python processing/hiv_cls/processing.py

Run the following script to process the HIV dataset for regression.

python processing/hiv_reg/processing.py

Run the following script to process the SARS-CoV-2 dataset.

python processing/cov_cls/processing.py

There are no additional criteria for filtering the data of the other datasets. For more details on data collection and features, please see the Data and Feature subsections of the Method section on pages 9-10 of the manuscript.