TABR-BERT: an Accurate and Robust BERT-based Transfer Learning Model for TCR-pMHC Interaction Prediction
Publication: https://doi.org/10.1093/bib/bbad436
Contact: hui.yao@freshwindbiotech.com
There are two ways to run TABR-BERT:
1. Docker
For Docker installation instructions, see https://docs.docker.com/
Pull the image of TABR-BERT from dockerhub:
docker pull freshwindbioinformatics/tabr-bert:v1
Run the image in bash:
docker run -it --gpus all freshwindbioinformatics/tabr-bert:v1 bash
* Note: The "--gpus" option requires Docker 19.03 or later.
2. Conda and pip
# Command:
conda create -n tabr_bert python==3.9.12
conda activate tabr_bert
pip install -r requirements.txt
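After installation, a quick sanity check can confirm that the environment is usable. The following is a minimal sketch and is not part of the repository. It assumes that requirements.txt installs PyTorch (suggested by the *.pt model files, but not stated in this README), and the function name `check_environment` is hypothetical.

```python
import importlib.util
import sys


def check_environment(expected=(3, 9)):
    """Report the Python version and whether PyTorch can be found.

    Assumption: PyTorch ("torch") is among the packages in requirements.txt.
    """
    report = {
        "python": "{}.{}.{}".format(*sys.version_info[:3]),
        "python_matches": sys.version_info[:2] == tuple(expected),
        "torch_installed": importlib.util.find_spec("torch") is not None,
    }
    if report["torch_installed"]:
        import torch  # imported lazily so the check still runs without it

        report["cuda_available"] = torch.cuda.is_available()
        report["gpu_count"] = torch.cuda.device_count()
    return report


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(f"{key}: {value}")
```

If `gpu_count` is lower than the value you plan to pass to --GPUs, adjust that option accordingly.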
The data used to train TCR-BERT and pMHC-BERT, as well as the healthy TCR dataset, are available at https://zenodo.org/record/8215354
Usage: pre_train_tcr_embedding_model.py [options]
Required:
--input STRING: The input data to train the TCR embedding model (*.csv)
Required columns: "cdr3"
--model_dir STRING: where to save the model (*.pt)
Optional:
--n_layers INT: number of transformer encoder layers (default: 4)
--d_model INT: embedding dimension (default: 256)
--batchsize INT: mini-batch size (default: 1024)
--lr FLOAT: learning rate (default: 5e-5)
--max_epoch INT: maximum number of training epochs (default: 100)
--GPUs INT: number of GPUs used in this task (default: 2)
python pre_train_tcr_embedding_model.py
This requires two GPUs with more than 8 GB of memory. Lowering --batchsize reduces the memory requirement but may affect the stability and effectiveness of training.
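Before starting a long training run, it can help to confirm that the input file has the required "cdr3" column. The sketch below is a hypothetical helper, not part of the repository; the function name `missing_columns` and the example CDR3 strings are illustrative assumptions.

```python
import csv
import io


def missing_columns(csv_file, required):
    """Return the required column names that are absent from the CSV header."""
    reader = csv.DictReader(csv_file)
    header = reader.fieldnames or []
    return [name for name in required if name not in header]


# Illustrative in-memory content; a real file would be opened with
# open(path, newline="") and passed in the same way.
example = io.StringIO("cdr3\nCASSLGQAYEQYF\nCASSPGTGGTDTQYF\n")
print(missing_columns(example, ["cdr3"]))  # prints []
```

The same helper can be reused for the other scripts by passing their required column lists.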
Usage: pre_train_pmhc_embedding_model.py [options]
Required:
--input STRING: The input data to train the pMHC embedding model (*.csv)
Required columns: ["allele", "peptide", "label"]
--random_peptide STRING: natural peptides for generating negative cases (*.csv)
Required columns: "peptide"
--model_dir STRING: where to save the model (*.pt)
Optional:
--n_layers INT: number of transformer encoder layers (default: 4)
--d_model INT: embedding dimension (default: 256)
--neg_X INT: multiple of negative cases relative to positive cases (default: 2)
--batchsize INT: mini-batch size (default: 1024)
--lr FLOAT: learning rate (default: 5e-5)
--max_epoch INT: maximum number of training epochs (default: 100)
--GPUs INT: number of GPUs used in this task (default: 2)
python pre_train_pmhc_embedding_model.py
This requires two GPUs with more than 14 GB of memory. Lowering --batchsize reduces the memory requirement but may affect the stability and effectiveness of training.
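The --neg_X option controls how many negative cases are generated relative to the positive cases. How the script draws them is not described in this README, so the sketch below shows only one plausible scheme: each positive allele is paired with peptides sampled from the --random_peptide pool and labelled 0. The function name `make_negatives`, the 0 label encoding, the allele name format, and the pool peptides are all assumptions made for illustration.

```python
import random


def make_negatives(positives, random_peptides, neg_x=2, seed=0):
    """Build neg_x negative rows per positive row (assumed scheme).

    positives: list of (allele, peptide) pairs treated as binders.
    random_peptides: non-empty pool of natural peptides to sample from.
    """
    rng = random.Random(seed)
    known = set(positives)
    negatives = []
    for allele, _ in positives:
        made = 0
        attempts = 0
        # Bound the number of attempts so a tiny pool cannot loop forever.
        while made < neg_x and attempts < 100 * neg_x:
            attempts += 1
            candidate = rng.choice(random_peptides)
            if (allele, candidate) not in known:  # skip known binders
                negatives.append(
                    {"allele": allele, "peptide": candidate, "label": 0}
                )
                made += 1
    return negatives


positives = [("HLA-A*02:01", "GILGFVFTL"), ("HLA-A*02:01", "NLVPMVATV")]
pool = ["SIINFEKLM", "KLGGALQAK", "AVFDRKSDAK"]  # placeholder peptides
rows = make_negatives(positives, pool, neg_x=2)
print(len(rows))  # 2 positives x neg_x=2 -> 4
```

With the default of 2, the training set would contain roughly twice as many negative rows as positive rows under this scheme.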
Usage: train_tcr_pmhc_prediction_model.py [options]
Required:
--input STRING: The input data to train the TCR-pMHC prediction model (*.csv)
Required columns: ["allele", "peptide", "cdr3"]
--healthy_tcr STRING: TCRs from healthy people for generating negative cases (*.csv)
Required columns: "cdr3"
--pseudo_sequence_dict STRING: allele name to pseudo sequence (*.csv)
Required columns: ["allele", "sequence"]
--tcr_model STRING: TCR embedding model dir (*.pt)
--pmhc_model STRING: pMHC embedding model dir (*.pt)
--model_dir STRING: where to save the model (*.pt)
Optional:
--batchsize INT: mini-batch size (default: 256)
--embedding_batchsize INT: mini-batch size for embedding generation (default: 256)
--pmhc_d_model INT: dimension of the pMHC embedding (default: 256)
--tcr_d_model INT: dimension of the TCR embedding (default: 256)
--lr FLOAT: learning rate (default: 5e-4)
--max_epoch INT: maximum number of training epochs (default: 500)
--GPUs INT: number of GPUs used in this task (default: 2)
python train_tcr_pmhc_prediction_model.py
This requires two GPUs with more than 5 GB of memory. Lowering --batchsize reduces the memory requirement but may affect the stability and effectiveness of training.
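Because --pseudo_sequence_dict maps allele names to pseudo sequences, every allele in the input presumably needs an entry in that file. The sketch below is a hypothetical pre-flight check, not part of the repository; the function name, the allele name format, and the placeholder sequence are assumptions.

```python
import csv
import io


def alleles_without_pseudo_sequence(input_csv, dict_csv):
    """Return alleles used in the input that the dictionary does not cover."""
    covered = {row["allele"] for row in csv.DictReader(dict_csv)}
    used = {row["allele"] for row in csv.DictReader(input_csv)}
    return sorted(used - covered)


# Illustrative in-memory files; "PLACEHOLDER" is not a real pseudo sequence.
input_csv = io.StringIO(
    "allele,peptide,cdr3\n"
    "HLA-A*02:01,GILGFVFTL,CASSIRSSYEQYF\n"
    "HLA-B*07:02,TPRVTGGGAM,CASSLGQAYEQYF\n"
)
dict_csv = io.StringIO("allele,sequence\nHLA-A*02:01,PLACEHOLDER\n")
print(alleles_without_pseudo_sequence(input_csv, dict_csv))
# prints ['HLA-B*07:02']
```

An empty list means every allele in the input has a pseudo sequence.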
Usage: predict_tcr_pmhc_binding.py [options]
Required:
--input STRING: The data to be predicted (*.csv)
Required columns: ["allele", "peptide", "cdr3"]
--healthy_tcr STRING: TCRs from healthy people for generating negative cases (*.csv)
Required columns: "cdr3"
--pseudo_sequence_dict STRING: allele name to pseudo sequence (*.csv)
Required columns: ["allele", "sequence"]
--tcr_pmhc_model STRING: TCR-pMHC prediction model dir (*.pt)
--tcr_model STRING: TCR embedding model dir (*.pt)
--pmhc_model STRING: pMHC embedding model dir (*.pt)
--output STRING: output file dir (*.csv)
Optional:
--batchsize INT: mini-batch size (default: 256)
--embedding_batchsize INT: mini-batch size for embedding generation (default: 256)
--pmhc_d_model INT: dimension of the pMHC embedding (default: 256)
--tcr_d_model INT: dimension of the TCR embedding (default: 256)
--GPUs INT: number of GPUs used in this task; 1 is recommended if a GPU is available, otherwise 0 (default: 0)
python predict_tcr_pmhc_binding.py --input input_data.csv
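The prediction script lists seven required options, so assembling the full command by hand is error-prone. The sketch below builds it from a dictionary using only the option names documented above. The helper name `build_predict_command` and every file name in the example are hypothetical placeholders, not files shipped with the repository.

```python
import shlex

# Required options of predict_tcr_pmhc_binding.py, as listed in this README.
REQUIRED = (
    "input", "healthy_tcr", "pseudo_sequence_dict",
    "tcr_pmhc_model", "tcr_model", "pmhc_model", "output",
)


def build_predict_command(paths, gpus=0):
    """Assemble the prediction command as an argument list."""
    missing = [name for name in REQUIRED if name not in paths]
    if missing:
        raise ValueError(f"missing required options: {missing}")
    command = ["python", "predict_tcr_pmhc_binding.py"]
    for name in REQUIRED:
        command += [f"--{name}", str(paths[name])]
    command += ["--GPUs", str(gpus)]
    return command


# Placeholder file names for illustration only.
paths = {
    "input": "input_data.csv",
    "healthy_tcr": "healthy_tcr.csv",
    "pseudo_sequence_dict": "pseudo_sequence.csv",
    "tcr_pmhc_model": "tcr_pmhc_model.pt",
    "tcr_model": "tcr_model.pt",
    "pmhc_model": "pmhc_model.pt",
    "output": "prediction.csv",
}
command = build_predict_command(paths, gpus=0)
print(shlex.join(command))
```

The returned list can be passed to `subprocess.run(command, check=True)` from the repository directory.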
Jiawei Zhang, Wang Ma, Hui Yao, "Accurate TCR-pMHC interaction prediction using a BERT-based transfer learning method", Briefings in Bioinformatics, Volume 25, Issue 1, January 2024, bbad436, https://doi.org/10.1093/bib/bbad436