a96123155 / UTR-LM

GNU General Public License v3.0

UTR-LM: A Semi-supervised 5’ UTR Language Model for mRNA Translation and Expression Prediction

The untranslated region (UTR) of an RNA molecule plays a vital role in gene expression regulation. Specifically, the 5' UTR, located at the 5' end of an RNA molecule, is a critical determinant of the RNA's translation efficiency. Language models have demonstrated their utility in predicting and optimizing the function of protein-encoding sequences and genome sequences. In this study, we developed a semi-supervised language model for the 5' UTR, which is pre-trained on a combined library of random 5' UTRs and endogenous 5' UTRs from multiple species. We augmented the model with supervised information that can be computed from the sequence alone, including the secondary structure and minimum free energy, which improved the semantic representation.

Implementable Version

You can run the code on Code Ocean.

File Structure

An example from the eight MRL libraries:

| Training File | Test File | Splitting Strategy | Description |
| --- | --- | --- | --- |
| 4.1_train_data_GSM3130435_egfp_unmod_1.csv | 4.1_test_data_GSM3130435_egfp_unmod_1.csv | Rank | GSM3130435_egfp_unmod_1 |
| 4.2_train_data_GSM3130435_egfp_unmod_1.csv | 4.2_test_data_GSM3130435_egfp_unmod_1.csv | Random | GSM3130435_egfp_unmod_1 |
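
The two splitting strategies can be illustrated with a small sketch. It assumes that "Rank" means holding out the highest-scoring rows and that "Random" means holding out a uniformly random subset; the README does not spell out either definition, and the column name `total_reads` is hypothetical.

```python
import random


def rank_split(rows, score_key, n_test):
    """Hold out the n_test rows with the highest score (assumed meaning of 'Rank')."""
    ordered = sorted(rows, key=lambda r: r[score_key], reverse=True)
    return ordered[n_test:], ordered[:n_test]  # (train, test)


def random_split(rows, n_test, seed=0):
    """Hold out n_test rows chosen uniformly at random (assumed meaning of 'Random')."""
    shuffled = list(rows)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)
    return shuffled[n_test:], shuffled[:n_test]  # (train, test)


# Toy rows; 'total_reads' is a hypothetical column name, not taken from the CSV files.
rows = [{"utr": "ACGU" * 3, "total_reads": n} for n in (5, 40, 12, 33, 8, 21)]
rank_train, rank_test = rank_split(rows, "total_reads", n_test=2)
rand_train, rand_test = random_split(rows, n_test=2)
```

The shipped 4.1 and 4.2 CSV files are already split, so a helper like this would only be needed when preparing a new library.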

Install

  1. Create a conda environment.
    conda create -n UTRLM python==3.9.13
    conda activate UTRLM
  2. In the UTRLM environment, install the Python packages listed in utrlm_requirements.txt.
    pip install -r utrlm_requirements.txt

    Or install them individually:

    pip install pandas==1.4.3
    pip3 install torch torchvision torchaudio
    pip install torchsummary
    pip install tqdm scikit-learn scipy matplotlib seaborn
  3. Set up the model.
    pip install fair-esm
    find . -name esm
    cp -r ./Scripts/esm ./.conda/envs/UTRLM/lib/python3.9/site-packages/

It is very important to copy the folder ./Scripts/esm/ into the site-packages folder of the conda environment (for example, ./.conda/envs/UTRLM/lib/python3.9/site-packages/), because we have modified the source code of ESM.
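
As a quick check after copying, the following sketch reports which directory Python would import `esm` from in the active environment. It only confirms the location, not that the files there are the modified copy; the helper name `package_dir` is introduced here for illustration.

```python
import importlib.util
import sysconfig
from pathlib import Path


def package_dir(name):
    """Return the directory a top-level package would be imported from, or None."""
    spec = importlib.util.find_spec(name)
    if spec is None or spec.origin is None:
        return None
    return Path(spec.origin).parent


# site-packages directory of the interpreter running this script
site_packages = Path(sysconfig.get_paths()["purelib"])

esm_dir = package_dir("esm")
if esm_dir is None:
    print("esm is not importable in this environment")
else:
    print("esm is imported from:", esm_dir)
    print("inside this environment's site-packages:", site_packages in esm_dir.parents)
```

Run it with the UTRLM environment activated so that the interpreter being inspected is the one the training scripts will use.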

Instruction Example

The final parameters and checkpoints are described in the scripts below.

UTR-LM pretraining process

cd ./Scripts/UTRLM_pretraining
python -m torch.distributed.launch --nproc_per_node=1 --master_port 1234 \
    v2DDP_ESM2_alldata_SupervisedInfo.py \
    --prefix v2DDP_try --lr 1e-5 --layers 6 --heads 6 --embed_dim 16 \
    --train_fasta ./Data/Pretrained_Data/Fivespecies_Cao_energy_structure_CaoEnergyNormalDist_255795sequence.fasta \
    --device_ids 0 --epochs 200

We recommend using one of the following scripts:

| Code File | Description |
| --- | --- |
| v2DDP_ESM2_alldata_SupervisedInfo.py | MLM+MFE |
| v3DDP_ESM2_alldata_SupervisedInfo_SecondaryStructure.py | MLM+MFE+SecondaryStructure |
| v4DDP_ESM2_alldata_SecondaryStructure.py | MLM+SecondaryStructure |
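
The three variants differ in which terms enter the pretraining objective. A minimal sketch, assuming the terms are combined as a weighted sum: the weight names echo the `supervisedweight1.0_structureweight1.0` fragment of the checkpoint file name used in the fine-tuning command, but the weighted-sum form itself is an assumption and the scripts may combine the terms differently.

```python
def combined_loss(mlm_loss, mfe_loss=None, structure_loss=None,
                  supervised_weight=1.0, structure_weight=1.0):
    """Weighted sum of whichever loss terms a variant uses (assumed form)."""
    total = mlm_loss
    if mfe_loss is not None:
        total += supervised_weight * mfe_loss
    if structure_loss is not None:
        total += structure_weight * structure_loss
    return total


# Toy per-batch loss values for illustration only, not real training numbers.
v2 = combined_loss(2.0, mfe_loss=0.5)                        # MLM + MFE
v3 = combined_loss(2.0, mfe_loss=0.5, structure_loss=0.25)   # MLM + MFE + SecondaryStructure
v4 = combined_loss(2.0, structure_loss=0.25)                 # MLM + SecondaryStructure
```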

Parameters you MUST define:

UTR-LM downstream fine-tuning process

1. For MRL task:
cd ./Scripts/UTRLM_downstream
CUDA_VISIBLE_DEVICES=0,1,2,3 python3 -m torch.distributed.launch --nproc_per_node=4 --master_port 5001 \
    MJ3_Finetune_extract_append_predictor_Sample_10fold-lr-huber-DDP.py \
    --device_ids 0,1,2,3 --label_type rl --epochs 300 --huber_loss \
    --train_file 4.1_train_data_GSM3130435_egfp_unmod_1.csv \
    --prefix ESM2SISS_FS4.1.ep93.1e-2.dr5 --lr 1e-2 --dropout3 0.5 \
    --modelfile ./Model/Pretrained/ESM2SISS_FS4.1_fiveSpeciesCao_6layers_16heads_128embedsize_4096batchToks_lr1e-05_supervisedweight1.0_structureweight1.0_MLMLossMin_epoch93.pkl \
    --finetune --bos_emb --test1fold
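
The checkpoint file name appears to encode the architecture of the pretrained model. The sketch below pulls those numbers out with a regular expression so they can be compared against the values used elsewhere; the interpretation of each field (layers, attention heads, embedding size, batch tokens) is inferred from the name and is an assumption, and `parse_checkpoint_name` is a hypothetical helper, not part of the repository.

```python
import re

# Matches e.g. "6layers_16heads_128embedsize_4096batchToks" inside a file name.
PATTERN = re.compile(
    r"(?P<layers>\d+)layers_(?P<heads>\d+)heads_(?P<embed>\d+)embedsize_"
    r"(?P<batch_toks>\d+)batchToks"
)


def parse_checkpoint_name(filename):
    """Return the integer fields found in the name, or None if the pattern is absent."""
    match = PATTERN.search(filename)
    if match is None:
        return None
    return {key: int(value) for key, value in match.groupdict().items()}


name = ("ESM2SISS_FS4.1_fiveSpeciesCao_6layers_16heads_128embedsize_"
        "4096batchToks_lr1e-05_supervisedweight1.0_structureweight1.0_"
        "MLMLossMin_epoch93.pkl")
print(parse_checkpoint_name(name))
# → {'layers': 6, 'heads': 16, 'embed': 128, 'batch_toks': 4096}
```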

Parameters you MUST define:

2. For TE and EL task:
cd ./Scripts/UTRLM_downstream
CUDA_VISIBLE_DEVICES=0 python3 -m torch.distributed.launch --nproc_per_node=1 --master_port 9001 \
    MJ4_Finetune_extract_append_predictor_CellLine_10fold-lr-huber-DDP.py \
    --device_ids 0 --cell_line Muscle --label_type te_log --seq_type utr --inp_len 100 --huber_loss \
    --modelfile ./Model/ESM2SI_3.1_fiveSpeciesCao_6layers_16heads_128embedsize_4096batchToks_MLMLossMin.pkl \
    --finetune --bos_emb --lr 1e-2 --dropout3 0.2 --epochs 300 \
    --prefix TE_ESM2SI_3.1.1e-2.M.dropout2
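
Both fine-tuning commands pass `--huber_loss`. For reference, here is a plain-Python sketch of the standard Huber loss, which is quadratic for small errors and linear for large ones. The threshold `delta=1.0` is an assumption; the value and implementation used inside the scripts are not stated in this README.

```python
def huber(y_true, y_pred, delta=1.0):
    """Standard Huber loss for a single prediction."""
    err = abs(y_true - y_pred)
    if err <= delta:
        return 0.5 * err ** 2           # quadratic region
    return delta * (err - 0.5 * delta)  # linear region


def mean_huber(targets, preds, delta=1.0):
    """Average Huber loss over paired targets and predictions."""
    return sum(huber(t, p, delta) for t, p in zip(targets, preds)) / len(targets)


small = huber(1.0, 1.5)  # error 0.5: 0.5 * 0.5**2 = 0.125
large = huber(1.0, 4.0)  # error 3.0: 1.0 * (3.0 - 0.5) = 2.5
```

Because the penalty grows only linearly beyond `delta`, a few outlying labels pull the fit less than they would under mean squared error.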

Parameters you MUST define:

3. General Parameters:

Important Parameters
Other Parameters

Predict Process

For a quick start, see ./Scripts/UTRLM_downstream/MJ5_Predict_and_Extract_Attention_Embedding.ipynb

Note

Please replace the directory paths in the scripts and commands with your own.
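
One way to keep paths in a single place is to assemble the command from a base directory. The sketch below rebuilds the pretraining command from this README as an argument list; it assumes `base_dir` is the directory that contains `Data/`, and both `pretraining_command` and `/path/to/UTR-LM` are hypothetical names used for illustration.

```python
from pathlib import Path


def pretraining_command(base_dir, nproc=1, master_port=1234):
    """Build the pretraining command shown above with paths rooted at base_dir."""
    train_fasta = Path(base_dir) / "Data" / "Pretrained_Data" / (
        "Fivespecies_Cao_energy_structure_CaoEnergyNormalDist_255795sequence.fasta"
    )
    return [
        "python", "-m", "torch.distributed.launch",
        f"--nproc_per_node={nproc}", "--master_port", str(master_port),
        "v2DDP_ESM2_alldata_SupervisedInfo.py",
        "--prefix", "v2DDP_try", "--lr", "1e-5",
        "--layers", "6", "--heads", "6", "--embed_dim", "16",
        "--train_fasta", str(train_fasta),
        "--device_ids", "0", "--epochs", "200",
    ]


cmd = pretraining_command("/path/to/UTR-LM")  # hypothetical checkout location
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from within ./Scripts/UTRLM_pretraining.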

Reference

Chu, Yanyi, et al. "A 5'UTR Language Model for Decoding Untranslated Regions of mRNA and Function Predictions." bioRxiv (2023): 2023-10.

Contact

Please feel free to contact us at yanyichu@stanford.edu.