The untranslated region (UTR) of an RNA molecule plays a vital role in gene expression regulation. Specifically, the 5' UTR, located at the 5' end of an RNA molecule, is a critical determinant of the RNA’s translation efficiency. Language models have demonstrated their utility in predicting and optimizing the function of protein encoding sequences and genome sequences. In this study, we developed a semi-supervised language model for 5’ UTR, which is pre-trained on a combined library of random 5' UTRs and endogenous 5' UTRs from multiple species. We augmented the model with supervised information that can be computed given the sequence, including the secondary structure and minimum free energy, which improved the semantic representation.
You could run the code in CodeOcean.
An example for eight MRL library:
Training File | Test File | Splitting Strategy | Descript |
---|---|---|---|
4.1_train_data_GSM3130435_egfp_unmod_1.csv | 4.1_test_data_GSM3130435_egfp_unmod_1.csv | Rank | GSM3130435_egfp_unmod_1 |
4.2_train_data_GSM3130435_egfp_unmod_1.csv | 4.2_test_data_GSM3130435_egfp_unmod_1.csv | Random | GSM3130435_egfp_unmod_1 |
conda create -n UTRLM python==3.9.13
conda activate UTRLM
pip install -r utrlm_requirements.txt
Or
pip install pandas==1.4.3
pip3 install torch torchvision torchaudio
pip install torchsummary
pip install tqdm scikit-learn scipy matplotlib seaborn
pip install fair-esm
find -name esm
scp -r ./Scripts/esm ./.conda/envs/UTRLM/lib/python3.9/site-packages/ # Move the folder ./Scripts/esm/ to the conda env fold, such as ./.conda/envs/UTRLM/lib/python3.9/site-packages/
It is very important to Move the folder ./Scripts/esm/ to the conda env fold, such as ./.conda/envs/UTRLM/lib/python3.9/site-packages/, because we have modified the souce code of ESM.
The final parameters and checkpoints are described in the below scripts.
cd ./Scripts/UTRLM_pretraining
python -m torch.distributed.launch --nproc_per_node=1 --master_port 1234 v2DDP_ESM2_alldata_SupervisedInfo.py --prefix v2DDP_try --lr 1e-5 --layers 6 --heads 6 --embed_dim 16 --train_fasta ./Data/Pretrained_Data/Fivespecies_Cao_energy_structure_CaoEnergyNormalDist_255795sequence.fasta --device_ids 0 --epochs 200
We recommend to use the following code:
Code File | Decription |
---|---|
v2DDP_ESM2_alldata_SupervisedInfo.py | MLM+MFE |
v3DDP_ESM2_alldata_SupervisedInfo_SecondaryStructure.py | MLM+MFE+SecondaryStructure |
v4DDP_ESM2_alldata_SecondaryStructure.py | MLM+SecondaryStructure |
Which Parameters you MUST to define:
cd ./Scripts/UTRLM_downstream
CUDA_VISIBLE_DEVICES=0,1,2,3 python3 -m torch.distributed.launch --nproc_per_node=4 --master_port 5001 MJ3_Finetune_extract_append_predictor_Sample_10fold-lr-huber-DDP.py --device_ids 0,1,2,3 --label_type rl --epochs 300 --huber_loss --train_file 4.1_train_data_GSM3130435_egfp_unmod_1.csv --prefix ESM2SISS_FS4.1.ep93.1e-2.dr5 --lr 1e-2 --dropout3 0.5 --modelfile ./Model/Pretrained/ESM2SISS_FS4.1_fiveSpeciesCao_6layers_16heads_128embedsize_4096batchToks_lr1e-05_supervisedweight1.0_structureweight1.0_MLMLossMin_epoch93.pkl --finetune --bos_emb --test1fold
Which Parameters you MUST to define:
cd ./Scripts/UTRLM_downstream
CUDA_VISIBLE_DEVICES=0 python3 -m torch.distributed.launch --nproc_per_node=1 --master_port 9001 MJ4_Finetune_extract_append_predictor_CellLine_10fold-lr-huber-DDP.py --device_ids 0 --cell_line Muscle --label_type te_log --seq_type utr --inp_len 100 --huber_loss --modelfile ./Model/ESM2SI_3.1_fiveSpeciesCao_6layers_16heads_128embedsize_4096batchToks_MLMLossMin.pkl --finetune --bos_emb --lr 1e-2 --dropout3 0.2 --epochs 300 --prefix TE_ESM2SI_3.1.1e-2.M.dropout2
Which Parameters you MUST to define:
If you want to quick start, please see ./Scripts/UTRLM_downstream/MJ5_Predict_and_Extract_Attention_Embedding.ipynb
Please change the directory to your own directory.
Please feel free to contact us, my email is yanyichu@stanford.edu.