
LucaProt

LucaProt (DeepProtFunc) is an open-source project developed by Alibaba and licensed under the Apache License, Version 2.0.

This product contains various third-party components under other open source licenses.
See the NOTICE file for more information.

Notice:
This project provides the Python dependency installation file, the installation commands, and the commands for running the trained LucaProt model for inference or prediction, all of which can be found in this repository. The model runs on Linux, macOS, and Windows, and supports both CPU and GPU for inference.

Timeline

LucaProt Server

LucaProt Server (CPU) is available at: https://lucaprot.org.
Inference is limited to a maximum of 100 sequences per request.
A GPU version is coming soon.


Introduction

LucaProt: A novel deep learning framework that incorporates protein amino acid sequence and structural information to predict protein function.

1. Model

1) Model Introduction

We developed a new deep learning model, namely, Deep Sequential and Structural Information Fusion Network for Proteins Function Prediction (DeepProtFunc/LucaProt), which takes into account protein sequence and structural information to facilitate the accurate annotation of protein function.

Here, we applied LucaProt to identify viral RdRP.

2) Model Architecture

We treat protein function prediction as a classification problem. For example, viral RdRP identification is a binary-class classification task, and protein general function annotation is a multi-label classification task. The model includes five modules: Input, Tokenizer, Encoder, Pooling, and Output. Its architecture is shown in Figure 1.

Figure 1: The Architecture of LucaProt

3) Model Input/Output

The model takes the amino acid (letter) sequence as input. It outputs the function label of the input protein, which is a single tag (binary-class or multi-class classification) or a set of tags (multi-label classification).
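The following is a minimal, illustrative sketch (not LucaProt's actual API) of how scores are turned into labels for the two task types; the 0.5 cutoff matches the --threshold flag used in the prediction commands further below.

# Illustrative only: convert model scores into labels for the two task types.
def binary_label(prob, threshold=0.5):
    # 1 = positive (e.g., viral RdRP), 0 = negative
    return 1 if prob >= threshold else 0

def multi_label(probs, threshold=0.5):
    # keep every label whose probability clears the threshold
    return [name for name, p in probs.items() if p >= threshold]

print(binary_label(0.83))                                  # -> 1
print(multi_label({"kinase": 0.70, "transporter": 0.20}))  # -> ['kinase']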

2. Dependencies

System: Ubuntu 20.04.5 LTS
Python: 3.9.13
Download anaconda: anaconda
Cuda: cuda11.7 (torch==1.13.1)

# Select 'YES' during installation for initializing the conda environment  
sh Anaconda3-2022.10-Linux-x86_64.sh  
# Source the environment
source ~/.bashrc  
# Verification
conda  
# Install env and python 3.9.13   
conda create -n lucaprot python=3.9.13    
# activate env
conda activate lucaprot  
# Install git      
sudo apt-get update         
sudo apt install git-all

# Enter the project   
cd LucaProt     

# Install
pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple        
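As an optional sanity check (a minimal sketch, not part of the repository), you can run the following in the activated environment to confirm that PyTorch is installed and whether a CUDA device is visible.

# Optional sanity check: verify PyTorch and CUDA availability.
import torch

print(torch.__version__)          # expected: 1.13.1
print(torch.cuda.is_available())  # True if CUDA 11.7 and a GPU are visible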

3. Inference

You can use this project directly to run inference (prediction) on unknown sequences.

1) Prediction from one sample

cd LucaProt/src/prediction/ 
sh run_predict_one_sample.sh

Note: the embedding matrix of the sample is computed on the fly at prediction time.

Or:

cd LucaProt/src/

# using GPU(cuda=0)    
export CUDA_VISIBLE_DEVICES="0,1,2,3"
python predict_one_sample.py \
    --protein_id protein_1 \
    --sequence MTTSTAFTGKTLMITGGTGSFGNTVLKHFVHTDLAEIRIFSRDEKKQDDMRHRLQEKSPELADKVRFFIGDVRNLQSVRDAMHGVDYIFHAAALKQVPSCEFFPMEAVRTNVLGTDNVLHAAIDEGVDRVVCLSTDKAAYPINAMGKSKAMMESIIYANARNGAGRTTICCTRYGNVMCSRGSVIPLFIDRIRKGEPLTVTDPNMTRFLMNLDEAVDLVQFAFEHANPGDLFIQKAPASTIGDLAEAVQEVFGRVGTQVIGTRHGEKLYETLMTCEERLRAEDMGDYFRVACDSRDLNYDKFVVNGEVTTMADEAYTSHNTSRLDVAGTVEKIKTAEYVQLALEGREYEAVQ \
    --emb_dir ./emb/ \
    --truncation_seq_length 4096 \
    --dataset_name rdrp_40_extend \
    --dataset_type protein \
    --task_type binary_class \
    --model_type sefn \
    --time_str 20230201140320 \
    --step 100000 \
    --threshold 0.5 \
    --gpu_id 0

# using CPU(gpu_id=-1)    
python predict_one_sample.py \
    --protein_id protein_1 \
    --sequence MTTSTAFTGKTLMITGGTGSFGNTVLKHFVHTDLAEIRIFSRDEKKQDDMRHRLQEKSPELADKVRFFIGDVRNLQSVRDAMHGVDYIFHAAALKQVPSCEFFPMEAVRTNVLGTDNVLHAAIDEGVDRVVCLSTDKAAYPINAMGKSKAMMESIIYANARNGAGRTTICCTRYGNVMCSRGSVIPLFIDRIRKGEPLTVTDPNMTRFLMNLDEAVDLVQFAFEHANPGDLFIQKAPASTIGDLAEAVQEVFGRVGTQVIGTRHGEKLYETLMTCEERLRAEDMGDYFRVACDSRDLNYDKFVVNGEVTTMADEAYTSHNTSRLDVAGTVEKIKTAEYVQLALEGREYEAVQ \
    --emb_dir ./emb/ \
    --truncation_seq_length 4096 \
    --dataset_name rdrp_40_extend \
    --dataset_type protein \
    --task_type binary_class \
    --model_type sefn \
    --time_str 20230201140320 \
    --step 100000 \
    --threshold 0.5 \
    --gpu_id -1

2) Prediction from many samples

The samples are provided in a *.fasta file; prediction is performed sample by sample.
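A minimal sketch (not part of the repository) for writing such a *.fasta input file from a dictionary of sequences; the IDs and sequences below are placeholders.

# Write a FASTA file from a dictionary of placeholder sequences.
sequences = {
    "protein_1": "MTTSTAFTGKTLMITGGTGSFGNTVLKHF",   # truncated for illustration
    "protein_2": "VGGLFDYYSVPIMTLPDSWENKLLTDLIL",
}
with open("test.fasta", "w") as handle:
    for protein_id, seq in sequences.items():
        handle.write(">" + protein_id + "\n" + seq + "\n")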

cd LucaProt/src/prediction/   
sh run_predict_many_samples.sh

Or:

cd LucaProt/src/

# using GPU(cuda=0)   
export CUDA_VISIBLE_DEVICES="0,1,2,3"  
python predict_many_samples.py \
    --fasta_file ../data/rdrp/test/test.fasta  \
    --save_file ../result/rdrp/test/test_result.csv  \
    --emb_dir ../emb/   \
    --truncation_seq_length 4096  \
    --dataset_name rdrp_40_extend  \
    --dataset_type protein     \
    --task_type binary_class     \
    --model_type sefn     \
    --time_str 20230201140320   \
    --step 100000  \
    --threshold 0.5 \
    --print_per_number 10 \
    --gpu_id 0

# using CPU(gpu_id=-1)               
python predict_many_samples.py \
    --fasta_file ../data/rdrp/test/test.fasta  \
    --save_file ../result/rdrp/test/test_result.csv  \
    --emb_dir ../emb/   \
    --truncation_seq_length 4096  \
    --dataset_name rdrp_40_extend  \
    --dataset_type protein     \
    --task_type binary_class     \
    --model_type sefn     \
    --time_str 20230201140320   \
    --step 100000  \
    --threshold 0.5 \
    --print_per_number 10 \
    --gpu_id -1

3) Prediction from a file (embeddings prepared in advance)

The test data (small and real) is in demo.csv, where the 7th column of each line is the filename of the pre-computed structural embedding.
The structural embedding files are stored in embs.

The test data includes 50 viral RdRPs and 50 non-viral RdRPs.
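Before running the prediction, you can verify that every referenced embedding file exists. This is a minimal sanity-check sketch (assumptions: demo.csv has a header row, its 7th column is emb_filename, and the paths follow the commands below, relative to LucaProt/src).

# Check that every embedding file referenced in demo.csv exists.
import csv
import os

emb_dir = "../data/rdrp/demo/embs/esm2_t36_3B_UR50D"
missing = []
with open("../data/rdrp/demo/demo.csv") as f:
    reader = csv.reader(f)
    next(reader)  # skip header: prot_id, seq, ..., emb_filename, label, source
    for row in reader:
        if not os.path.exists(os.path.join(emb_dir, row[6])):
            missing.append(row[6])
print(len(missing), "embedding file(s) missing")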

cd LucaProt/src/prediction/   
sh run_predict_from_file.sh

Or:

cd LucaProt/src/

# using GPU(cuda=0)   
export CUDA_VISIBLE_DEVICES="0,1,2,3"
python predict.py \
    --data_path ../data/rdrp/demo/demo.csv \
    --emb_dir ../data/rdrp/demo/embs/esm2_t36_3B_UR50D \
    --dataset_name rdrp_40_extend \
    --dataset_type protein \
    --task_type binary_class \
    --model_type sefn \
    --time_str 20230201140320 \
    --step 100000 \
    --evaluate \
    --threshold 0.5 \
    --batch_size 16 \
    --print_per_batch 100 \
    --gpu_id 0 

# using CPU(gpu_id=-1)          
python predict.py \
    --data_path ../data/rdrp/demo/demo.csv \
    --emb_dir ../data/rdrp/demo/embs/esm2_t36_3B_UR50D \
    --dataset_name rdrp_40_extend \
    --dataset_type protein \
    --task_type binary_class \
    --model_type sefn \
    --time_str 20230201140320 \
    --step 100000 \
    --evaluate \
    --threshold 0.5 \
    --batch_size 16 \
    --print_per_batch 100 \
    --gpu_id -1

Note: the embedding matrices of all proteins in this file must be prepared in advance (in $emb_dir).

4. Eleven Independent Validation Datasets

Eleven validation datasets unrelated to the model-building dataset, comprising 7 existing viral-RdRP datasets and 4 existing non-viral-RdRP datasets.
Run the prediction script: https://github.com/alibaba/LucaProt/src/predict_many_samples.py
The performance of LucaProt on these 11 independent validation datasets:
LucaProt-Performance-On-11-Independent-Datasets.xlsx
or LucaProt Figshare

5. LucaProt App

This app predicts unlabeled protein sequences and measures the time spent.
LucaProtApp or LucaProt Figshare

6. Inference Time

LucaProt is fast because it only needs to predict the structural representation matrix rather than the complete 3D structure of the protein sequence.

Benchmark: for each sequence-length range (10 groups in total), 50 viral RdRPs and 50 non-viral RdRPs were selected per group to measure inference time.
inference_time_data_of_github.csv
or LucaProt Figshare

Note: the reported time includes the structural representation matrix inference but excludes model loading.

1) GPU (NVIDIA A100, CUDA 11.7)

Notice: when the sequence length does not exceed 1,024, a 24 GB GPU (e.g., A10) is sufficient for inference.

| Protein Seq Len Range | Average Time | Maximum Time | Minimum Time |
|---|---|---|---|
| 300 <= Len < 500 | 0.20s | 0.24s | 0.16s |
| 500 <= Len < 800 | 0.30s | 0.39s | 0.24s |
| 800 <= Len < 1,000 | 0.42s | 0.46s | 0.39s |
| 1,000 <= Len < 1,500 | 0.59s | 0.74s | 0.45s |
| 1,500 <= Len < 2,000 | 0.87s | 1.02s | 0.73s |
| 2,000 <= Len < 3,000 | 1.31s | 1.69s | 1.01s |
| 3,000 <= Len < 5,000 | 2.14s | 2.78s | 1.72s |
| 5,000 <= Len < 8,000 | 3.03s | 3.45s | 2.65s |
| 8,000 <= Len < 10,000 | 3.77s | 4.24s | 3.32s |
| 10,000 <= Len | 9.92s | 17.66s | 4.30s |

2) CPU (16 cores, 64 GB memory, Alibaba Cloud ECS)

| Protein Seq Len Range | Average Time | Maximum Time | Minimum Time |
|---|---|---|---|
| 300 <= Len < 500 | 3.97s | 5.71s | 2.77s |
| 500 <= Len < 800 | 5.78s | 7.50s | 4.48s |
| 800 <= Len < 1,000 | 8.23s | 9.41s | 7.41s |
| 1,000 <= Len < 1,500 | 11.49s | 16.42s | 9.22s |
| 1,500 <= Len < 2,000 | 17.71s | 22.36s | 14.93s |
| 2,000 <= Len < 3,000 | 26.97s | 36.68s | 20.99s |
| 3,000 <= Len < 5,000 | 45.56s | 58.42s | 35.82s |
| 5,000 <= Len < 8,000 | 56.57s | 58.17s | 55.55s |
| 8,000 <= Len < 10,000 | 57.76s | 58.86s | 56.66s |
| 10,000 <= Len | 66.49s | 76.80s | 58.42s |

3) CPU (96 cores, 768 GB memory, Alibaba Cloud ECS)

| Protein Seq Len Range | Average Time | Maximum Time | Minimum Time |
|---|---|---|---|
| 300 <= Len < 500 | 1.89s | 2.55s | 1.10s |
| 500 <= Len < 800 | 2.68s | 3.44s | 2.13s |
| 800 <= Len < 1,000 | 3.45s | 4.25s | 2.65s |
| 1,000 <= Len < 1,500 | 4.27s | 5.90s | 3.54s |
| 1,500 <= Len < 2,000 | 5.81s | 7.44s | 4.76s |
| 2,000 <= Len < 3,000 | 8.14s | 10.74s | 6.37s |
| 3,000 <= Len < 5,000 | 13.25s | 17.69s | 10.06s |
| 5,000 <= Len < 8,000 | 17.03s | 18.20s | 15.98s |
| 8,000 <= Len < 10,000 | 17.90s | 18.99s | 16.92s |
| 10,000 <= Len | 25.90s | 35.02s | 18.66s |

7. Dataset for Virus RdRP

1) Fasta

2) Structural embedding(matrix and vector)

All structural embedding files of the model-building dataset are available at: embs
The structural embedding files of the prediction data are still being prepared for release (because of the large data volume).

3) PDB (3D Structure)

All 3D-structure PDB files of the model-building dataset and the predicted data are still being prepared for release (because of the large data volume).

4) Vocab

5) Label

Viral RdRP identification is a binary-class classification task with positive and negative classes, where 0 represents a negative sample and 1 a positive sample. The label list file is dataset/rdrp_40_extend/protein/binary_class/label.txt
label.txt
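A minimal sketch (assuming label.txt lists one label per line, in index order) for mapping a predicted label index back to its label name:

# Map a predicted label index back to its name using the label list file.
with open("dataset/rdrp_40_extend/protein/binary_class/label.txt") as f:
    labels = [line.strip() for line in f if line.strip()]

predicted_index = 1          # e.g., the positive (viral RdRP) class
print(labels[predicted_index])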

6) Dataset

We constructed a dataset with 235,413 samples for model building. It included 5,979 positive samples of known viral RdRPs (i.e., the well-curated RdRP database described in the previous section of the Methods) and 229,434 negative samples of confirmed non-viral RdRPs (a 1:40 ratio of viral RdRPs to non-viral RdRPs). The negative samples comprised Eukaryota DNA-dependent RNA polymerases (Eu DdRP, N=1,184), Eukaryota RNA-dependent RNA polymerases (Eu RdRP, N=2,233), reverse transcriptases (RT, N=48,490), proteins from DNA viruses (N=1,533), non-RdRP proteins from RNA viruses (N=1,574), and a wide array of cellular proteins from different functional categories (N=174,420). We randomly divided the dataset into training, validation, and testing sets at a ratio of 8.5:1:1, used for model fitting, model finalization (based on the best-F1-score training iteration), and performance reporting (accuracy, precision, recall, F1-score, and area under the ROC curve (AUC)), respectively.
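The following is a minimal sketch (not the project's own preprocessing code) of a random 8.5:1:1 split into training, validation, and testing sets, using an index list as a stand-in for the real samples.

# Randomly split sample indices into train/dev/test at a ratio of 8.5:1:1.
import random

random.seed(42)
samples = list(range(235_413))       # stand-in for the full sample list
random.shuffle(samples)

n = len(samples)
n_train = round(n * 8.5 / 10.5)
n_dev = round(n * 1.0 / 10.5)
train = samples[:n_train]
dev = samples[n_train:n_train + n_dev]
test = samples[n_train + n_dev:]
print(len(train), len(dev), len(test))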

One row in all the above files represents one sample. All three files consist of 9 columns, including prot_id, seq, seq_len, pdb_filename, ptm, mean_plddt, emb_filename, label, and source. The details of these columns are as follows:

Note: if structure-encoder strategy one is used, pdb_filename, ptm, and mean_plddt may be null.

8. Supported Task Types

You can use this project to train models for other tasks, not just the viral RdRP identification tasks.

9. Building Your Model (for re-training or training on new datasets)

1) Prediction of protein 3D-structure (optional)

The script structure_from_esm_v1.py is in the directory "src/protein_structure"; it uses ESMFold (esmfold_v1) to predict the 3D structure of proteins.

I. Prediction from file

cd LucaProt/src/protein_structure/     

export CUDA_VISIBLE_DEVICES=0

python structure_from_esm_v1.py \
    -i data/rdrp/rdrp.fasta \
    -o pdbs/rdrp/ \
    --num-recycles 4 \
    --truncation_seq_length 4096 \
    --chunk-size 64 \
    --cpu-offload \
    --batch_size 1

Parameters:

II. Prediction from input sequences

cd LucaProt/src/protein_structure/    

export CUDA_VISIBLE_DEVICES=0

python structure_from_esm_v1.py \
    -name protein_id1,protein_id2  \
    -seq VGGLFDYYSVPIMT,LPDSWENKLLTDLILFAGSFVGSDTCGKLF \
    -o pdbs/rdrp/  \
    --num-recycles 4 \
    --truncation_seq_length 4096 \
    --chunk-size 64 \
    --cpu-offload \
    --batch_size 1

Parameters:
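The ptm and mean_plddt values used later in the dataset can be recovered from the predicted structures. Below is a minimal sketch (not part of the repository) for computing the mean pLDDT of a predicted PDB file; ESMFold writes per-residue pLDDT into the B-factor column (characters 61-66) of each ATOM record, so the sketch simply averages over ATOM lines. The output filename is hypothetical.

# Average the pLDDT values stored in the B-factor column of ATOM records.
def mean_plddt(pdb_path):
    scores = []
    with open(pdb_path) as f:
        for line in f:
            if line.startswith("ATOM"):
                scores.append(float(line[60:66]))
    return sum(scores) / len(scores) if scores else 0.0

print(mean_plddt("pdbs/rdrp/protein_id1.pdb"))  # hypothetical output filename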

2) Prediction of protein structural embedding

The script embedding_from_esmfold.py is in "src/protein_structure"; it uses ESM2 (esm2_t36_3B_UR50D) to compute protein structural embedding matrices or vectors.

I. Prediction from file

cd LucaProt/src/protein_structure/    

export CUDA_VISIBLE_DEVICES=0  

python embedding_from_esmfold.py \
    --model_name esm2_t36_3B_UR50D \
    --file data/rdrp.fasta \
    --output_dir emb/rdrp/ \
    --include per_tok contacts bos \
    --truncation_seq_length 4094 

Parameters:

II. Prediction from input sequences

cd LucaProt/src/protein_structure/     

export CUDA_VISIBLE_DEVICES=0  

python embedding_from_esmfold.py \
    --model_name esm2_t36_3B_UR50D \
    -name protein_id1,protein_id2 \
    -seq VGGLFDYYSVPIMT,LPDSWENKLLTDLILFAGSFVGSDTCGKLF \
    --output_dir embs/rdrp/test/ \
    --include per_tok contacts bos \
    --truncation_seq_length 4094

Parameters:
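To check what was written, you can inspect a saved embedding file with torch.load. This is a minimal sketch; the exact keys depend on the --include flags above (per_tok, contacts, bos), so it only reports what is stored, and the filename is hypothetical.

# Inspect a saved embedding file and report its stored keys/shapes.
import torch

obj = torch.load("embs/rdrp/test/protein_id1.pt", map_location="cpu")
if isinstance(obj, dict):
    for key, value in obj.items():
        print(key, getattr(value, "shape", type(value)))
else:
    print(type(obj))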

3) Construct dataset for model building

Construct your dataset and randomly divide it into training, validation, and testing sets at a specified ratio, then save the three sets in dataset/${dataset_name}/${dataset_type}/${task_type} as train.csv, dev.csv, and test.csv.

The file format can be .csv (must include the header) or .txt (no header required).

Each line of a file is one sample with 9 columns: prot_id, seq, seq_len, pdb_filename, ptm, mean_plddt, emb_filename, label, and source.

Column seq is the amino acid sequence. Column pdb_filename is the saved PDB filename used by structure-encoder strategy 2. Columns ptm and mean_plddt are optional and are produced by the 3D-structure prediction model. Column emb_filename is the saved embedding filename used by structure-encoder strategy 1. Column label is the sample class (a single value, or a list of label indices or label names). Column source is the sample source (optional).

For example:

like_YP_009351861.1_Menghai_flavivirus,MEQNG...,3416,,,,embedding_21449.pt,1,rdrp
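A minimal sketch (not part of the repository) that writes one such sample row with the 9 columns described above; pdb_filename, ptm, and mean_plddt are left empty, as allowed for structure-encoder strategy one.

# Write one dataset row (9 columns) to a CSV file with a header.
import csv

row = {
    "prot_id": "like_YP_009351861.1_Menghai_flavivirus",
    "seq": "MEQNG...",           # the full amino acid sequence in practice
    "seq_len": 3416,
    "pdb_filename": "",
    "ptm": "",
    "mean_plddt": "",
    "emb_filename": "embedding_21449.pt",
    "label": 1,
    "source": "rdrp",
}
with open("train.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=list(row.keys()))
    writer.writeheader()         # .csv files must include the header
    writer.writerow(row)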

Note: if your dataset is too large to load into memory at once, use "src/data_process/data_preprocess_into_tfrecords_for_rdrp.py" to convert it into tfrecords, and then create an index file: python -m tfrecord.tools.tfrecord2idx xxxx.tfrecords xxxx.index

4) Training the model

5) Training Logging Information

logs

The running information is saved in "logs/${dataset_name}/${dataset_type}/${task_type}/${model_type}/${time_str}/logs.txt".

The information includes the model configuration, model layers, running parameters, and evaluation information.

models

The checkpoints are saved in "models/${dataset_name}/${dataset_type}/${task_type}/${model_type}/${time_str}/checkpoint-${global_step}/". This directory includes "pytorch_model.bin", "config.json", "training_args.bin", and the tokenizer information ("sequence" or "struct"). The details are shown in Figure 2.

Figure 2: The File List in Checkpoint Dir Path

tb-logs

The metrics are recorded in "tb-logs/${dataset_name}/${dataset_type}/${task_type}/${model_type}/${time_str}/events.out.tfevents.xxxxx.xxxxx"

run: tensorboard --logdir=tb-logs/${dataset_name}/${dataset_type}/${task_type}/${model_type}/${time_str} --bind_all

predicts

The predicted results are saved in "predicts/${dataset_name}/${dataset_type}/${task_type}/${model_type}/${time_str}/checkpoint-${global_step}", including:

The details are shown in Figure 3.

Figure 3: The File List in Prediction Dir Path

Note: when using the saved model for prediction, the "logs.txt" file and the checkpoint directory path are used.

10. Related to the Project

1) ClstrSearch

A conventional approach that clustered all proteins based on their sequence homology.

See ClstrSerch/README.md for details.

2) src

Construct RdRP Dataset for Model Building

*.py in "src/data_preprocess"

Model

*.py in "src/SSFN"

Prediction Shell Script

*.sh in "src/prediction"
including:

We perform ablation studies on our model by removing specific modules (the sequence-specific and embedding-specific modules) one at a time to explore their relative importance.

Baselines

*.py in "src/baselines", using the embedding vector as the input, including:

Baselines for Deep Learning

*.py in "src/deep_baselines", including:

CHEER: HierarCHical taxonomic classification for viral mEtagEnomic data via deep learning (2021). code: CHEER

VirHunter: A Deep Learning-Based Method for Detection of Novel RNA Viruses in Plant Sequencing Data (2022). code: VirHunter

Virtifier: a deep learning-based identifier for viral sequences from metagenomes (2022). code: Virtifier

RNN-VirSeeker: A Deep Learning Method for Identification of Short Viral Sequences From Metagenomes. code: RNN-VirSeeker

Contact Map Generator

*.py in "src/biotoolbox"

Loss & Metrics

*.py in "src/common"

Training Model

*.sh in "src/training"

Prediction of Model

*.sh in "src/prediction"

3) Data

Raw Data

The raw data is in "data/".

Dataset

The dataset files are in "dataset/${dataset_name}/${dataset_type}/${task_type}/".

4) Model Configuration

The configuration files of all methods are in "config/${dataset_name}/${dataset_type}/${task_type}/".

5) Pic

Some pictures are in "pics/".

6) Plot

The scripts for plotting figures are in "src/plot".

7) Spider

The code and results of the geographic information spider are in "src/geo_map".

11. Open Resource

The open resources of our study include 10 subdirectories: Known_RdRPs, Benchmark, Results, All_Contigs, All_Protein_Sequences, SG_predicted_protein_structure_supplementation/, Serratus, Self_Sequencing_Proteins, Self_Sequencing_Reads, and LucaProt.
Please refer to README.md or LucaProt Figshare.

LucaProt/ includes resources related to LucaProt: the dataset for model building (dataset_for_model_building), the dataset for model evaluation (dataset_for_model_evaluation), and our trained model (logs/ and models/).

1) Code

As mentioned above.

2) Dataset

Model Building Dataset

Model Testing (Validation) Dataset

Results

Self-Samples

3) Trained Model

The trained model for RdRP identification is available at:
Notice: these files are already included in this GitHub project, so you do not need to download them.

12. Contributor

LucaTeam:
Yong He, Zhaorong Li, Xin Hou, Mang Shi, Pan Fang

13. FTP

FTP: all data of LucaProt is available at: Open Resources
Figshare: https://doi.org/10.6084/m9.figshare.26298802.v13

14. Citation

Cell:
https://www.cell.com/cell/fulltext/S0092-8674(24)01085-7

@article{LucaProt,
author = {Xin Hou, Yong He, Pan Fang, Shi-Qiang Mei, Zan Xu, Wei-Chen Wu, Jun-Hua Tian, Shun Zhang, Zhen-Yu Zeng, Qin-Yu Gou, Gen-Yang Xin, Shi-Jia Le, Yin-Yue Xia, Yu-Lan Zhou, Feng-Ming Hui, Yuan-Fei Pan, John-Sebastian Eden, Zhao-Hui Yang, Chong Han, Yue-Long Shu, Deyin Guo, Jun Li, Edward C Holmes, Zhao-Rong Li and Mang Shi},
title = {Using artificial intelligence to document the hidden RNA virosphere},
year = {2024},
doi = {10.1016/j.cell.2024.09.027},
publisher = {Cell Press},
URL = {https://doi.org/10.1016/j.cell.2024.09.027},
eprint = {https://www.cell.com/cell/fulltext/S0092-8674(24)01085-7},
journal = {Cell}
}