HanwenXuTHU / BioTranslatorProject

MIT License
33 stars 7 forks source link

BioTranslator

Section 1: Introduction

BioTranslator is a cross-modal translator which can annotate biology instances only using user-written texts. The codes here can reproduce the main results in BioTranslator paper, including zero-shot protein function prediction, zero-shot cell type prediction, and predict the nodes and edges of a gene pathway. BioTranslator takes a user-written textual description of the new discovery and then translates this description to a non-text biological data instance. Our tool frees scientists from limiting their analysis within predefined controlled vocabularies, thus accelerating new biomedical discoveries.

Update 08/05/2022: We add the fairseq version of BioTranslator. Please refer to the BioTranslator_Fairseq/scripts/ for how to perform the protein function prediction and cell type discovery.

Update 04/11/2023: We now released our BioTranslator python package and tutorials!

Section 2: Installation Tutorial

Section 2.1: System Requirement

BioTranslator is implemented using Python 3.7 in LINUX. BioTranslator requires torch==1.7.1+cu110, torchvision==0.8.2+cu110, numpy, pandas, sklearn, transformers, networkx, seaborn, tokenizers and so on. BioTranslator requires you have one GPU device to run the codes.

Section 2.2: How to use our codes

The function annotation, cell type discovery and pathway analysis task in our paper are put in Protein, SingleCell and Pathway respectively. The main codes are in the BioTranslator folder and the struture of the project is

BioTranslator/  
├── __init__.py/
├── BioConfig.py/ 
├── BioLoader.py/ 
├── BioMetrics.py/ 
├── BioModel.py/ 
├── BioTrain.py/   
├── BioUtils.py/  

The first step of BioTranslator is to train a text encoder with contrastive learning on 225 ontologies data.

Section 2.3 Train a text encoder

First please download the Graphine dataset and unzip it. Then you need to specify the path where you unzip the Graphine dataset and the path you save the trained text encoder in TextEncoder/train_text_encoder.py.

For example,

# the path where you save the model
save_model = 'model/text_encoder.pth'
# the path where you store the data
graphine_repo = '/data/Graphine/dataset/'

The training process will take several hours, please wait patiently or you can directly download the trained text encoder model.

Section 2.4 New Functions Annotation

You can run Protein/main.py to reproduce results of protein function prediction. We provide the command line interface. We also release the codes of baselines here. You can specify which method you like to run. First please download the Protein_Pathway dataset provided in our paper and unzip it. The following command lines can reproduce our results in the zero shot task. For example, you can run BioTranslator on the GOA (Human) dataset.

python Protein/main.py --method BioTranslator --dataset GOA_Human --data_repo /data/ProteinDataset --task zero_shot --encoder_path model/text_encoder.pth --emb_path /embeddings
working_space/  
├── zero_shot/  
│   ├── log/  
│   ├── results/ 
|   └── model/
└── few_shot/  
    ├── log/  
    ├── results/ 
    └── model/ 

The inference results will be saved in results/$method$_$dataset$.pkl. Then you can run the codes on different dataset.

python Protein/main.py --method BioTranslator --dataset GOA_Human --data_repo /data/ProteinDataset --task zero_shot --encoder_path model/text_encoder.pth --emb_path /embeddings
python Protein/main.py --method BioTranslator --dataset GOA_Mouse --data_repo /data/ProteinDataset --task zero_shot --encoder_path model/text_encoder.pth --emb_path /embeddings
python Protein/main.py --method BioTranslator --dataset GOA_Yeast --data_repo /data/ProteinDataset --task zero_shot --encoder_path model/text_encoder.pth --emb_path /embeddings
python Protein/main.py --method BioTranslator --dataset SwissProt --data_repo /data/ProteinDataset --task zero_shot --encoder_path model/text_encoder.pth --emb_path /embeddings
python Protein/main.py --method BioTranslator --dataset CAFA3 --data_repo /data/ProteinDataset --task zero_shot --encoder_path model/text_encoder.pth --emb_path /embeddings

Run the codes using different baselines/

python Protein/main.py --method BioTranslator --dataset GOA_Human --data_repo /data/ProteinDataset --task zero_shot --encoder_path model/text_encoder.pth --emb_path /embeddings
python Protein/main.py --method ProTranslator --dataset GOA_Human --data_repo /data/ProteinDataset --task zero_shot --encoder_path model/text_encoder.pth --emb_path /embeddings
python Protein/main.py --method TFIDF --dataset GOA_Human --data_repo /data/ProteinDataset --task zero_shot --encoder_path model/text_encoder.pth --emb_path /embeddings
python Protein/main.py --method clusDCA --dataset GOA_Human --data_repo /data/ProteinDataset --task zero_shot --encoder_path model/text_encoder.pth --emb_path /embeddings
python Protein/main.py --method Word2Vec --dataset GOA_Human --data_repo /data/ProteinDataset --task zero_shot --encoder_path model/text_encoder.pth --emb_path /embeddings
python Protein/main.py --method Doc2Vec --dataset GOA_Human --data_repo /data/ProteinDataset --task zero_shot --encoder_path model/text_encoder.pth --emb_path /embeddings

Run the codes to perform the few shot prediction task.

python Protein/main.py --method BioTranslator --dataset GOA_Human --data_repo /data/ProteinDataset --task few_shot --encoder_path model/text_encoder.pth --emb_path /embeddings

The results of few shot task with blast will be saved in results/$method$_$dataset$_blast.pkl.

Section 2.5 New Cell Type Discovery

You can run SingleCell/main.py to reproduce results of new cell type annotation.

python SingleCell/main.py --dataset muris_droplet --data_repo /data/sc_data --task same_dataset --encoder_path model/text_encoder.pth --emb_path /embeddings
working_space/  
├── one_dataset/  
│   ├── log/  
│   ├── results/ 
|   └── model/
└── cross_dataset/  
    ├── log/  
    ├── results/ 
    └── model/ 

You can also run the codes to reproduce the results of cross-dataset validation.

python SingleCell/main.py --dataset muris_droplet --data_repo /data/sc_data --task cross_dataset --encoder_path model/text_encoder.pth --emb_path /embeddings

Section 2.6 Pathway Analysis

In this section, we will show how to predict the nodes and links in a pathway. You can run Pathway/main.py to perform pathway analysis.

python Pathway/main.py --pathway_dataset KEGG --dataset GOA_Human --data_repo /data/Protein_Pathway_data/ --encoder_path model/text_encoder.pth --emb_path /embeddings
python Pathway/main.py --pathway_dataset KEGG --dataset GOA_Human --data_repo /data/Protein_Pathway_data/ --encoder_path model/text_encoder.pth --emb_path /embeddings
python Pathway/main.py --pathway_dataset KEGG --dataset GOA_Human --data_repo /data/Protein_Pathway_data/ --encoder_path model/text_encoder.pth --emb_path /embeddings

The authors are trying to make BioTranslator easy-to-use, but it's impossible to include every detail of our algorithm in one document. So if you have any question about the software, feel free to contact us (xuhanwenthu@gmail.com).