ProtTrans

ProtTrans provides state-of-the-art pre-trained models for proteins. The models were trained on thousands of GPUs from Summit and hundreds of Google TPUs using various Transformer models.

For more information about our work, have a look at our paper "ProtTrans: cracking the language of life’s code through self-supervised deep learning and high performance computing".

ProtTrans Attention Visualization

This repository will be updated regularly with new pre-trained models for proteins to support the bioinformatics community in general, and Covid-19 research specifically, through our project "Accelerate SARS-CoV-2 research with transfer learning using pre-trained language modeling models".

⌛️ News
🚀 Installation
🚀 Quick Start
⌛️ Models Availability
⌛️ Dataset Availability
🚀 Usage
📊 Original downstream Predictions
📊 Followup use-cases
📊 Comparisons to other tools
❤️ Community and Contributions
📫 Have a question?
🤝 Found a bug?
✅ Requirements
🤵 Team
💰 Sponsors
📘 License
✏️ Citation

⌛️ News

2023/07/14: FineTuning with LoRA provides notebooks on how to fine-tune ProtT5 on both per-residue and per-protein tasks, using Low-Rank Adaptation (LoRA) for efficient fine-tuning (thanks @0syrys!).
2022/11/18: Availability: LambdaPP offers a simple web service to access ProtT5-based predictions, and UniProt now offers pre-computed ProtT5 embeddings for download for a subset of selected organisms.

🚀 Installation

All our models are available via huggingface/transformers:

pip install torch
pip install transformers
pip install sentencepiece

For more details, please follow the instructions for transformers installations.

A recently introduced change in the T5 tokenizer results in UnboundLocalError: cannot access local variable 'sentencepiece_model_pb2'. This can be fixed either by installing this PR or by manually installing protobuf:

pip install protobuf

If you are using a transformers version released after this PR, you will see this warning. Explicitly setting legacy=True results in the expected behavior and avoids the warning. You can also safely ignore the warning, as legacy=True is the default.

🚀 Quick Start

Example for how to derive embeddings from our best-performing protein language model, ProtT5-XL-U50 (aka ProtT5); also available as colab:

from transformers import T5Tokenizer, T5EncoderModel
import torch
import re

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

# Load the tokenizer
tokenizer = T5Tokenizer.from_pretrained('Rostlab/prot_t5_xl_half_uniref50-enc', do_lower_case=False)

# Load the model
model = T5EncoderModel.from_pretrained("Rostlab/prot_t5_xl_half_uniref50-enc").to(device)

# only GPUs support half-precision currently; if you want to run on CPU use full-precision (not recommended, much slower)
if device == torch.device("cpu"):
    model.to(torch.float32)

# prepare your protein sequences as a list
sequence_examples = ["PRTEINO", "SEQWENCE"]

# replace all rare/ambiguous amino acids by X and introduce white-space between all amino acids
sequence_examples = [" ".join(list(re.sub(r"[UZOB]", "X", sequence))) for sequence in sequence_examples]

# tokenize sequences and pad up to the longest sequence in the batch
ids = tokenizer(sequence_examples, add_special_tokens=True, padding="longest")

input_ids = torch.tensor(ids['input_ids']).to(device)
attention_mask = torch.tensor(ids['attention_mask']).to(device)

# generate embeddings
with torch.no_grad():
    embedding_repr = model(input_ids=input_ids, attention_mask=attention_mask)

# extract residue embeddings for the first ([0,:]) sequence in the batch and remove padded & special tokens ([0,:7]) 
emb_0 = embedding_repr.last_hidden_state[0,:7] # shape (7 x 1024)
# same for the second ([1,:]) sequence but taking into account different sequence lengths ([1,:8])
emb_1 = embedding_repr.last_hidden_state[1,:8] # shape (8 x 1024)

# if you want to derive a single representation (per-protein embedding) for the whole protein
emb_0_per_protein = emb_0.mean(dim=0) # shape (1024)
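
The preprocessing and slicing steps above can be wrapped in small helpers. The following is a minimal sketch, not part of the ProtTrans code base: the function names `prepare_sequence` and `residue_lengths` are hypothetical, and `residue_lengths` assumes that the tokenizer pads on the right and appends exactly one special end-of-sequence token per sequence, which is what the hard-coded slices `[0,:7]` and `[1,:8]` imply.

```python
import re


def prepare_sequence(sequence):
    """Map rare/ambiguous amino acids (U, Z, O, B) to X and space-separate residues."""
    return " ".join(re.sub(r"[UZOB]", "X", sequence))


def residue_lengths(attention_mask):
    """Number of real residues per sequence, given a right-padded 0/1 attention mask.

    Assumption: each sequence ends with exactly one special token that is
    counted by the mask, so it is subtracted here.
    """
    return [sum(row) - 1 for row in attention_mask]


# Example with the two toy sequences from above (mask written out by hand)
prepared = [prepare_sequence(s) for s in ["PRTEINO", "SEQWENCE"]]
# Row 0: 7 residues + 1 special token + 1 pad; row 1: 8 residues + 1 special token
mask = [[1] * 8 + [0], [1] * 9]
lengths = residue_lengths(mask)
```

With these lengths, the per-residue slices could be written as `embedding_repr.last_hidden_state[i, :lengths[i]]` instead of using hard-coded numbers.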

We also have a script which simplifies deriving per-residue and per-protein embeddings from ProtT5 for a given FASTA file:

python prott5_embedder.py --input sequences/some.fasta --output embeddings/residue_embeddings.h5
python prott5_embedder.py --input sequences/some.fasta --output embeddings/protein_embeddings.h5 --per_protein 1
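
The script takes a FASTA file as input. As a hedged illustration of what such a file could look like, the sketch below writes and re-reads a minimal FASTA file using only the Python standard library. The helper names `write_fasta` and `read_fasta` are hypothetical and not part of this repository, and the header conventions expected by `prott5_embedder.py` are not specified here; plain `>identifier` headers are assumed.

```python
import os
import tempfile


def write_fasta(records, path):
    """Write a dict {identifier: sequence} as a FASTA file, one sequence per line."""
    with open(path, "w") as handle:
        for identifier, sequence in records.items():
            handle.write(f">{identifier}\n{sequence}\n")


def read_fasta(path):
    """Read a FASTA file into a dict {identifier: sequence}, joining wrapped lines."""
    records = {}
    current = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                current = line[1:]
                records[current] = ""
            elif current is not None:
                records[current] += line
    return records


# Round trip with the toy sequences from the Quick Start example
toy = {"seq0": "PRTEINO", "seq1": "SEQWENCE"}
fasta_path = os.path.join(tempfile.mkdtemp(), "some.fasta")
write_fasta(toy, fasta_path)
loaded = read_fasta(fasta_path)
```

A file written this way could then be passed to the script via `--input`.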

⌛️ Models Availability

Model	Hugging Face	Zenodo	Colab
ProtT5-XL-UniRef50 (also ProtT5-XL-U50)	Download	Download	Colab
ProtT5-XL-BFD	Download	Download
ProtT5-XXL-UniRef50	Download	Download
ProtT5-XXL-BFD	Download	Download
ProtBert-BFD	Download	Download
ProtBert	Download	Download
ProtAlbert	Download	Download
ProtXLNet	Download	Download
ProtElectra-Generator-BFD	Download	Download
ProtElectra-Discriminator-BFD	Download	Download

⌛️ Datasets Availability

Dataset	Dropbox
NEW364	Download
Netsurfp2	Download
CASP12	Download
CB513	Download
TS115	Download
DeepLoc Train	Download
DeepLoc Test	Download

🚀 Usage

How to use ProtTrans:

🧬 Feature Extraction (FE):
Please check: Embedding Section. Colab example for feature extraction via ProtT5-XL-U50

🚀 Logits Extraction:
For ProtT5-logits extraction, please check: VESPA logits script.

💥 Fine Tuning (FT):
Please check: Fine Tuning Section. More information coming soon.

🧠 Prediction:
Please check: Prediction Section. Colab example for secondary structure prediction via ProtT5-XL-U50 and Colab example for subcellular localization prediction as well as differentiation between membrane-bound and water-soluble proteins via ProtT5-XL-U50.

⚗️ Protein Sequences Generation:
Please check: Generate Section. More information coming soon.

🧐 Visualization:
Please check: Visualization Section. More information coming soon.

📈 Benchmark:
Please check: Benchmark Section. More information coming soon.

📊 Original downstream Predictions

🧬 Secondary Structure Prediction (Q3):

Model	CASP12	TS115	CB513
ProtT5-XL-UniRef50	81	87	86
ProtT5-XL-BFD	77	85	84
ProtT5-XXL-UniRef50	79	86	85
ProtT5-XXL-BFD	78	85	83
ProtBert-BFD	76	84	83
ProtBert	75	83	81
ProtAlbert	74	82	79
ProtXLNet	73	81	78
ProtElectra-Generator	73	78	76
ProtElectra-Discriminator	74	81	79
ProtTXL	71	76	74
ProtTXL-BFD	72	75	77

🆕 Predict your sequence live on predictprotein.org.

🧬 Secondary Structure Prediction (Q8):

Model	CASP12	TS115	CB513
ProtT5-XL-UniRef50	70	77	74
ProtT5-XL-BFD	66	74	71
ProtT5-XXL-UniRef50	68	75	72
ProtT5-XXL-BFD	66	73	70
ProtBert-BFD	65	73	70
ProtBert	63	72	66
ProtAlbert	62	70	65
ProtXLNet	62	69	63
ProtElectra-Generator	60	66	61
ProtElectra-Discriminator	62	69	65
ProtTXL	59	64	59
ProtTXL-BFD	60	65	60

🆕 Predict your sequence live on predictprotein.org.
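
Q3 and Q8 in the two tables above denote per-residue accuracy over three and eight secondary structure states, respectively, reported in percent. The following is a minimal sketch of how such a score can be computed. The function names are hypothetical, and the eight-to-three state reduction is one common convention that is assumed here rather than taken from this repository.

```python
# Assumed 8-state to 3-state reduction: helices -> H, strands -> E, everything else -> C
EIGHT_TO_THREE = {
    "H": "H", "G": "H", "I": "H",
    "E": "E", "B": "E",
    "T": "C", "S": "C", "C": "C",
}


def q_accuracy(predicted, observed):
    """Percentage of residues whose predicted state matches the observed state."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    if not observed:
        raise ValueError("sequences must not be empty")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * matches / len(observed)


def to_three_states(states):
    """Reduce an 8-state string to 3 states using the mapping above."""
    return "".join(EIGHT_TO_THREE[s] for s in states)


# Toy example: two of ten residues differ in 8 states, none after the reduction
observed_q8 = "HHHGTEEBCC"
predicted_q8 = "HHHHTEEECC"
q8 = q_accuracy(predicted_q8, observed_q8)
q3 = q_accuracy(to_three_states(predicted_q8), to_three_states(observed_q8))
```

In this toy example, the eight-state score is lower than the three-state score because confusing G with H or B with E counts as an error only in eight states.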

🧬 Membrane-bound vs Water-soluble (Q2):

Model	DeepLoc
ProtT5-XL-UniRef50	91
ProtT5-XL-BFD	91
ProtT5-XXL-UniRef50	89
ProtT5-XXL-BFD	90
ProtBert-BFD	89
ProtBert	89
ProtAlbert	88
ProtXLNet	87
ProtElectra-Generator	85
ProtElectra-Discriminator	86
ProtTXL	85
ProtTXL-BFD	86

🧬 Subcellular Localization (Q10):

Model	DeepLoc
ProtT5-XL-UniRef50	81
ProtT5-XL-BFD	77
ProtT5-XXL-UniRef50	79
ProtT5-XXL-BFD	77
ProtBert-BFD	74
ProtBert	74
ProtAlbert	74
ProtXLNet	68
ProtElectra-Generator	59
ProtElectra-Discriminator	70
ProtTXL	66
ProtTXL-BFD	65

📊 Use-cases

Level	Type	Tool	Task	Manuscript	Webserver
Protein	Function	Light Attention	Subcellular localization	Light attention predicts protein location from the language of life	(Web-server)
Residue	Function	bindEmbed21	Binding Residues	Protein embeddings and deep learning predict binding residues for various ligand classes	(Coming soon)
Residue	Function	VESPA	Conservation & effect of Single Amino Acid Variants (SAVs)	Embeddings from protein language models predict conservation and variant effects	(coming soon)
Protein	Structure	ProtTucker	Protein 3D structure similarity prediction	Contrastive learning on protein embeddings enlightens midnight zone at lightning speed
Residue	Structure	ProtT5dst	Protein 3D structure prediction	Protein language model embeddings for fast, accurate, alignment-free protein structure prediction

📊 Comparison to other protein language models (pLMs)

While developing the use-cases, we compared ProtTrans models to other protein language models, for instance the ESM models. To focus on the effect of changing input representations, the following comparisons use the same architectures on top of different embedding inputs.

Task/Model	ProtBERT-BFD	ProtT5-XL-U50	ESM-1b	ESM-1v	Metric	Reference
Subcell. loc. (setDeepLoc)	80	86	83	-	Accuracy	Light-attention
Subcell. loc. (setHard)	58	65	62	-	Accuracy	Light-attention
Conservation (ConSurf-DB)	0.540	0.596	0.563	-	MCC	ConsEmb
Variant effect (DMS-data)	-	0.53	-	0.49	Spearman (Mean)	VESPA
Variant effect (DMS-data)	-	0.53	-	0.53	Spearman (Median)	VESPA
CATH superfamily (unsup.)	18	64	57	-	Accuracy	ProtTucker
CATH superfamily (sup.)	39	76	70	-	Accuracy	ProtTucker
Binding residues	-	39	32	-	F1	bindEmbed21

Important note on ProtT5-XL-UniRef50 (dubbed ProtT5-XL-U50): all performances were measured using only embeddings extracted from the encoder side of the underlying T5 model, as described here. Also, experiments were run in half-precision mode (model.half()) to speed up embedding generation. No performance degradation was observed in any of the experiments when running in half-precision.

❤️ Community and Contributions

The ProtTrans project is an open-source project supported by various partner companies and research institutions. We are committed to sharing all our pre-trained models and knowledge. We would be more than happy if you could help us by sharing new pre-trained models, fixing bugs, proposing new features, improving our documentation, spreading the word, or supporting our project.

📫 Have a question?

We are happy to answer your questions on the ProtTrans issues page! If you have a private question or want to cooperate with us, you can always reach out to us directly via our RostLab email.

🤝 Found a bug?

Feel free to file a new issue with a descriptive title and description on the ProtTrans repository. If you have already found a solution to your problem, we would love to review your pull request!

✅ Requirements

For protein feature extraction or fine-tuning of our pre-trained models, PyTorch and the Transformers library from Hugging Face are needed. For model visualization, you need to install the BertViz library.
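
To check whether these libraries are importable in the current environment, a small standard-library sketch such as the following can be used. The function name `missing_packages` is hypothetical, and the import names `torch`, `transformers`, `sentencepiece`, and `bertviz` are assumed to match the installed distributions.

```python
import importlib.util


def missing_packages(names):
    """Return the subset of top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Libraries for feature extraction and fine-tuning, plus BertViz for visualization
required = ["torch", "transformers", "sentencepiece", "bertviz"]
missing = missing_packages(required)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable.")
```

Any package reported as missing can be installed with pip as described in the Installation section.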

🤵 Team

Technical University of Munich:

Ahmed Elnaggar	Michael Heinzinger	Christian Dallago	Ghalia Rehawi	Burkhard Rost

Med AI Technology:

Yu Wang

Google:

Llion Jones

Nvidia:

Tom Gibbs	Tamas Feher	Christoph Angerer

Seoul National University:

Martin Steinegger

ORNL:

Debsindhu Bhowmik

💰 Sponsors

Nvidia	Google	Google	ORNL	Software Campus

📘 License

The ProtTrans pretrained models are released under the terms of the Academic Free License v3.0.

✏️ Citation

If you use this code or our pretrained models for your publication, please cite the original paper:

@ARTICLE{9477085,
author={Elnaggar, Ahmed and Heinzinger, Michael and Dallago, Christian and Rehawi, Ghalia and Yu, Wang and Jones, Llion and Gibbs, Tom and Feher, Tamas and Angerer, Christoph and Steinegger, Martin and Bhowmik, Debsindhu and Rost, Burkhard},
journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
title={ProtTrans: Towards Cracking the Language of Lifes Code Through Self-Supervised Deep Learning and High Performance Computing},
year={2021},
volume={},
number={},
pages={1-1},
doi={10.1109/TPAMI.2021.3095381}}
