bayer-science-for-a-better-life / topefind-public

Finding the pitfalls of deep learning predictors of interacting residues in antibodies 🦠


Antibodies are Y-shaped proteins and one of the most essential components of the adaptive immune system, since they interact with antigens. The ability to engineer antibodies can significantly impact the development of therapeutics, and paratope prediction plays an essential role there. Given an antibody's sequence, predicting the paratope means finding the specific amino acids that bind to its target. This repository provides the tools to generate insights into paratope prediction, ultimately guiding model development towards faster and better production of therapeutics.
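
Concretely, paratope prediction is per-residue binary classification. A toy illustration (sequence and labels below are made up, not taken from any dataset):

# Toy example: 1 marks a paratope residue, 0 a non-binding one.
antibody_sequence = "EVQLVESGGGLVQPGG"
paratope_labels = [0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0]
assert len(antibody_sequence) == len(paratope_labels)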

Figure: the AlphaFold-Multimer contact-derived paratope vs. the ESM2 650M + random forest (transfer learning) predicted paratope on PDB entry 6A0Z.


Installation

Installation is currently supported via a conda environment.
Using Mambaforge is strongly encouraged:
github.com/conda-forge/miniforge

After having installed mamba or conda, clone the repository and create the environment:

# Clone it
git clone https://github.com/bayer-science-for-a-better-life/topefind-public

# Enter
cd topefind-public

# Create environment
mamba env create -f environment.yml

# Activate environment
mamba activate topefind

Dashboard

This repo comes with a dashboard, built with Panel, to analyze the results of several models on a test set. It can run on the fly in WASM; be aware that it might take a few minutes to install all packages from micropip and fully load. The dashboard can easily be modified to visualize new results: clone the repo and provide a new DataFrame.

https://github.com/bayer-science-for-a-better-life/topefind-public/assets/28055473/c76e4506-0aab-4318-954d-54a89108c206
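As a rough illustration of such a DataFrame (the column names below are hypothetical, not the dashboard's actual schema):

import pandas as pd

# Hypothetical schema: one row per residue with the true label and a
# model's predicted probability; check the repo for the real format.
results = pd.DataFrame({
    "pdb": ["6a0z", "6a0z"],
    "chain_type": ["heavy", "heavy"],
    "imgt": ["111", "112"],
    "label": [1, 0],
    "esm2_650m_rf": [0.83, 0.12],
})
results.to_parquet("my_results.parquet")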

Motivation

What can you do with better predictions?

Why is it so difficult?

Strange behaviour of current models and transfer learning

|            | **Model**          | AP (Mean ± Std) | AP@5 (Mean ± Std) | AP@10 (Mean ± Std) |
|------------|--------------------|-----------------|-------------------|--------------------|
| Baselines  | AF2M               | 0.32±0.19       | 0.41±0.29         | 0.40±0.24          |
|            | PosFreq            | 0.65±0.21       | 0.37±0.26         | 0.33±0.18          |
| End-to-end | Parapred           | 0.69±0.22       | 0.82±0.27         | 0.78±0.24          |
|            | Paragraph Unpaired | 0.74±0.22       | **0.87**±0.25     | **0.83**±0.23      |
| LM + RF    | ESM2 8M + RF       | 0.73±0.21       | 0.85±0.25         | 0.80±0.23          |
|            | ESM2 35M + RF      | 0.74±0.21       | 0.85±0.25         | 0.81±0.23          |
|            | ESM2 150M + RF     | 0.74±0.21       | 0.84±0.26         | 0.81±0.23          |
|            | ESM2 650M + RF     | **0.75**±0.21   | 0.85±0.25         | 0.81±0.23          |
|            | ProtT5 XL + RF     | **0.75**±0.21   | 0.86±0.25         | 0.81±0.23          |
|            | RITA XL + RF       | 0.70±0.22       | 0.82±0.26         | 0.78±0.24          |
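
Here AP denotes average precision (the area under the precision-recall curve) per antibody chain, averaged over the test set. A minimal sketch with scikit-learn; the AP@k variant below is our hypothetical reading (restricting scoring to the k top-ranked residues), not necessarily the repository's exact definition:

import numpy as np
from sklearn.metrics import average_precision_score

def ap(y_true, y_score):
    # Average precision over all residues of one chain.
    return average_precision_score(y_true, y_score)

def ap_at_k(y_true, y_score, k=10):
    # Hypothetical reading of AP@k: keep only the k residues with the
    # highest predicted scores before computing average precision.
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    top = np.argsort(y_score)[::-1][:k]
    if y_true[top].sum() == 0:
        return 0.0  # No positives among the top-k residues.
    return average_precision_score(y_true[top], y_score[top])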

Why is this happening?

|                             | **Model**             | AP (Mean ± Std) | AP@5 (Mean ± Std) | AP@10 (Mean ± Std) |
|-----------------------------|-----------------------|-----------------|-------------------|--------------------|
| Interpretable Features + RF | AA + RF               | 0.24±0.10       | 0.34±0.24         | 0.35±0.17          |
|                             | Pos + RF              | 0.65±0.21       | 0.78±0.27         | 0.73±0.24          |
|                             | AA + Pos + RF         | 0.71±0.21       | 0.83±0.25         | 0.79±0.22          |
|                             | AA + Pos + 3CTX + RF  | 0.72±0.21       | 0.84±0.24         | 0.80±0.22          |
|                             | AA + Pos + 5CTX + RF  | 0.72±0.21       | 0.84±0.25         | 0.80±0.23          |
|                             | AA + Pos + 7CTX + RF  | 0.73±0.21       | 0.84±0.25         | 0.80±0.23          |
|                             | AA + Pos + 11CTX + RF | 0.73±0.22       | 0.85±0.24         | 0.81±0.23          |
|                             | AA + Pos + 17CTX + RF | 0.74±0.21       | 0.85±0.24         | 0.81±0.23          |
|                             | AA + Pos + 23CTX + RF | 0.74±0.21       | 0.85±0.25         | 0.81±0.23          |
![](resources/images/bivariate_absolute_error.png "Bivariate distribution of absolute error")
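
As an illustration of how such interpretable features can be built (our sketch, not the repository's exact featurization): one-hot amino-acid identity (AA), normalized sequence position (Pos), and the identities of the surrounding residues in a k-residue window (kCTX):

import numpy as np

AAS = "ACDEFGHIKLMNPQRSTVWY"

def one_hot(aa):
    vec = np.zeros(len(AAS))
    if aa in AAS:
        vec[AAS.index(aa)] = 1.0
    return vec

def featurize(sequence, ctx=3):
    # One feature row per residue: AA one-hot, relative position, and the
    # one-hot encodings of the ctx-residue window (zero-padded at the ends).
    half = ctx // 2
    rows = []
    for i, aa in enumerate(sequence):
        parts = [one_hot(aa), np.array([i / len(sequence)])]
        for j in range(i - half, i + half + 1):
            if j != i:
                parts.append(one_hot(sequence[j]) if 0 <= j < len(sequence) else np.zeros(len(AAS)))
        rows.append(np.concatenate(parts))
    return np.stack(rows)

The resulting rows can then be fed to a RandomForestClassifier exactly as in the Usage section below.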

Open questions?

Usage

The topefind package contains several subpackages:

And several core modules:

Additionally:

Get a parsed and labelled version of SAbDab

To do this, use data_hub.py.
This will create a local copy of SAbDab in datasets and parse it into a convenient parquet file. It might take a while; SAbDab is currently ~6 GB. Increase the number of jobs (n_jobs) for quicker parsing.

from topefind.data_hub import SabdabHub

df = SabdabHub(n_jobs=1)()
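
The result is a regular pandas DataFrame, so a quick sanity check works as usual:

# Inspect the parsed SAbDab table.
print(df.shape)
print(df.columns.tolist())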

Create a paratope dataset to play with

# Select some columns of interest
df = df[[
    "pdb",
    "antibody_sequence",
    "antibody_imgt",
    "antibody_chain",
    "chain_type",
    "resolution",
    "scfv",
    "antigen_sequence",
    "antigen_chain",
    "antigen_type",
    "num_antigen_chains",
    "full_paratope_labels",
]]

# Let's filter according to some literature guidelines.
df = df.drop_duplicates("antibody_sequence")  # Don't bias the model.
df = df[(df.antibody_sequence.str.len() > 70) & (df.antibody_sequence.str.len() < 200)]  # Don't go < 70 ...ANARCI.
df = df[df.full_paratope_labels.apply(sum) >= 1]  # At least some positives.
df = df[(df.num_antigen_chains > 0) & (df.num_antigen_chains <= 3)]  # Follows the choice in Paragraph.
df = df[~df.scfv]  # Hard to deal with since two chains are connected.
df = df[df.antigen_type.isin(["protein", "peptide"])]
df = df[df.resolution < 3]  # Contacts are well defined below this resolution (common choice in the literature).
df = df.reset_index()

# Done, a working dataset.
print(f"Dataset now contains {len(df)} entries")
print(f"{len(df[df.num_antigen_chains > 1])} entries are connected to multiple antigens")

sequences = df["antibody_sequence"].to_list()
labels = df["full_paratope_labels"].to_list()

test_sequence = sequences.pop()
test_label = labels.pop()

Let's train a classifier on top of ESM embeddings and predict the paratope


# Feel free to switch to bigger ESM2 models available under EmbedderName.
# Check `topefind.predictors` and `topefind.embedders` and choose your desired configuration.
from tabulate import tabulate
import numpy as np
from joblib import Memory
from sklearn.ensemble import RandomForestClassifier

from topefind.predictors import PLMSKClassifier
from topefind.embedders import EmbedderName, ESMEmbedder

# Cache the model for further usage
memory = Memory("cache", verbose=0)
seed = 42

@memory.cache
def build_esm_rf(
        train_sequences,
        train_labels,
        emb_name=EmbedderName.esm2_8m,
        n_estimators=128
):
    return PLMSKClassifier(
        ESMEmbedder(emb_name),
        RandomForestClassifier(n_estimators=n_estimators, n_jobs=4, random_state=seed),
    ).train(train_sequences, train_labels)

esm_rf = build_esm_rf(sequences, labels)
preds = esm_rf.predict(test_sequence)
trues = np.array(test_label)
results = np.vstack([trues, preds]).T  # Two columns: true labels and predictions.

print(tabulate(results, headers=["labels", "predictions"], tablefmt="fancy_grid", floatfmt=".2f"))
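
To relate a single prediction to the AP numbers reported in the tables above (plain scikit-learn, no topefind-specific API assumed):

from sklearn.metrics import average_precision_score

print(f"AP on the held-out chain: {average_precision_score(trues, preds):.2f}")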

Predict the paratope with an end-to-end model


from topefind.predictors import Parapred

parapred = Parapred()

preds = parapred.predict(test_sequence)
trues = np.array(test_label)
results = np.vstack([trues, preds]).T  # Two columns: true labels and predictions.

print(tabulate(results, headers=["labels", "predictions"], tablefmt="fancy_grid", floatfmt=".2f"))

Models

Some relevant models from the literature are reported for comparison.

| **Model Type** | **Model Name** | **Original Repository** | **DOI** | **Weights availability** | **Vendored**<sup>a</sup> | **Available for inference in topefind** |
|----------------|----------------|-------------------------|---------|:------------------------:|:------------------------:|:---------------------------------------:|
| End-to-end | Paragraph | [oxpig/Paragraph](https://github.com/oxpig/Paragraph) | [2022 Chinery et al.](https://doi.org/10.1093/bioinformatics/btac732) | ✅ | ✅ | ✅ |
| | PECAN | [vamships/PECAN](https://github.com/vamships/PECAN) | [2020 Pittala et al.](https://doi.org/10.1093/bioinformatics/btaa263) | 🚫 | ✅ | 🚫<sup>c</sup> |
| | Parapred (pytorch)<sup>b</sup> | [alchemab/parapred-pytorch](https://github.com/alchemab/parapred-pytorch)<br>[eliberis/parapred](https://github.com/eliberis/parapred) | [2018 Liberis et al.](https://doi.org/10.1093/bioinformatics/bty305) | ✅ | ✅ | ✅ |
| | antiBERTa | [alchemab/antiberta](https://github.com/alchemab/antiberta) | [2022 Leem et al.](https://doi.org/10.1016/j.patter.2022.100513) | 🚫 | ✅ | 🚫<sup>d</sup> |
| SSL | ESM | [facebookresearch/esm](https://github.com/facebookresearch/esm) | [2021 Rives et al.](https://doi.org/10.1073/pnas.2016239118) | ✅ | 🚫 | ✅ |
| | ProtT5 | [agemagician/ProtTrans](https://github.com/agemagician/ProtTrans) | [2021 Elnaggar et al.](https://doi.org/10.1109/TPAMI.2021.3095381) | ✅ | 🚫 | ✅ |
| | RITA | [lightonai/RITA](https://github.com/lightonai/RITA) | [2022 Hesslow et al.](https://doi.org/10.48550/arXiv.2205.05789) | ✅ | 🚫 | ✅ |
| | IgLM | [Graylab/IgLM](https://github.com/Graylab/IgLM) | [2022 Shuai et al.](https://doi.org/10.1101/2021.12.13.472419) | ✅ | ✅ | 🚫<sup>e</sup> |

<sup>a</sup> Check the [Vendored](#vendored) section.
<sup>b</sup> topefind uses the parapred-pytorch version provided by alchemab. The original Keras version by eliberis is not used for inference in topefind.

Some relevant models in the literature are not included, for the following reasons:
<sup>c</sup> Weights not available.
<sup>d</sup> Weights not available.
<sup>e</sup> License not permissive enough.

Literature Datasets Summary

We report some relevant datasets and a brief summary of the pre-processing applied to each one, to give an overview of the current landscape. Check each model's paper, referenced in Models, for more details.

| Method | Reported PR AUC | Dataset Availability | Dataset Size | Dataset Pre-Processing |
|--------|-----------------|:--------------------:|:------------:|------------------------|
| PECAN | 0.68<sup>a</sup> | | 460 | - Get the dataset from Darberdaku and Ferrari<br>- Cluster at 95% maximum sequence identity<br>- Discard complexes with antigens other than proteins, e.g. DNA |
| Paragraph | 0.70<sup>a</sup> | | 460 | - Get the PECAN dataset. |
| Paragraph Expanded | 0.73<sup>a</sup> | | 1086 | - Get X-ray structures from SAbDab with a resolution < 3.0 Å<br>- Discard Fvs that contain fewer than 10 binding residues<br>- Discard antibodies that have less than 50% of the binding residues in the CDR+2 region<br>- Cluster at 95% maximum sequence identity with CD-HIT<br>- Remove structures that ABodyBuilder cannot model, using a 95% identity threshold |
| Parapred<sup>c</sup> | 0.65<sup>a</sup><br>0.72<sup>b</sup> | | 277 | - Get X-ray structures from SAbDab with a resolution < 3.0 Å<br>- Keep only complexes that contain both heavy and light chains<br>- Cluster at 95% maximum sequence identity (with CD-HIT)<sup>f</sup><br>- Discard antibodies with < 5 residues in contact with the antigen |
| antiBERTa | 0.74<sup>b</sup> | ⚠️<sup>d</sup> | 900<sup>e</sup> | - Get X-ray structures from SAbDab with a resolution < 3.0 Å<br>- Label as paratope any residue of the antibody chain with a heavy atom within 4.5 Å of a heavy atom in an antigen chain<br>- Include only antibodies with at least one contact in the heavy chain and one in the light chain<br>- Cluster at 99% maximum sequence identity with CD-HIT<br>- Perform V-gene hierarchical clustering on antibody chains and remove clusters with < 3 antibodies<br>- Within each cluster, bin paratope lengths and remove antibody chains corresponding to bins with < 3 counts |

<sup>a</sup> Reported by Paragraph.
<sup>b</sup> Reported by antiBERTa.
<sup>c</sup> We use the parapred-pytorch adapted weights.
<sup>d</sup> The script to download the dataset for pre-training the antibody LM is given but not functional. The transfer-learning dataset is given but lacks the differentiation between heavy and light chains.
<sup>e</sup> Counts the total number of antibody chains in this case, not complexes.
<sup>f</sup> Not explicitly stated by Parapred.
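
For intuition, the heavy-atom contact criterion used by several of these datasets (e.g. antiBERTa's 4.5 Å cutoff) can be sketched in plain numpy; this assumes heavy-atom coordinates and per-atom residue indices have already been extracted from the structure:

import numpy as np

def label_paratope(ab_coords, ab_res_idx, ag_coords, n_residues, cutoff=4.5):
    # ab_coords: (N_ab, 3) heavy-atom coordinates of the antibody chain.
    # ab_res_idx: (N_ab,) 0-based residue index of each antibody atom.
    # ag_coords: (N_ag, 3) heavy-atom coordinates of the antigen chain(s).
    dists = np.linalg.norm(ab_coords[:, None, :] - ag_coords[None, :, :], axis=-1)
    contact_atoms = (dists < cutoff).any(axis=1)
    labels = np.zeros(n_residues, dtype=int)
    labels[np.unique(ab_res_idx[contact_atoms])] = 1
    return labels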

Vendored

Some related works are vendored for simplicity and reproducibility.
The vendored folder contains this software together with the LICENSE files provided by the authors.
Weights of trained models are stored in each vendored package when provided by the original authors.
We applied minor modifications to parts of the code to adapt it to each use case.
Please refer to each package for more details.

Additionally, a small part of panel_chemistry is vendored for the dashboard.

Acknowledgements

Built with love ❤️ by Serban Cristian Tudosie during an internship at Bayer 🧬.
Several more people have contributed: Santiago Villalba, Pedro Reis, Adrien Bitton, Sven Giese.