bayer-science-for-a-better-life / topefind-public

Finding the pitfalls of deep learning predictors of interacting residues in antibodies 🦠


Antibodies are Y-shaped proteins and one of the most essential components of the adaptive immune system, since they interact with antigens. The ability to engineer antibodies can significantly impact the development of therapeutics, and paratope prediction plays an essential role there. Given an antibody's sequence, predicting the paratope means finding the specific amino acids that bind to its target. This repository provides the tools to generate insights into paratope prediction, ultimately guiding model development towards faster and better production of therapeutics.
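
Concretely, paratope prediction is per-residue binary classification. A toy illustration (sequence and labels below are made up, not taken from any dataset):

# Toy example: 1 marks a paratope residue, 0 a non-binding one.
antibody_sequence = "EVQLVESGGGLVQPGG"
paratope_labels = [0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0]
assert len(antibody_sequence) == len(paratope_labels)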

Figure: the AlphaFold-Multimer contact-derived paratope vs. the ESM2 650M + random forest (transfer learning) predicted paratope on PDB entry 6A0Z.


Installation

Installation is currently supported via a conda environment.
Using Mambaforge is strongly encouraged:
github.com/conda-forge/miniforge

After having installed mamba or conda, clone the repository and create the environment:

# Clone it
git clone https://github.com/bayer-science-for-a-better-life/topefind-public

# Enter
cd topefind-public

# Create environment
mamba env create -f environment.yml

# Activate environment
mamba activate topefind

Dashboard

This repo comes with a dashboard, built with Panel, to analyze the results of several models on a test set. It can run on the fly in WASM; be aware that it might take a few minutes to install all packages from micropip and fully load. The dashboard can easily be modified to visualize new results: clone the repo and provide a new DataFrame.

https://github.com/bayer-science-for-a-better-life/topefind-public/assets/28055473/c76e4506-0aab-4318-954d-54a89108c206
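As a rough illustration of such a DataFrame (the column names below are hypothetical, not the dashboard's actual schema):

import pandas as pd

# Hypothetical schema: one row per residue with the true label and a
# model's predicted probability; check the repo for the real format.
results = pd.DataFrame({
    "pdb": ["6a0z", "6a0z"],
    "chain_type": ["heavy", "heavy"],
    "imgt": ["111", "112"],
    "label": [1, 0],
    "esm2_650m_rf": [0.83, 0.12],
})
results.to_parquet("my_results.parquet")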

Motivation

What can you do with better predictions?

Why is it so difficult?

Strange behaviour of current models and transfer learning

|            | **Model**          | AP (Mean ± Std) | AP@5 (Mean ± Std) | AP@10 (Mean ± Std) |
|------------|--------------------|-----------------|-------------------|--------------------|
| Baselines  | AF2M               | 0.32±0.19       | 0.41±0.29         | 0.40±0.24          |
|            | PosFreq            | 0.65±0.21       | 0.37±0.26         | 0.33±0.18          |
| End-to-end | Parapred           | 0.69±0.22       | 0.82±0.27         | 0.78±0.24          |
|            | Paragraph Unpaired | 0.74±0.22       | **0.87**±0.25     | **0.83**±0.23      |
| LM + RF    | ESM2 8M + RF       | 0.73±0.21       | 0.85±0.25         | 0.80±0.23          |
|            | ESM2 35M + RF      | 0.74±0.21       | 0.85±0.25         | 0.81±0.23          |
|            | ESM2 150M + RF     | 0.74±0.21       | 0.84±0.26         | 0.81±0.23          |
|            | ESM2 650M + RF     | **0.75**±0.21   | 0.85±0.25         | 0.81±0.23          |
|            | ProtT5 XL + RF     | **0.75**±0.21   | 0.86±0.25         | 0.81±0.23          |
|            | RITA XL + RF       | 0.70±0.22       | 0.82±0.26         | 0.78±0.24          |
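
Here AP denotes average precision (the area under the precision-recall curve) per antibody chain, averaged over the test set. A minimal sketch with scikit-learn; the AP@k variant below is our hypothetical reading (restricting scoring to the k top-ranked residues), not necessarily the repository's exact definition:

import numpy as np
from sklearn.metrics import average_precision_score

def ap(y_true, y_score):
    # Average precision over all residues of one chain.
    return average_precision_score(y_true, y_score)

def ap_at_k(y_true, y_score, k=10):
    # Hypothetical reading of AP@k: keep only the k residues with the
    # highest predicted scores before computing average precision.
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    top = np.argsort(y_score)[::-1][:k]
    if y_true[top].sum() == 0:
        return 0.0  # No positives among the top-k residues.
    return average_precision_score(y_true[top], y_score[top])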

Why is this happening?

|                             | **Model**             | AP (Mean ± Std) | AP@5 (Mean ± Std) | AP@10 (Mean ± Std) |
|-----------------------------|-----------------------|-----------------|-------------------|--------------------|
| Interpretable Features + RF | AA + RF               | 0.24±0.10       | 0.34±0.24         | 0.35±0.17          |
|                             | Pos + RF              | 0.65±0.21       | 0.78±0.27         | 0.73±0.24          |
|                             | AA + Pos + RF         | 0.71±0.21       | 0.83±0.25         | 0.79±0.22          |
|                             | AA + Pos + 3CTX + RF  | 0.72±0.21       | 0.84±0.24         | 0.80±0.22          |
|                             | AA + Pos + 5CTX + RF  | 0.72±0.21       | 0.84±0.25         | 0.80±0.23          |
|                             | AA + Pos + 7CTX + RF  | 0.73±0.21       | 0.84±0.25         | 0.80±0.23          |
|                             | AA + Pos + 11CTX + RF | 0.73±0.22       | 0.85±0.24         | 0.81±0.23          |
|                             | AA + Pos + 17CTX + RF | 0.74±0.21       | 0.85±0.24         | 0.81±0.23          |
|                             | AA + Pos + 23CTX + RF | 0.74±0.21       | 0.85±0.25         | 0.81±0.23          |
![](resources/images/bivariate_absolute_error.png "Bivariate distribution of absolute error")
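
As an illustration of how such interpretable features can be built (our sketch, not the repository's exact featurization): one-hot amino-acid identity (AA), normalized sequence position (Pos), and the identities of the surrounding residues in a k-residue window (kCTX):

import numpy as np

AAS = "ACDEFGHIKLMNPQRSTVWY"

def one_hot(aa):
    vec = np.zeros(len(AAS))
    if aa in AAS:
        vec[AAS.index(aa)] = 1.0
    return vec

def featurize(sequence, ctx=3):
    # One feature row per residue: AA one-hot, relative position, and the
    # one-hot encodings of the ctx-residue window (zero-padded at the ends).
    half = ctx // 2
    rows = []
    for i, aa in enumerate(sequence):
        parts = [one_hot(aa), np.array([i / len(sequence)])]
        for j in range(i - half, i + half + 1):
            if j != i:
                parts.append(one_hot(sequence[j]) if 0 <= j < len(sequence) else np.zeros(len(AAS)))
        rows.append(np.concatenate(parts))
    return np.stack(rows)

The resulting rows can then be fed to a RandomForestClassifier exactly as in the Usage section below.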

Open questions?

Usage

The topefind package contains several subpackages:

And several core modules:

Additionally:

Get a parsed and labelled version of SAbDab

To do this, use data_hub.py.
This will create a local copy of SAbDab in datasets and parse it into a convenient parquet file. It might take a while; SAbDab is currently ~6 GB. Increase the number of jobs (n_jobs) for quicker parsing.

from topefind.data_hub import SabdabHub

df = SabdabHub(n_jobs=1)()
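
The result is a regular pandas DataFrame, so a quick sanity check works as usual:

# Inspect the parsed SAbDab table.
print(df.shape)
print(df.columns.tolist())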

Create a paratope dataset to play with

# Select some columns of interest
df = df[[
    "pdb",
    "antibody_sequence",
    "antibody_imgt",
    "antibody_chain",
    "chain_type",
    "resolution",
    "scfv",
    "antigen_sequence",
    "antigen_chain",
    "antigen_type",
    "num_antigen_chains",
    "full_paratope_labels",
]]

# Let's filter according to some literature guidelines.
df = df.drop_duplicates("antibody_sequence")  # Don't bias the model.
df = df[(df.antibody_sequence.str.len() > 70) & (df.antibody_sequence.str.len() < 200)]  # Don't go < 70 ...ANARCI.
df = df[df.full_paratope_labels.apply(sum) >= 1]  # At least some positives.
df = df[(df.num_antigen_chains > 0) & (df.num_antigen_chains <= 3)]  # Follows the choice in Paragraph.
df = df[~df.scfv]  # Hard to deal with since two chains are connected.
df = df[df.antigen_type.isin(["protein", "peptide"])]
df = df[df.resolution < 3]  # Contacts are well defined below this resolution (common choice in the literature).
df = df.reset_index()

# Done, a working dataset.
print(f"Dataset now contains {len(df)} entries")
print(f"{len(df[df.num_antigen_chains > 1])} entries are connected to multiple antigens")

sequences = df["antibody_sequence"].to_list()
labels = df["full_paratope_labels"].to_list()

test_sequence = sequences.pop()
test_label = labels.pop()

Let's train a classifier on top of ESM embeddings and predict the paratope


# Feel free to switch to bigger ESM2 models available under EmbedderName.
# Check `topefind.predictors` and `topefind.embedders` and choose your desired configuration.
from tabulate import tabulate
import numpy as np
from joblib import Memory
from sklearn.ensemble import RandomForestClassifier

from topefind.predictors import PLMSKClassifier
from topefind.embedders import EmbedderName, ESMEmbedder

# Cache the model for further usage
memory = Memory("cache", verbose=0)
seed = 42

@memory.cache
def build_esm_rf(
        train_sequences,
        train_labels,
        emb_name=EmbedderName.esm2_8m,
        n_estimators=128
):
    return PLMSKClassifier(
        ESMEmbedder(emb_name),
        RandomForestClassifier(n_estimators=n_estimators, n_jobs=4, random_state=seed),
    ).train(train_sequences, train_labels)

esm_rf = build_esm_rf(sequences, labels)
preds = esm_rf.predict(test_sequence)
trues = np.array(test_label)
results = np.vstack([trues, preds]).T  # Two columns: true labels and predictions.

print(tabulate(results, headers=["labels", "predictions"], tablefmt="fancy_grid", floatfmt=".2f"))
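
To relate a single prediction to the AP numbers reported in the tables above (plain scikit-learn, no topefind-specific API assumed):

from sklearn.metrics import average_precision_score

print(f"AP on the held-out chain: {average_precision_score(trues, preds):.2f}")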

Predict the paratope with an end-to-end model


from topefind.predictors import Parapred

parapred = Parapred()

preds = parapred.predict(test_sequence)
trues = np.array(test_label)
results = np.vstack([trues, preds]).T  # Two columns: true labels and predictions.

print(tabulate(results, headers=["labels", "predictions"], tablefmt="fancy_grid", floatfmt=".2f"))

Models

Some relevant models from the literature are reported for comparison.

| **Model Type** | **Model Name** | **Original Repository** | **DOI** | **Weights availability** | **Vendored**<sup>a</sup> | **Available for inference in topefind** |
|----------------|----------------|-------------------------|---------|:------------------------:|:------------------------:|:---------------------------------------:|
| End-to-end | Paragraph | [oxpig/Paragraph](https://github.com/oxpig/Paragraph) | [2022 Chinery et al.](https://doi.org/10.1093/bioinformatics/btac732) | ✅ | ✅ | ✅ |
| | PECAN | [vamships/PECAN](https://github.com/vamships/PECAN) | [2020 Pittala et al.](https://doi.org/10.1093/bioinformatics/btaa263) | 🚫 | ✅ | 🚫<sup>c</sup> |
| | Parapred (pytorch)<sup>b</sup> | [alchemab/parapred-pytorch](https://github.com/alchemab/parapred-pytorch)<br>[eliberis/parapred](https://github.com/eliberis/parapred) | [2018 Liberis et al.](https://doi.org/10.1093/bioinformatics/bty305) | ✅ | ✅ | ✅ |
| | antiBERTa | [alchemab/antiberta](https://github.com/alchemab/antiberta) | [2022 Leem et al.](https://doi.org/10.1016/j.patter.2022.100513) | 🚫 | ✅ | 🚫<sup>d</sup> |
| SSL | ESM | [facebookresearch/esm](https://github.com/facebookresearch/esm) | [2021 Rives et al.](https://doi.org/10.1073/pnas.2016239118) | ✅ | 🚫 | ✅ |
| | ProtT5 | [agemagician/ProtTrans](https://github.com/agemagician/ProtTrans) | [2021 Elnaggar et al.](https://doi.org/10.1109/TPAMI.2021.3095381) | ✅ | 🚫 | ✅ |
| | RITA | [lightonai/RITA](https://github.com/lightonai/RITA) | [2022 Hesslow et al.](https://doi.org/10.48550/arXiv.2205.05789) | ✅ | 🚫 | ✅ |
| | IgLM | [Graylab/IgLM](https://github.com/Graylab/IgLM) | [2022 Shuai et al.](https://doi.org/10.1101/2021.12.13.472419) | ✅ | ✅ | 🚫<sup>e</sup> |

<sup>a</sup> Check the [Vendored](#vendored) section.
<sup>b</sup> topefind uses the parapred-pytorch version provided by alchemab. The original Keras version by eliberis is not used for inference in topefind.

Some relevant models in the literature are not included, for the following reasons:
<sup>c</sup> Weights not available.
<sup>d</sup> Weights not available.
<sup>e</sup> License not permissive enough.

Literature Datasets Summary

We report some relevant datasets and a brief summary of the pre-processing applied to each one, to give an overview of the current landscape. Check each model's paper, referenced in Models, for more details.

| Method | Reported PR AUC | Dataset Availability | Dataset Size | Dataset Pre-Processing |
|--------|-----------------|:--------------------:|:------------:|------------------------|
| PECAN | 0.68<sup>a</sup> | | 460 | - Get the dataset from Darberdaku and Ferrari<br>- Cluster at 95% maximum sequence identity<br>- Discard complexes with antigens other than proteins, e.g. DNA |
| Paragraph | 0.70<sup>a</sup> | | 460 | - Get the PECAN dataset. |
| Paragraph Expanded | 0.73<sup>a</sup> | | 1086 | - Get X-ray structures from SAbDab with a resolution < 3.0 Å<br>- Discard Fvs that contain fewer than 10 binding residues<br>- Discard antibodies that have less than 50% of the binding residues in the CDR+2 region<br>- Cluster at 95% maximum sequence identity with CD-HIT<br>- Remove structures that ABodyBuilder cannot model, using a 95% identity threshold |
| Parapred<sup>c</sup> | 0.65<sup>a</sup><br>0.72<sup>b</sup> | | 277 | - Get X-ray structures from SAbDab with a resolution < 3.0 Å<br>- Keep only complexes that contain both heavy and light chains<br>- Cluster at 95% maximum sequence identity (with CD-HIT)<sup>f</sup><br>- Discard antibodies with < 5 residues in contact with the antigen |
| antiBERTa | 0.74<sup>b</sup> | ⚠️<sup>d</sup> | 900<sup>e</sup> | - Get X-ray structures from SAbDab with a resolution < 3.0 Å<br>- Label as paratope any residue of the antibody chain with a heavy atom within 4.5 Å of a heavy atom in an antigen chain<br>- Include only antibodies with at least one contact in the heavy chain and one in the light chain<br>- Cluster at 99% maximum sequence identity with CD-HIT<br>- Perform V-gene hierarchical clustering on antibody chains and remove clusters with < 3 antibodies<br>- Within each cluster, bin paratope lengths and remove antibody chains corresponding to bins with < 3 counts |

<sup>a</sup> Reported by Paragraph.
<sup>b</sup> Reported by antiBERTa.
<sup>c</sup> We use the parapred-pytorch adapted weights.
<sup>d</sup> The script to download the dataset for pre-training the antibody LM is given but not functional. The transfer-learning dataset is given but lacks the differentiation between heavy and light chains.
<sup>e</sup> Counts the total number of antibody chains in this case, not complexes.
<sup>f</sup> Not explicitly stated by Parapred.
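
For intuition, the heavy-atom contact criterion used by several of these datasets (e.g. antiBERTa's 4.5 Å cutoff) can be sketched in plain numpy; this assumes heavy-atom coordinates and per-atom residue indices have already been extracted from the structure:

import numpy as np

def label_paratope(ab_coords, ab_res_idx, ag_coords, n_residues, cutoff=4.5):
    # ab_coords: (N_ab, 3) heavy-atom coordinates of the antibody chain.
    # ab_res_idx: (N_ab,) 0-based residue index of each antibody atom.
    # ag_coords: (N_ag, 3) heavy-atom coordinates of the antigen chain(s).
    dists = np.linalg.norm(ab_coords[:, None, :] - ag_coords[None, :, :], axis=-1)
    contact_atoms = (dists < cutoff).any(axis=1)
    labels = np.zeros(n_residues, dtype=int)
    labels[np.unique(ab_res_idx[contact_atoms])] = 1
    return labels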

Vendored

Some related works are vendored for simplicity and reproducibility.
The vendored folder contains this software together with the LICENSE files provided by the authors.
Weights of trained models are stored in each vendored package when provided by the original authors.
We applied minor modifications to parts of the code to adapt it to each use case.
Please refer to each package for more details.

Additionally, a small part of panel_chemistry is vendored for the dashboard.

Acknowledgements

Built with love ❤️ by Serban Cristian Tudosie during an internship at Bayer 🧬.
Several more people have contributed: Santiago Villalba, Pedro Reis, Adrien Bitton, Sven Giese.