JSchlensok / VespaG

Expert-Guided Protein Language Models enable Accurate and Blazingly Fast Fitness Prediction
GNU General Public License v3.0

VespaG: Expert-Guided Protein Language Models enable Accurate and Blazingly Fast Fitness Prediction

VespaG is a blazingly fast single amino acid variant effect predictor, leveraging embeddings of the protein language model ESM-2 (Lin et al. 2022) as input to a minimal deep learning model.

To overcome the sparsity of experimental training data, we created a dataset of 39 million single amino acid variants from a subset of the human proteome and annotated it with predictions from the multiple sequence alignment-based effect predictor GEMME (Laine et al. 2019), which serve as a proxy for experimental scores.
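To give a feel for how such a variant set grows with sequence length, here is a minimal sketch that enumerates every single amino acid variant of a sequence. The function name `enumerate_savs` and the `<wild type><position><mutant>` naming scheme are assumptions made for illustration, not necessarily what VespaG uses internally.

```python
# The 20 standard amino acids in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def enumerate_savs(sequence: str) -> list[str]:
    """List every single amino acid variant of `sequence`.

    Variants are named <wild type><1-based position><mutant>, e.g. "M1A".
    This naming scheme is an assumption for illustration.
    """
    variants = []
    for position, wild_type in enumerate(sequence, start=1):
        for mutant in AMINO_ACIDS:
            if mutant != wild_type:  # skip the unchanged residue
                variants.append(f"{wild_type}{position}{mutant}")
    return variants


savs = enumerate_savs("MKT")
print(len(savs))  # 3 residues x 19 substitutions = 57
print(savs[:3])   # ['M1A', 'M1C', 'M1D']
```

A sequence of L standard residues yields 19 × L variants, so if every residue contributes all 19 substitutions, 39 million variants correspond to roughly two million residues.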

Assessed on the ProteinGym (Notin et al. 2023) benchmark, VespaG matches state-of-the-art methods while being several orders of magnitude faster, predicting the entire single-site mutational landscape of the human proteome in under half an hour on a consumer-grade laptop.

More details on VespaG can be found in the corresponding preprint.

Installation

  1. conda create -n vespag python==3.10 poetry==1.8.3 (exchange conda for mamba, miniconda or micromamba as you like)
  2. conda activate vespag
  3. export PYTHON_KEYRING_BACKEND=keyring.backends.null.Keyring
  4. poetry install

Quick Start: Running Inference with VespaG

Run python -m vespag predict with the following options:
Required:

Examples

After installing the dependencies above and cloning the VespaG repo, you can try out the following examples:

Re-training VespaG

VespaG uses DVC for pipeline orchestration and WandB for experiment tracking.

Using WandB is optional; a username and project for WandB can be specified in params.yaml.

DVC is required. A dvc.yaml file contains stages for generating pLM embeddings from FASTA files, but you can also download pre-computed embeddings and GEMME scores from our Zenodo repository. Adjust the paths in params.yaml to your setup, and feel free to experiment with the model parameters. Start a training run with dvc repro -s train@<model_type>-{esm2|prott5}-<dataset>, where <model_type> and <dataset> each correspond to a named block in params.yaml.
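The stage name follows a fixed pattern, so it can be assembled programmatically, for instance when launching several training runs from a script. The sketch below only builds the command; the helper name `training_command` and the values "my_model" and "my_dataset" are placeholders and must be replaced by block names that actually exist in your params.yaml.

```python
import shlex

# The two pLM backbones named in the stage pattern above.
SUPPORTED_PLMS = ("esm2", "prott5")


def training_command(model_type: str, plm: str, dataset: str) -> list[str]:
    """Build the `dvc repro` invocation for one training stage."""
    if plm not in SUPPORTED_PLMS:
        raise ValueError(f"plm must be one of {SUPPORTED_PLMS}, got {plm!r}")
    target = f"train@{model_type}-{plm}-{dataset}"
    # -s (--single-item) reproduces only this stage, not its upstream stages.
    return ["dvc", "repro", "-s", target]


# "my_model" and "my_dataset" are placeholders, not real block names.
cmd = training_command("my_model", "esm2", "my_dataset")
print(shlex.join(cmd))  # dvc repro -s train@my_model-esm2-my_dataset
```

To actually start the run, pass the list to `subprocess.run(cmd, check=True)` from the repository root.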

Evaluation

You can reproduce our evaluation using the eval subcommand, which pre-processes data into a format usable by VespaG, runs predict, and computes performance metrics.
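This section does not say which metrics eval computes. Spearman rank correlation between predicted and measured scores is a standard metric on ProteinGym, so the sketch below assumes it is among them and shows how it is calculated. It is a plain-Python illustration with made-up scores, not VespaG's own evaluation code.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the group while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared_rank
        start = end + 1
    return ranks


def spearman(predicted, measured):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    if len(predicted) != len(measured) or len(predicted) < 2:
        raise ValueError("need two equally long sequences with at least 2 items")
    rx, ry = average_ranks(predicted), average_ranks(measured)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


# Made-up scores for four variants, purely for illustration.
predicted = [0.1, 0.4, 0.35, 0.8]
measured = [0.2, 0.5, 0.9, 1.0]
rho = spearman(predicted, measured)
print(round(rho, 3))  # 0.8
```

Because only ranks enter the calculation, the predictor's scores do not need to be on the same scale as the experimental measurements.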

ProteinGym217

This evaluation is based on the ProteinGym (Notin et al. 2023) DMS substitutions benchmark, which we dub ProteinGym217. Run it with python -m vespag eval proteingym with the following options:
Optional:

Preprint Citation

If you find VespaG helpful in your work, please cite our preprint:

@article{vespag,
    author = {Celine Marquet and Julius Schlensok and Marina Abakarova and Burkhard Rost and Elodie Laine},
    title = {VespaG: Expert-guided protein Language Models enable accurate and blazingly fast fitness prediction},
    year = {2024},
    doi = {10.1101/2024.04.24.590982},
    publisher = {Cold Spring Harbor Laboratory},
    URL = {https://www.biorxiv.org/content/early/2024/04/28/2024.04.24.590982},
    journal = {bioRxiv}
}