This repo contains inference code for "PoET: A generative model of protein families as sequences-of-sequences", a state-of-the-art protein language model for variant effect prediction and conditional sequence generation.
To set up the environment:
1. Have mamba (a faster alternative to conda) installed (see its installation instructions).
2. Have conda-lock installed in your base conda/mamba environment (see its installation instructions).
3. Run make create_conda_env. This will create a conda environment named poet.
4. Run make download_model to download the model (~400MB). The model will be located at data/poet.ckpt. Please note the license.

Use the script scripts/score.py to obtain fitness scores for a list of protein variants given an MSA of homologs of the WT sequence.
1. Activate the poet conda environment.
2. Run the script, replacing the values in angle brackets with the appropriate paths.
python scripts/score.py \
--msa_a3m_path <path to MSA of homologs of WT sequence> \
--variants_fasta_path <path to fasta file containing variants to score> \
--output_npy_path <path to output file where scores for each variant will be stored as a numpy array>
You can pass a lower value for the batch size (--batch_size) if you run out of VRAM. The script was tested on an A100 GPU with 40GB VRAM.
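Once the script finishes, the output file can be read back and matched to the variants. The sketch below is a minimal illustration, not part of this repo: the helper names read_fasta_ids and load_scores are hypothetical, and it assumes the saved array is one-dimensional with one score per FASTA record, in the same order as the records in the variants file.

```python
import numpy as np


def read_fasta_ids(fasta_path):
    """Return the header (text after '>') of each FASTA record, in file order."""
    ids = []
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                ids.append(line[1:].strip())
    return ids


def load_scores(variants_fasta_path, output_npy_path):
    """Pair each variant header with its score.

    Assumption (not confirmed by the README): the array holds one score per
    FASTA record, in the same order as the records in the variants file.
    """
    ids = read_fasta_ids(variants_fasta_path)
    scores = np.load(output_npy_path)
    if scores.shape != (len(ids),):
        raise ValueError(
            f"expected {len(ids)} scores, found array of shape {scores.shape}"
        )
    return list(zip(ids, scores.tolist()))


# Example usage (paths are placeholders):
# for variant_id, score in load_scores("variants.fasta", "scores.npy"):
#     print(variant_id, score)
```

The shape check is there so that a mismatch between the FASTA file and the score array fails loudly instead of silently mispairing variants.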
Run the scoring script without arguments (python scripts/score.py) to score variants in the BLAT_ECOLX_Jacquier_2013 dataset from ProteinGym.
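The example uses the following files:
- Dataset: data/BLAT_ECOLX_Jacquier_2013.csv
- Variants to score: data/BLAT_ECOLX_Jacquier_2013_variants.fasta
- MSA of homologs: data/BLAT_ECOLX_ColabFold_2202.a3m
- Output scores: data/BLAT_ECOLX_Jacquier_2013_variants.npy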
The scores produced by the script should achieve a Spearman correlation above 0.65 with the measured fitness (the DMS_score column in the dataset file).
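One way to run this check is sketched below. It is an illustration under stated assumptions rather than the repo's own evaluation code: the helper names are hypothetical, and it assumes the rows of the dataset CSV are in the same order as the scored variants. Spearman correlation is computed here as the Pearson correlation of average ranks, using only NumPy; scipy.stats.spearmanr is an equivalent alternative if SciPy is installed.

```python
import csv

import numpy as np


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(1, len(values) + 1)
    for v in np.unique(values):
        tied = values == v
        ranks[tied] = ranks[tied].mean()
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return float(np.corrcoef(average_ranks(x), average_ranks(y))[0, 1])


def read_dms_scores(csv_path, column="DMS_score"):
    """Read the measured fitness column from the dataset CSV, in row order."""
    with open(csv_path, newline="") as handle:
        return [float(row[column]) for row in csv.DictReader(handle)]


# Example usage, assuming CSV rows and scored variants share the same order:
# measured = read_dms_scores("data/BLAT_ECOLX_Jacquier_2013.csv")
# predicted = np.load("data/BLAT_ECOLX_Jacquier_2013_variants.npy")
# print(spearman(predicted, measured))
```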
You may cite the paper as:
@inproceedings{NEURIPS2023_f4366126,
author = {Truong Jr, Timothy and Bepler, Tristan},
booktitle = {Advances in Neural Information Processing Systems},
editor = {A. Oh and T. Neumann and A. Globerson and K. Saenko and M. Hardt and S. Levine},
pages = {77379--77415},
publisher = {Curran Associates, Inc.},
title = {PoET: A generative model of protein families as sequences-of-sequences},
url = {https://proceedings.neurips.cc/paper_files/paper/2023/file/f4366126eba252699b280e8f93c0ab2f-Paper-Conference.pdf},
volume = {36},
year = {2023}
}
This source code is licensed under the MIT license found in the LICENSE file in the root directory of this source tree.
The PoET model weights (DOI: 10.5281/zenodo.10061322) are available under the CC BY-NC-SA 4.0 license for academic use only. The license can also be found in the LICENSE file provided with the model weights. For commercial use, please reach out to us at contact@ne47.bio about licensing. Copyright (c) NE47 Bio, Inc. All Rights Reserved.