brianhie / efficient-evolution

Efficient evolution from protein language models
MIT License

How to understand the function reconstruct_multi_models() and what it does to the model. #38

Closed ligeng-k closed 3 months ago

ligeng-k commented 1 year ago

Hi, developer!

How should I understand the function reconstruct_multi_models(), and what does it do to the model?

Best, Jamie

Note: the function is defined in amis.py:

def reconstruct_multi_models(
        wt_seq,
        model_names=[
            'esm1b',
            'esm1v1',
            'esm1v2',
            'esm1v3',
            'esm1v4',
            'esm1v5',
        ],
        alpha=None,
        return_names=False,
):
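    # Vote count per mutation, and the names of the models that proposed it.
    mutations_models, mutations_model_names = {}, {}
    for model_name in model_names:
        model = get_model_name(model_name)
        if alpha is None:
            # Reconstruct the wildtype with this model, then keep the
            # positions where the reconstruction differs from the wildtype.
            wt_new = reconstruct(
                wt_seq, model, decode_kwargs={ 'exclude': 'unnatural' }
            )
            mutations_model = diff(wt_seq, wt_new)
        else:
            # With alpha set, soft_reconstruct() returns the mutations directly.
            mutations_model = soft_reconstruct(
                wt_seq, model, alpha=alpha,
            )
        # Each model contributes one vote per mutation it proposes.
        for mutation in mutations_model:
            if mutation not in mutations_models:
                mutations_models[mutation] = 0
                mutations_model_names[mutation] = []
            mutations_models[mutation] += 1
            mutations_model_names[mutation].append(model.name_)
        # Drop the reference to this model before moving on to the next one.
        del model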

    if return_names:
        return mutations_models, mutations_model_names

    return mutations_models

SanFran-Me commented 3 months ago

Do you have any idea about how to train this model on my own dataset? I didn't find any script for training

avilella commented 3 months ago

I'd also be keen on doing the same: adding my own protein sequences to the model, then running the new model for predicting new sequences.

brianhie commented 3 months ago

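reconstruct_multi_models() runs the wildtype sequence through each ESM model in model_names, selects the amino acid with the maximum likelihood at each position, and then checks where those predictions differ from the wildtype sequence. Each difference is a suggested mutation, and the function counts how many models suggested it (and, with return_names=True, which ones). It does not modify the models themselves.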
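The counting step can be illustrated with a small, self-contained sketch. This is not the code from amis.py: the "models" here are hypothetical toy functions that return a reconstructed sequence, the diff() below is a stand-in written for this example, and the (position, wildtype residue, new residue) tuple format is an assumption. Only the voting logic mirrors the function quoted above.

```python
from collections import defaultdict


def diff(wt_seq, new_seq):
    """Stand-in for the repo's diff(): list (position, wt, new) mismatches.

    The tuple layout is an assumption made for this illustration.
    """
    return [
        (i, wt, new)
        for i, (wt, new) in enumerate(zip(wt_seq, new_seq))
        if wt != new
    ]


def count_votes(wt_seq, predictors):
    """Count how many predictors propose each mutation of wt_seq.

    `predictors` maps a name to a callable that returns a reconstructed
    sequence; it stands in for the pretrained ESM models.
    """
    votes = defaultdict(int)
    names = defaultdict(list)
    for name, predict in predictors.items():
        for mutation in diff(wt_seq, predict(wt_seq)):
            votes[mutation] += 1
            names[mutation].append(name)
    return dict(votes), dict(names)


# Hypothetical toy predictors, not real language models.
toy_predictors = {
    'toy_a': lambda s: s.replace('K', 'R'),
    'toy_b': lambda s: 'M' + s[1:].replace('K', 'R'),
    'toy_c': lambda s: s,  # agrees with the wildtype everywhere
}

votes, names = count_votes('AKLV', toy_predictors)
print(votes)
print(names)
```

In this toy run, two of the three predictors propose K to R at position 1, so that mutation gets two votes, while the A to M change at position 0 gets one.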

We use pretrained ESM models to suggest mutations. An important takeaway from our paper is that general protein language models may in many cases work better than specialized protein language models, with antibody evolution as a notable example.

There are a number of resources describing different ways to finetune ESM:

https://github.com/facebookresearch/esm/discussions/33
https://huggingface.co/blog/AmelieSchreiber/esm-interact
https://aws.amazon.com/blogs/machine-learning/efficiently-fine-tune-the-esm-2-protein-language-model-with-amazon-sagemaker/