facebookresearch / esm

Evolutionary Scale Modeling (esm): Pretrained language models for proteins
MIT License

Pre-training and fine-tuning scripts #11

Closed wuzhen247 closed 4 years ago

wuzhen247 commented 4 years ago

Great work! I find no model pre-training and downstream task fine-tuning scripts in the repository. Could you provide them?

Thanks.

joshim5 commented 4 years ago

Hi @wuzhen247, thanks for your interest! In the examples folder, we provide a tutorial for training a downstream model to predict the fitness of protein variants.
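The general shape of such a downstream model can be sketched as follows. This is not the repository's notebook, just a minimal illustration: a ridge regression fit on mean-pooled per-variant embeddings to predict fitness scores, with random stand-ins for the embeddings and labels (real ESM-1 mean embeddings are 1280-dimensional).

```python
# Hypothetical sketch (not the examples/ notebook): ridge regression on
# mean-pooled sequence embeddings to predict variant fitness.
# X and y below are synthetic stand-ins for embeddings and fitness labels.
import numpy as np

rng = np.random.default_rng(0)
n_variants, embed_dim = 200, 64                 # real ESM-1 embeddings: 1280-d
X = rng.normal(size=(n_variants, embed_dim))    # one mean embedding per variant
w_true = rng.normal(size=embed_dim)
y = X @ w_true + 0.1 * rng.normal(size=n_variants)  # synthetic fitness scores

# Closed-form ridge regression: w = (X^T X + lambda I)^{-1} X^T y
lam = 1.0
w = np.linalg.solve(X.T @ X + lam * np.eye(embed_dim), X.T @ y)
preds = X @ w
```

Any regressor (e.g. scikit-learn's `Ridge` or an MLP) can be substituted for the closed-form fit; the key point is that the embeddings are treated as fixed features.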

At this time, we do not provide model pre-training in ESM. Internally, we trained these models using the fairseq toolkit. We highly recommend using fairseq for pre-training new models.
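For readers unfamiliar with the objective: ESM models are pre-trained with a BERT-style masked-language-model loss over protein sequences. The corruption step can be sketched in plain Python as below; the 15% masking rate and 80/10/10 replacement split follow the standard BERT recipe and are an assumption here, not a quote of the exact fairseq configuration.

```python
# Illustrative sketch of the masked-LM corruption step used in BERT-style
# pre-training. Rates (15% masked, 80/10/10 split) are the standard BERT
# recipe, assumed here rather than taken from the fairseq config.
import random

AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")
MASK = "<mask>"

def mask_sequence(seq, rate=0.15, rng=None):
    """Return (corrupted tokens, target positions) for one protein sequence."""
    rng = rng or random.Random(0)
    tokens = list(seq)
    targets = []                         # positions the model must predict
    for i in range(len(tokens)):
        if rng.random() < rate:
            targets.append(i)
            r = rng.random()
            if r < 0.8:
                tokens[i] = MASK                     # 80%: replace with <mask>
            elif r < 0.9:
                tokens[i] = rng.choice(AMINO_ACIDS)  # 10%: random residue
            # remaining 10%: keep the original residue unchanged
    return tokens, targets

corrupted, targets = mask_sequence("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")
```

The model is then trained to recover the original residues at the `targets` positions from the corrupted sequence.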

Hope that helps! Feel free to reopen the issue if you have any more questions.

wuzhen247 commented 4 years ago


Thanks for your suggestions and fast response. I will try them.

nasserhashemi commented 3 years ago

Hi, thanks so much for your great work; it is so useful. I have a question regarding the above issue. In the example you mentioned: "Our embeddings are stored with the file name from fasta header: {index}|{mutation_id}|{effect}.pt". How did you do that? I mean, how do you convert the sequences in the fasta file to the embedding files (the .pt files)? Thanks again

Nasser

joshim5 commented 3 years ago

Hi @nasserhashemi, this is described under "prerequisites" in the example notebook. Pasting below for your convenience.

You have obtained sequence embeddings for ß-lactamase as described in the README, either by:

- running `python extract.py esm1_t34_670M_UR50S examples/P62593.fasta examples/P62593_reprs/ --repr_layers 34 --include mean`, or
- downloading the precomputed embeddings we provide (see below for how to download them from within this notebook).
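The header-to-filename mapping can be sketched as follows. This is a hypothetical, stdlib-only illustration of the naming convention (not `extract.py` itself): each FASTA record's header line becomes the name of its embedding file, so a header like `>0|M1V|0.5` yields `0|M1V|0.5.pt`. Computing the actual embedding tensor saved in each `.pt` file is omitted.

```python
# Hypothetical sketch of the naming convention used by extract.py: the
# FASTA header of each record becomes the output filename, with ".pt"
# appended. The embedding computation itself is omitted here.
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

fasta = ">0|M1V|0.5\nMKTAYIAK\n>1|A2G|1.2\nMATAYIAK\n"
outputs = {f"{h}.pt": seq for h, seq in parse_fasta(fasta)}
# outputs maps "0|M1V|0.5.pt" -> "MKTAYIAK", etc.
```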

nasserhashemi commented 3 years ago

Oh, I see. Great, thanks so much for your prompt reply!