facebookresearch / esm

Evolutionary Scale Modeling (esm): Pretrained language models for proteins
MIT License

Extracting per-residue/per-protein embeddings on GPU #2

Closed ptynecki closed 4 years ago

ptynecki commented 4 years ago

Hey,

Thank you for doing the research that is needed to address many biotech problems.

Is there any plan to add support for extracting per-residue embeddings on GPU (multi-GPU)?

...

# Extract per-residue embeddings (on CPU)
with torch.no_grad():
    results = model(batch_tokens, repr_layers=[34])

...

I have another question: how can I apply an ESM embedding to get a per-protein vector? Is it enough to apply mean(dim=0)?

Thanks, Piotr

joshim5 commented 4 years ago

Hi Piotr, thanks for your interest and these great questions! Yes, you can certainly extract per-residue embeddings on GPU. It's as easy as calling model.cuda() before extracting the representations. Here's a short tutorial explaining this in more detail.
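The device-moving pattern described above can be sketched with a toy PyTorch module standing in for the ESM model (the real model would be loaded via the `esm` package instead; the toy module and tensor shapes here are illustrative assumptions, not the ESM API):

```python
import torch
import torch.nn as nn

# Toy stand-in for the ESM model; the device-handling pattern is identical
# when you substitute the real pretrained model.
model = nn.Embedding(num_embeddings=33, embedding_dim=8)
model.eval()

# Use the GPU when one is available, otherwise fall back to CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)  # the generic equivalent of model.cuda()

# Dummy token batch; the inputs must live on the same device as the model.
batch_tokens = torch.randint(0, 33, (2, 10))
batch_tokens = batch_tokens.to(device)

with torch.no_grad():
    reps = model(batch_tokens)

# Move representations back to CPU for downstream NumPy/pandas use.
reps = reps.cpu()
print(reps.shape)  # torch.Size([2, 10, 8])
```

For multi-GPU use, the same pattern extends via `torch.nn.DataParallel` or `DistributedDataParallel`, though batching per device is then handled by those wrappers.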

To answer your second question, you can get per-protein vectors by averaging the representations. It's a little more complicated than applying mean(dim=0) because it's important to (a) drop the initial beginning of sentence token; and (b) remove all padding tokens. You can use the provided extract.py script with --include mean to do this automatically. Here's the relevant line of code that applies the mean pooling.
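The pooling described above (drop the BOS token, skip padding, then average) can be sketched as a small helper; passing the raw sequence lengths explicitly is one way to identify the valid positions (the function name and argument layout here are assumptions for illustration, not the `extract.py` API):

```python
import torch

def per_protein_embedding(token_representations, seq_lens):
    """Average per-residue representations into one vector per protein.

    token_representations: (batch, max_tokens, dim) tensor, e.g. from
        results["representations"][layer] in the snippet above.
    seq_lens: the length of each raw protein sequence in the batch.
    """
    return torch.stack([
        # position 0 is the BOS token; positions beyond 1 + L are padding
        # (or special tokens, depending on the model's alphabet), so only
        # positions 1 .. L are averaged.
        token_representations[i, 1 : 1 + L].mean(dim=0)
        for i, L in enumerate(seq_lens)
    ])

# Example: a batch of two proteins of lengths 5 and 3, embedding dim 4.
reps = torch.randn(2, 7, 4)  # 7 = 1 BOS + 5 residues + 1 padding slot
pooled = per_protein_embedding(reps, seq_lens=[5, 3])
print(pooled.shape)  # torch.Size([2, 4])
```

Each row of `pooled` is then a fixed-size per-protein vector regardless of sequence length, which is what the `--include mean` option of `extract.py` produces.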

I'm closing out this issue, but feel free to reopen if you have any more questions.