facebookresearch / esm

Evolutionary Scale Modeling (esm): Pretrained language models for proteins
MIT License

Question about mean-pooled protein embeddings #423

Closed y-hwang closed 1 year ago

y-hwang commented 1 year ago

Thank you for developing and maintaining this tool! I have a question about the protein-level embeddings (the mean-pooled final hidden layer). If two proteins are structurally similar or orthologous, should we expect their embeddings to be close in L2 or cosine distance? For instance, I am noticing that orthologous proteins with similar structure (>50% sequence identity; see the attached example of conserved ribosomal proteins) do not cluster when I run t-SNE or PCA on the embedding space. How can I explain this observation?

[Attached image: Screenshot 2022-12-05 at 15 39 33]
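For concreteness, here is a minimal sketch of the kind of comparison described above. It is not the exact pipeline used for the screenshot: the arrays are random stand-ins for per-residue final-layer hidden states (not real ESM output), and the helper names `mean_pool`, `cosine_distance`, and `l2_distance` are hypothetical, introduced only for illustration.

```python
import numpy as np


def mean_pool(token_reps: np.ndarray) -> np.ndarray:
    """Average per-residue vectors of shape (L, D) into one vector of shape (D,).

    Assumes special tokens (e.g. BOS/EOS) have already been sliced off.
    """
    return token_reps.mean(axis=0)


def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    """1 - cosine similarity; 0 means same direction, 2 means opposite."""
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def l2_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Euclidean distance between two embedding vectors."""
    return float(np.linalg.norm(a - b))


rng = np.random.default_rng(0)

# Hypothetical stand-ins for the hidden states of two proteins of
# different lengths (120 and 135 residues) with hidden size 16.
reps_a = rng.normal(size=(120, 16))
reps_b = rng.normal(size=(135, 16))

emb_a = mean_pool(reps_a)
emb_b = mean_pool(reps_b)

print("cosine distance:", cosine_distance(emb_a, emb_b))
print("L2 distance:", l2_distance(emb_a, emb_b))
```

With real embeddings, the same two distances could be computed for every pair of orthologs and compared against pairs of unrelated proteins, to check whether the lack of clustering is visible in the raw distances or only appears after the t-SNE/PCA projection.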