Question about mean-pooled protein embeddings

Thank you for developing and maintaining this tool! I have a question about the protein-level embeddings (the mean-pooled final hidden layer). If two proteins are structurally similar or orthologous, should we expect their embeddings to be close in L2 or cosine distance? For instance, I am seeing that orthologous proteins with >50% sequence identity and similar structure (see the attached example of conserved ribosomal proteins) do not cluster when I run t-SNE or PCA on the embedding space. How can this observation be explained? (Screenshot attached, 2022-12-05.)
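For concreteness, here is a minimal sketch of the comparison I am describing. It does not call the esm library: random arrays stand in for the per-residue final-layer representations, and the helper names, the padded length of 150, and the embedding width of 1280 are all illustrative assumptions rather than anything taken from the repository.

```python
import numpy as np


def mean_pool(token_embeddings, seq_len):
    """Average per-residue embeddings over the first seq_len rows,
    ignoring any padding rows that follow."""
    return token_embeddings[:seq_len].mean(axis=0)


def cosine_distance(a, b):
    """1 - cosine similarity between two vectors."""
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def l2_distance(a, b):
    """Euclidean distance between two vectors."""
    return float(np.linalg.norm(a - b))


# Placeholder data (assumption): two proteins of length 120 and 135,
# each padded to 150 rows, with a hypothetical embedding width of 1280.
rng = np.random.default_rng(0)
tokens_a = rng.normal(size=(150, 1280))
tokens_b = rng.normal(size=(150, 1280))

emb_a = mean_pool(tokens_a, seq_len=120)
emb_b = mean_pool(tokens_b, seq_len=135)

print("cosine distance:", cosine_distance(emb_a, emb_b))
print("L2 distance:", l2_distance(emb_a, emb_b))
```

With real embeddings, `tokens_a` and `tokens_b` would be replaced by the model's per-residue outputs, and the two distances are what I would expect to be small for orthologs.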

facebookresearch/esm, issue #423