facebookresearch / esm

Evolutionary Scale Modeling (esm): Pretrained language models for proteins
MIT License
3.16k stars 627 forks

methods to plot UMAP for ~200,000 protein structures #502

Open avilella opened 1 year ago

avilella commented 1 year ago

Are the methods to plot the UMAP of the metagenomics dataset available?

I would like to generate a similar UMAP representation for about 200,000 protein structures.

Any ideas where to start? Thx.

tomsercu commented 1 year ago

The code behind the visualization is not released at this time, but generating the UMAP embeddings is easy. The first step would be to create the NxD matrix of per-protein embeddings for your N=200k proteins and D=1280. Each row is the average of one protein's per-residue embeddings from ESM-2 (in our case even ESM-1b was used, for no good reason).
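A minimal sketch of that first step, assuming the `fair-esm` package and the 650M-parameter ESM-2 checkpoint (33 layers, D=1280). The helper names `mean_pool` and `embed_batch` are made up for illustration, and this is not the code that was used for the released visualization. It averages the last-layer per-residue representations of each sequence, skipping the BOS token at position 0 and anything after the last residue (EOS, padding).

```python
import numpy as np


def mean_pool(token_reprs, seq_lens):
    """Average per-residue embeddings into one vector per protein.

    token_reprs: (B, L, D) array; position 0 of each row is the BOS token,
                 followed by the residues, then EOS and padding.
    seq_lens:    number of residues in each sequence (no special tokens).
    Returns a (B, D) array.
    """
    token_reprs = np.asarray(token_reprs)
    return np.stack(
        [token_reprs[i, 1 : n + 1].mean(axis=0) for i, n in enumerate(seq_lens)]
    )


def embed_batch(model, alphabet, named_seqs, layer=33):
    """Embed a small batch of (id, sequence) pairs with an ESM model.

    Assumed usage (downloads the weights on first call):
        import esm
        model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
        model.eval()
    """
    import torch  # imported lazily so mean_pool works without torch installed

    batch_converter = alphabet.get_batch_converter()
    ids, seqs, tokens = batch_converter(named_seqs)
    with torch.no_grad():
        out = model(tokens, repr_layers=[layer], return_contacts=False)
    reprs = out["representations"][layer].cpu().numpy()
    return ids, mean_pool(reprs, [len(s) for s in seqs])
```

For 200k proteins you would call `embed_batch` on small batches and stack the results with `np.vstack` to get `X`, keeping the IDs in the same order as the rows.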

Then, using the anndata and scanpy libraries, you do something like:

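import scanpy as sc
from anndata import AnnData

adata = AnnData(X)  # X: the (N, D) matrix of per-protein embeddings
adata.obs_names = mgnifyIDs  # one ID per row of X, in the same order
sc.pp.neighbors(adata, n_neighbors=15, use_rep='X')  # kNN graph on the raw embeddings
sc.tl.umap(adata)  # default args gave good results, experimented very little with other settings
assert 'X_umap' in adata.obsm
umap_df = adata.obsm.to_df()  # look for columns index / X_umap1 / X_umap2
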
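
Since the released visualization code is not available, here is one possible way to draw the resulting coordinates with matplotlib. This is only a sketch: `plot_umap`, its parameters, and the output filename are made up for illustration, and the styling is not the one used for the published figure. Small, semi-transparent, rasterized points keep a ~200k-point scatter readable and the output file small.

```python
import numpy as np


def plot_umap(coords, out_path="umap.png", point_size=0.5, alpha=0.3):
    """Scatter-plot an (N, 2) array of UMAP coordinates and save it to out_path."""
    import matplotlib.pyplot as plt

    coords = np.asarray(coords)
    if coords.ndim != 2 or coords.shape[1] != 2:
        raise ValueError("coords must have shape (N, 2)")

    fig, ax = plt.subplots(figsize=(8, 8))
    ax.scatter(
        coords[:, 0],
        coords[:, 1],
        s=point_size,      # tiny markers so dense regions do not saturate
        alpha=alpha,       # transparency shows density
        linewidths=0,
        rasterized=True,   # keeps vector outputs (PDF/SVG) small
    )
    ax.set_xlabel("UMAP 1")
    ax.set_ylabel("UMAP 2")
    ax.set_aspect("equal")
    fig.savefig(out_path, dpi=300)
    plt.close(fig)
    return out_path
```

With the snippet above this would be called as `plot_umap(adata.obsm['X_umap'])`.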
Fede112 commented 6 months ago

Hi! Any updates on the release of the visualization tools' code? Thanks in advance