methods to plot UMAP for ~200,000 protein structures

avilella commented 1 year ago

Are the methods to plot the UMAP of the metagenomics dataset available?

I would like to generate a similar UMAP representation for ~200,000 protein structures.

Any ideas where to start? Thx.

tomsercu commented 1 year ago

The code behind the visualization is not released at this time, but generating the UMAP embeddings is easy. The first step would be to create the NxD matrix of per-protein embeddings for your N=200k proteins, with D=1280 (the averaged embeddings of ESM-2; in our case we actually used ESM-1b, for no good reason).
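As a rough sketch of that first step (not the code used for the released visualization), assuming you already have one per-residue embedding array of shape (L_i, D) per protein, the averaging can be done with NumPy. The `mean_pool` helper and the random toy data are hypothetical stand-ins; with real ESM output you would probably also want to drop special-token positions before averaging.

```python
import numpy as np

def mean_pool(per_residue_embeddings):
    """Average each (L_i, D) per-residue array over its length axis, then stack to (N, D)."""
    pooled = [np.asarray(e, dtype=np.float32).mean(axis=0) for e in per_residue_embeddings]
    return np.stack(pooled)

# Toy stand-in for real model output: 3 proteins of different lengths, D=1280.
rng = np.random.default_rng(0)
toy = [rng.normal(size=(length, 1280)) for length in (50, 120, 300)]

X = mean_pool(toy)
print(X.shape)  # (3, 1280)
```

The resulting `X` has one row per protein, which is the shape the next step expects.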

Then, using the anndata and scanpy libraries, you can do something like:

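from anndata import AnnData
import scanpy as sc

# X: (N, D) matrix of per-protein embeddings; mgnifyIDs: the N matching protein IDs
adata = AnnData(X)
adata.obs_names = mgnifyIDs
sc.pp.neighbors(adata, n_neighbors=15, use_rep='X')  # kNN graph built directly on the embeddings
sc.tl.umap(adata)  # default args gave good results, experimented very little with other settings
assert 'X_umap' in adata.obsm
umap_df = adata.obsm.to_df()  # look for columns index / X_umap1 / X_umap2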
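
Since the plotting code itself is not released, here is a minimal matplotlib sketch of how the 2D coordinates could be drawn. It is an assumption about a reasonable approach, not the method behind the original figure; `plot_umap`, the marker settings, and the output file name are all hypothetical, and the random points only stand in for real coordinates.

```python
from pathlib import Path

import matplotlib
matplotlib.use("Agg")  # headless backend: write to a file instead of opening a window
import matplotlib.pyplot as plt
import numpy as np

def plot_umap(coords, out_path="umap.png"):
    """Scatter-plot an (N, 2) array of UMAP coordinates and save it as a PNG."""
    coords = np.asarray(coords)
    fig, ax = plt.subplots(figsize=(8, 8))
    # Tiny, semi-transparent, rasterized markers keep ~200k points legible.
    ax.scatter(coords[:, 0], coords[:, 1], s=1, alpha=0.3, linewidths=0, rasterized=True)
    ax.set_xlabel("UMAP 1")
    ax.set_ylabel("UMAP 2")
    fig.savefig(out_path, dpi=200, bbox_inches="tight")
    plt.close(fig)
    return Path(out_path)

# Toy stand-in for the real UMAP coordinates.
rng = np.random.default_rng(0)
out_file = plot_umap(rng.normal(size=(1000, 2)), out_path="umap_toy.png")
```

With real data you would pass `adata.obsm["X_umap"]` in place of the random points.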

Fede112 commented 6 months ago

Hi! Any updates on the release of the visualization tools' code? Thanks in advance

facebookresearch / esm

methods to plot UMAP for ~200,000 protein structures #502