Venom-Biochem-Lab / venome

A website to store, visualize, and analyze OSU Venom Biochem Lab proteins
https://venome.cqls.oregonstate.edu
GNU General Public License v3.0

Foldseek structural similarity and clustering #111

Closed xnought closed 7 months ago

xnought commented 8 months ago

Cluster our proteins with Protein Data Bank or UniProt proteins so we can find similar groups. We can also cluster our proteins with the other proteins in our venome. Once we have these clusters, we can search and filter by them too.

xnought commented 8 months ago

Just to bring this into discussion, why would we need to cluster proteins?

First, it would be helpful to filter by something like similar structure. If we had cluster groups, we could filter and narrow down the search. For example, if one cluster was a ring structure, we could search by that and find tons of interesting proteins.

xnought commented 8 months ago

Another reason is that we don't know the function of many of our proteins (if not all). If we cluster them into groups that also contain proteins with known function (like those from the Protein Data Bank), we could predict their function.
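
A rough sketch of what that prediction could look like, assuming we already have cluster assignments from some tool. Everything here is hypothetical: `predict_functions`, the dict layout, and the protein names are made up for illustration, and the majority vote is just one simple way to transfer labels, not something decided in this issue.

```python
from collections import Counter


def predict_functions(clusters, known_functions):
    """Guess functions for unannotated proteins by majority vote in their cluster.

    clusters: dict of cluster id -> list of protein names (hypothetical layout)
    known_functions: dict of protein name -> function label, annotated proteins only
    """
    predictions = {}
    for members in clusters.values():
        # Collect the labels of the annotated proteins in this cluster.
        labels = [known_functions[m] for m in members if m in known_functions]
        if not labels:
            continue  # nothing known in this cluster, so no prediction
        top_label, _ = Counter(labels).most_common(1)[0]
        for m in members:
            if m not in known_functions:
                predictions[m] = top_label
    return predictions


clusters = {
    "c1": ["venome_A", "pdb_1", "pdb_2", "pdb_3"],
    "c2": ["venome_B"],
}
known = {"pdb_1": "protease", "pdb_2": "protease", "pdb_3": "lipase"}
print(predict_functions(clusters, known))
```

Clusters with no annotated members (like `c2` above) produce no prediction, which seems safer than guessing.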

xnought commented 8 months ago

We could also display a clustering view that helps people find similar proteins spatially. I am thinking of embedding all the proteins in 2D, then coloring by cluster. Then we could even overlay the predicted function on top.

xnought commented 8 months ago

Piggybacking on that: we could annotate these large visualizations over time with clusters or edits, so at some point we have a global map of proteins.

xnought commented 8 months ago

That would be a superpower when exploring proteins

xnought commented 8 months ago

A really good idea I'm liking: allow people to do their own clustering with our proteins versus others. We could, for example, filter by foldseek:cluster-name, or by some other clustering method someone came up with, like k-means-alphafold-embeddings:cluster-name. And we could have articles for each cluster and clustering method.
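
A minimal sketch of how a `method:cluster-name` filter could be parsed and applied. The record layout (a `clusters` dict on each protein) and both function names are assumptions for illustration, not the venome schema or API.

```python
def parse_cluster_filter(token):
    """Split a 'method:cluster-name' filter into its two parts."""
    method, sep, cluster = token.partition(":")
    if not sep or not method or not cluster:
        raise ValueError(f"expected 'method:cluster-name', got {token!r}")
    return method, cluster


def filter_proteins(proteins, token):
    """Return names of proteins assigned to the given cluster by the given method."""
    method, cluster = parse_cluster_filter(token)
    return [
        p["name"]
        for p in proteins
        if p.get("clusters", {}).get(method) == cluster
    ]


# Hypothetical records: each protein maps clustering method -> cluster name.
proteins = [
    {"name": "venome_A", "clusters": {"foldseek": "ring"}},
    {"name": "venome_B", "clusters": {"foldseek": "barrel",
                                      "k-means-alphafold-embeddings": "k3"}},
    {"name": "venome_C", "clusters": {}},
]
print(filter_proteins(proteins, "foldseek:ring"))
```

Keying clusters by method name means a new clustering method only adds a key per protein, and the filter syntax stays the same.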

xnought commented 8 months ago

Hit an issue: the Foldseek external databases can only be used for search, so we will need to download all of PDB.