a-r-j / graphein

Protein Graph Library
https://graphein.ai/
MIT License

PDB structure clustering #314

Open pengzhangzhi opened 1 year ago

pengzhangzhi commented 1 year ago

Hi @a-r-j and @amorehead, I would like to propose a feature: PDB structure clustering. It would be useful for structure-related tasks like structure prediction and generation. Would you be interested in this idea and want to talk about how to implement this feature? I am thinking about using foldseek for clustering and creating metadata containing the clustering information. It would be great if you guys have any comments on this feature!

Best, Zhangzhi

pengzhangzhi commented 1 year ago

Hi @amorehead, I tried to reach you by email but got no response. It's possible that my email ended up in your spam folder or that you have not had the chance to respond yet. Is there any way to reach you in private?

a-r-j commented 1 year ago

Hi @pengzhangzhi this is actually something we planned on adding. You can read our discussion about it in #272. We decided to leave it for the initial release to see if it was something that people would want and, well, it seems like it is :grinning:

If you're keen to work on this I'm happy to support :)

pengzhangzhi commented 1 year ago

Yep! Happy to help! I think the first thing is to figure out the exact features we want. I personally have a use case: I want to cluster all PDB structures into N clusters, where N can be very small, like 2. Within each cluster, we can then cluster the members further to derive representative samples. It seems that current tools like foldseek do not support a preset number of clusters.

a-r-j commented 1 year ago

Hmm, what do you think about an approach where you use FoldSeek to get a set of cluster representatives, and then apply a hierarchical clustering method to the pairwise TM-scores between those representative structures?
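
A rough sketch of that second step, to make the idea concrete. It assumes the pairwise TM-scores between representatives have already been computed and symmetrized (for example by averaging the two normalizations), and it uses plain-Python average-linkage agglomerative clustering on the distance `1 - TM-score`. The function name `cluster_representatives`, the IDs, and the scores are all hypothetical and made up for illustration; none of this is an existing graphein or FoldSeek API.

```python
from itertools import combinations


def cluster_representatives(names, tm_scores, n_clusters):
    """Average-linkage agglomerative clustering on the distance 1 - TM-score.

    names: list of representative structure IDs.
    tm_scores: dict mapping frozenset({a, b}) to a symmetric TM-score in [0, 1].
    n_clusters: number of clusters to stop at (the preset N).
    """
    if not 1 <= n_clusters <= len(names):
        raise ValueError("n_clusters must be between 1 and len(names)")

    def distance(a, b):
        # Higher TM-score means more similar, so convert to a distance.
        return 1.0 - tm_scores[frozenset((a, b))]

    # Start with every representative in its own cluster.
    clusters = [[name] for name in names]
    while len(clusters) > n_clusters:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            pairs = [(a, b) for a in clusters[i] for b in clusters[j]]
            avg = sum(distance(a, b) for a, b in pairs) / len(pairs)
            if best is None or avg < best[0]:
                best = (avg, i, j)
        # Merge the two closest clusters (j > i, so deleting j is safe).
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Made-up TM-scores for four hypothetical cluster representatives.
names = ["repA", "repB", "repC", "repD"]
tm_scores = {
    frozenset(("repA", "repB")): 0.9,
    frozenset(("repC", "repD")): 0.8,
    frozenset(("repA", "repC")): 0.2,
    frozenset(("repA", "repD")): 0.3,
    frozenset(("repB", "repC")): 0.25,
    frozenset(("repB", "repD")): 0.2,
}

clusters = cluster_representatives(names, tm_scores, n_clusters=2)
print(clusters)  # [['repA', 'repB'], ['repC', 'repD']]
```

This naive loop is only meant to show the logic; for thousands of representatives, `scipy.cluster.hierarchy.linkage` followed by `fcluster(..., criterion="maxclust")` on the same distance matrix would likely be the more practical choice. Every member of a FoldSeek cluster would then inherit the top-level cluster of its representative.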