fraenkel-lab / OmicsIntegrator2

Prize-Collecting Steiner Forests for Interactomes
https://fraenkel-lab.github.io/OmicsIntegrator2
BSD 3-Clause "New" or "Revised" License
51 stars 24 forks source link

Post Processing: Clustering and Enriching #12

Closed divyaramamoorthy closed 7 years ago

divyaramamoorthy commented 7 years ago

I've compiled a list of what seems to be the most useful clustering and enrichment methods for the output of OmicsIntegrator. Below are the methods that are most frequently used, and should be implemented in OmicsIntegrator2 (listed in order of priority)

Clustering:

As well as clustering, it would also be helpful to have enrichment methods.

If there are any others that you think of, please let me know!

alexlenail commented 7 years ago

@divyaramamoorthy What I really like about the issues section of github is you can decompose some project into individual units of work, and then the issue becomes the central place where the discussion relating to that unit of work takes place. I think that each of the bullet points above are an individual unit of work, so I had been hoping you would write an issue for each of them.

I think it's also worth looking into the networkx documentation for clustering algorithms that make sense in our context which we might get for free (because they're part of networkx). I also believe it makes sense to look for algorithms in the literature we might want to add.

Finally, I don't really understand why k-means and tSNE are in the same bullet. Those seem like very different approaches to me. @AmandaKedaigle @tobi-eh @iamjli am I overlooking something?

AmandaKedaigle commented 7 years ago

I think since there are clearly convos that still need happening about clustering in general (i.e. which ones should we implement and which ones are separate bullets), it's fine as one issue for the moment :joy:

I am unfamiliar with how both k means and tSNE work on graphs, but I'm intrigued! I've heard of running t-SNE to get a two-dimensional mapping and then using K-means to assign to clusters (rather than a human drawing lines), so maybe that's why they're one bullet point? (I don't know if that technique is legit, though, seems like you could just k-means in that scenario)

alexlenail commented 7 years ago

@divyaramamoorthy let us know what you were thinking w.r.t. tSNE.

@iamjli would you propose a strategy for enrichment as post-processing?

divyaramamoorthy commented 7 years ago

For the K-means and tSNE on graph structure question - Mandy, that's exactly the procedure off which I based the combination (using t-SNE for initial dimensionality reduction and then using K-means to delineate clusters), but after looking into it more, I agree with Alex's point that they should be separate methods. I had perceived the combination as potentially an alternative method to reduce dimensionality then cluster the data, although did not realize that a critical issue with combining K-means and t-SNE is that t-SNE intentionally does not provide meaningful distance relationships between the nodes in its clusters, which makes using K-means to differentiate those clusters ineffective.

Another relevant networkx clustering method seems to be the clique percolation method - from what I can understand, this quantifies "densities" in networks based on the distances of adjacent nodes. The method defines communities based on the max union of k-cliques, that can be reached from each other through a series of adjacent k-cliques. This is already implemented with the networkx k_clique_communities method, so I think this would be a potentially helpful addition as a clustering method. Has anyone heard of this/ do you agree that this would be helpful?

I think that the top three clustering methods we should be focusing on are Louvain, Edge betweenness clustering (via the Girvan–Newman algorithm), and the k-clique method. Do you guys agree that these are the top priority/would be most useful/ is there anything that I'm missing? I'd like to implement these methods this weekend and have them done by Tuesday/Wednesday next week if this sounds okay! @AmandaKedaigle @zfrenchee

iamjli commented 7 years ago

@zfrenchee Enrichment should be performed on 1) the general PCSF network solution and 2) individual clusters. See here: https://github.com/fraenkel-lab/PCSF_analysis/tree/master/clustering

The Enrichr query script has a few bugs though.

iamjli commented 7 years ago

We can think of the general network solution as just one giant cluster. So for each clustering method, it should output the necessary files for cytoscape visualization. In particular:

1) edge file with weights 2) node attributes file with protein name, cluster number, node type (steiner, TF, proteomic), robustness, specificity 3) hopefully a file format that can produce something like this: image Mostly just a way to spatially organize the nodes by cluster. The color/node size can be tuned by hand easily.

tobi-eh commented 7 years ago

A thought on @divyaramamoorthy's idea — I was playing around with a similar approach in the past (not on networks though), where, instead of t-SNE, I used the first few dimensions from PCA for dim-reduction pre-clustering. Maybe worth a try?

divyaramamoorthy commented 7 years ago

@iamjli that makes sense, and also agree that it would be important to have a function that writes output so that they're all formatted similarly/ give us the outputs that we need for cytoscape (edge file + node attributes file).

@tobi-eh i've seen similar approaches on (non-network) data for PCA and then clustering - i'll play around with it!

agitter commented 7 years ago

@tobi-eh @divyaramamoorthy As you look at existing clustering implementations, e.g. in scikit-learn, this description of dimensionality reduction followed by k-means sounds like spectral clustering.

The figure @iamjli showed above is exactly what my collaborators often want to see after network analysis, where each subnetwork might be enriched for one or more functional terms.

alexlenail commented 7 years ago

See Divya's notebook! https://github.com/fraenkel-lab/OmicsIntegrator2/blob/master/src/clustering_methods.ipynb

divyaramamoorthy commented 7 years ago

@agitter thank you for the pointer! I'll look into spectral clustering methods and how we can apply them to our networks. Like @zfrenchee said, I uploaded the code for clustering methods - please let me know if you have any advice/recommendations for improving it!

agitter commented 7 years ago

@zfrenchee @divyaramamoorthy Very nice!

I'm not necessarily suggesting we implement spectral clustering, but I wanted to leave the term here in case you do want to try that method and it helps you find existing implementations. I participated in the recent Disease Module Identification DREAM Challenge, and the best method there used spectral clustering.

alexlenail commented 7 years ago

@agitter @divyaramamoorthy There are also a great many spectral clustering techniques, which we'll need to look through to find a sensible one. @iamjli and I were trying to grok Graclus last week... Like @divyaramamoorthy and I were just discussing, networkx is headed towards a big 2.0 release in the near future so we can ask them to implement graph clustering algorithms (or open some PRs to them ourselves). It might make sense to augment the spectral clustering they already support (which seems to be very sparse) with methods that interest us.

alexlenail commented 7 years ago

@agitter do you know which spectral clustering techniques are in vogue these days?

alexlenail commented 7 years ago

In other news, I just wanted to leave behind my enrich gist: https://gist.github.com/zfrenchee/d74890f978c172cbee15975aff949753 I've gotten rid of the bugs which were in the conventional lab version, I think

agitter commented 7 years ago

@zfrenchee There is an R package for spectral clustering that is especially popular, but I'm not sure on the Python side. I would default to scikit-learn's implementation because I think it is a dependency many users already have installed.

The choice of kernel, which scikit-learn calls affinity, is probably more important than the implementation.

alexlenail commented 7 years ago

I just saw a cite with some sort of clustering technique which might interest us: http://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-12-S1-S44

alexlenail commented 7 years ago

Thanks everyone! This has been merged into master/prod. (#30)