SystemsGenetics / KINC

Knowledge Independent Network Construction
MIT License

add cluster mean and variance to CMX file. #73

Status: Closed (spficklin closed this issue 4 years ago)

spficklin commented 5 years ago

It would be really useful if we could save, for each cluster, the center point (mean) and the covariance matrix. For a pairwise cluster this would add 6 floating-point numbers per cluster to the CMX file: 2 for the mean and 4 for the 2x2 covariance matrix.

Benefits

  1. We could use that information if new data is added without having to re-run GMMs. We can calculate into which cluster new samples should go.
  2. It will be useful for network visualization to see how multiple edges between two nodes are different in terms of their expression.
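
As a rough illustration of benefit 1, here is a minimal Python sketch (not KINC code) of how a new sample could be assigned to an existing cluster from the stored mean and covariance alone. The function names and the dictionary layout are hypothetical, and the sketch assumes each cluster is a bivariate Gaussian and ignores mixture weights, since only the mean and covariance are proposed for storage.

```python
import math


def log_density(point, mean, cov):
    """Log of the bivariate normal density at `point`.

    `cov` is a 2x2 matrix given as ((sxx, sxy), (syx, syy)).
    """
    dx = point[0] - mean[0]
    dy = point[1] - mean[1]
    (sxx, sxy), (syx, syy) = cov
    det = sxx * syy - sxy * syx
    if det <= 0:
        raise ValueError("covariance matrix must be positive definite")
    # Squared Mahalanobis distance via the closed-form 2x2 inverse.
    maha = (syy * dx * dx - (sxy + syx) * dx * dy + sxx * dy * dy) / det
    return -0.5 * (maha + math.log(det)) - math.log(2.0 * math.pi)


def assign_cluster(point, clusters):
    """Return the index of the cluster with the highest density at `point`.

    Mixture weights are ignored (an assumption of this sketch).
    """
    scores = [log_density(point, c["mean"], c["cov"]) for c in clusters]
    return max(range(len(clusters)), key=scores.__getitem__)


# Hypothetical stored parameters for two clusters of one gene pair.
clusters = [
    {"mean": (0.0, 0.0), "cov": ((1.0, 0.0), (0.0, 1.0))},
    {"mean": (5.0, 5.0), "cov": ((1.0, 0.5), (0.5, 1.0))},
]
new_sample = (4.8, 5.3)
cluster_index = assign_cluster(new_sample, clusters)
```

If cluster weights were stored as well, the log of each weight could be added to the corresponding score before taking the maximum.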

bentsherman commented 5 years ago

That would be cool. On a related note, the CMX file is currently able to store multiple types of correlations but it isn't used at all by KINC. Is that a feature we want to keep?

spficklin commented 5 years ago

KINC v1.0 lets you run multiple correlations at one time. That was useful back then because we were comparing Pearson and Spearman correlations, and it was faster to compute them together than to re-run the networks. With KINC v3.0 running much faster, I'm not sure we need to support that. It's quick enough to rebuild the network.

bentsherman commented 5 years ago

Maybe we should implement this feature as a new data type, something like a cluster parameter matrix (CPM). It could be an optional output file of the similarity step, so users who don't need it can skip it, and that way we wouldn't have to modify the CMX or CCM format.

bentsherman commented 5 years ago

Additionally, this feature could be implemented in a separate analytic from similarity. If you have the CCM file then you can compute the mean and covariance for each cluster from the sample string.
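
A minimal Python sketch of that computation, for illustration only (it is not the KINC implementation). It assumes a per-cluster sample string in which the character '1' marks a sample that belongs to the cluster and any other character means the sample is excluded. The function name is hypothetical, and the n - 1 normalization is a choice made for this sketch.

```python
def cluster_parameters(x, y, sample_string):
    """Mean and sample covariance of the samples flagged '1'.

    `x` and `y` are the expression values of the two genes, one value per
    sample, in the same order as the characters of `sample_string`.
    """
    if not (len(x) == len(y) == len(sample_string)):
        raise ValueError("x, y and sample_string must have equal length")

    members = [i for i, flag in enumerate(sample_string) if flag == "1"]
    n = len(members)
    if n < 2:
        raise ValueError("need at least two samples in the cluster")

    mean_x = sum(x[i] for i in members) / n
    mean_y = sum(y[i] for i in members) / n

    # Unbiased sample covariance (divides by n - 1).
    sxx = sum((x[i] - mean_x) ** 2 for i in members) / (n - 1)
    syy = sum((y[i] - mean_y) ** 2 for i in members) / (n - 1)
    sxy = sum((x[i] - mean_x) * (y[i] - mean_y) for i in members) / (n - 1)

    return (mean_x, mean_y), ((sxx, sxy), (sxy, syy))


# Four samples; the last one is not part of this cluster.
mean, cov = cluster_parameters(
    [1.0, 2.0, 3.0, 100.0],
    [2.0, 4.0, 6.0, -5.0],
    "1110",
)
```

The return value holds the same 6 numbers discussed above: 2 for the mean and 4 for the 2x2 covariance matrix.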

spficklin commented 5 years ago

Yeah, I like this idea. Maintains backwards compatibility too. I don't have a preference on either approach.

bentsherman commented 4 years ago

I've implemented this feature in the cpm-data-type branch but it's not working quite yet. I get this error when I try to load a CCM:

Data type given for creation of new data object is invalid.
File: ../../src/core/ace_dataobject.cpp
Function: void Ace::DataObject::makeData(const QString&, const QString&)
Line: 693

I'm at a bit of a loss as to why adding the CPM data type threw off the CCM data type. @4ctrl-alt-del can you look at my branch and see if I did anything wrong? Particularly with the data factory and analytic factory. Here's the branch:

https://github.com/SystemsGenetics/KINC/compare/cpm-data-type

bentsherman commented 4 years ago

I took another look at this branch and was able to fix the issues I had. I just pushed the changes to master. KINC now has an export-cpm analytic, which takes EMX and CCM input files and produces a CPM file that can be viewed with qkinc.