SystemsGenetics / KINC

Knowledge Independent Network Construction
MIT License

Cluster and Num_Clusters in report #91

Closed spficklin closed 4 years ago

spficklin commented 5 years ago

I have noticed that KINC v3 has some problems reporting the Cluster and Num_Clusters values. The Cluster column is now zero-indexed: it starts at 0 instead of 1. With KINC v1, cluster indexes in the report always started at 1. That should be a simple fix to put back.
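For illustration, the one-based labeling described above could be sketched as follows. This is a minimal sketch, not code from KINC's extract.cpp: the helper names `reportClusterLabel` and `formatClusterFields` are hypothetical, and the tab-separated layout is an assumption.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not from KINC): converts a zero-based internal
// cluster index into the one-based label that KINC v1 reported.
int reportClusterLabel(int zeroBasedIndex)
{
    assert(zeroBasedIndex >= 0);  // internal indices are never negative
    return zeroBasedIndex + 1;
}

// Hypothetical helper: formats the Cluster and Num_Clusters fields of one
// report row as tab-separated text (the delimiter is an assumption).
std::string formatClusterFields(int zeroBasedIndex, int numClusters)
{
    return std::to_string(reportClusterLabel(zeroBasedIndex))
        + "\t" + std::to_string(numClusters);
}
```

With this, the first cluster of a pair (internal index 0) is written as cluster 1, matching the v1 convention.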

Also, the Num_Clusters value isn't the original number of clusters that were found in the pair. I think it got changed to the number of significant clusters. We need to verify and make sure that it still has those original meanings. It would be nice to have it put back to be the original number of clusters so that we can make some reports about how many multi-modal clusters were found... not just those that are significantly correlated.

FYI...when I wrote the corrpower analytic I maintained the Tripal v3 way of reporting the Cluster and Num_Cluster for consistency.

bentsherman commented 5 years ago

The Cluster column is easy to fix, but I don't know about Num_Clusters. Clusters with correlation < 0.50 are not saved to the CCM / CMX files, so the Extract analytic has no way to determine how many clusters there were originally in a gene pair.

bentsherman commented 5 years ago

I think the easiest way to do what you want -- preserve the original number of clusters found by GMM -- is to implement #73, which will add the GMM parameters to the CMX file. With that information we will know the original number of clusters.

bentsherman commented 5 years ago

Well, actually that issue won't make it any easier to resolve this one.

I think what we'd have to do is save or discard gene pairs holistically. That is, a gene pair should be either completely saved (if any of its correlations are above 0.5) or completely discarded (if all of its correlations are below 0.5). That will allow us to preserve the original cluster number, and it feels like a more elegant way to do things anyway. I will write some code to see how much this change would increase the size of the CCM and CMX files.
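The difference between the two policies can be sketched like this. It is only an illustration under stated assumptions: `GenePair` is a hypothetical stand-in for KINC's real data structures, and comparing the absolute correlation with `>=` is an assumption about how the 0.5 threshold is applied.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical stand-in for one gene pair: one correlation per cluster.
struct GenePair {
    std::vector<float> correlations;
};

// Per-cluster policy (current behavior as described in this thread):
// keep only the clusters whose correlation passes the threshold.
std::vector<float> filterPerCluster(const GenePair& pair, float minCorr)
{
    std::vector<float> kept;
    for (float c : pair.correlations)
        if (std::fabs(c) >= minCorr)  // assumption: absolute value, inclusive
            kept.push_back(c);
    return kept;
}

// Holistic policy (the proposal): keep every cluster of the pair if at
// least one passes the threshold, otherwise keep nothing.
std::vector<float> filterHolistic(const GenePair& pair, float minCorr)
{
    for (float c : pair.correlations)
        if (std::fabs(c) >= minCorr)
            return pair.correlations;  // whole pair saved, cluster count intact
    return {};                         // whole pair discarded
}
```

For a pair with correlations 0.9, 0.2 and -0.1, the per-cluster policy keeps one cluster while the holistic policy keeps all three, which is why the original cluster count survives but the files grow.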

spficklin commented 5 years ago

What if we just add a new value to each pair that has the original number of clusters? That should require less space, as we wouldn't need to store unnecessary pair clusters. If we make it optional then previous files will still be backwards compatible.

bentsherman commented 5 years ago

That thought occurred to me as a solution for #73 but I don't think that would be feasible with the way that PairwiseMatrix (base class of CCM and CMX) is designed. PairwiseMatrix uses the number and size of individual clusters, rather than whole pairs, to move through the file quickly. If we add a property that exists only for each pair then we lose that regularity. So I'm hesitant to go down that path.

bentsherman commented 5 years ago

I was able to try out my idea pretty quickly. Here are the results:

-rw-rw-r-- 1 eceftl3 eceftl3   4947302 Aug  1 09:45 data/Yeast-1000.master.ccm
-rw-rw-r-- 1 eceftl3 eceftl3    672108 Aug  1 09:45 data/Yeast-1000.master.cmx
-rw-rw-r-- 1 eceftl3 eceftl3   7948441 Aug  1 09:47 data/Yeast-1000.preserve-num-cluster.ccm
-rw-rw-r-- 1 eceftl3 eceftl3   1050917 Aug  1 09:47 data/Yeast-1000.preserve-num-cluster.cmx

So about a 60% increase in size... yikes. The runtime was about the same though.
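The 60% figure can be checked against the byte counts in the listing above. A minimal sketch, where `percentIncrease` is a hypothetical helper written for this check:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: relative growth from `before` to `after`, in percent.
double percentIncrease(std::int64_t before, std::int64_t after)
{
    assert(before > 0);  // avoid division by zero
    return 100.0 * static_cast<double>(after - before)
                 / static_cast<double>(before);
}
```

Calling `percentIncrease(4947302, 7948441)` for the CCM file gives roughly 60.7%, and `percentIncrease(672108, 1050917)` for the CMX file gives roughly 56.4%, so the two files together grow by about 60%.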

spficklin commented 4 years ago

For a KINC release this month, how about we remove the Num_Clusters column from the output? This should be an easy fix. The only problem will be with our downstream tools if they expect that column. I can fix KINC.R fairly easily.

spficklin commented 4 years ago

I'm thinking we should take out all of these columns:

Each of those is specific to the gene pair and not to the edge. For example, the missing samples is a bit confusing because a cluster (i.e. edge) effectively has no missing values. The missing values are outside of the cluster. Also, we're effectively storing duplicate information repeatedly and bloating our network files.

Aside from the number of clusters we can determine all that information using the sample string for the edge.

I think we should also rename Cluster_Samples to Cluster_Size.

bentsherman commented 4 years ago

@spficklin I have renamed Cluster_Samples to Cluster_Size and removed the summary statistics you mentioned, with the exception of Num_Clusters. I don't want to remove it as a quick fix if it's going to be added back in later. Please refer to my previous comments about saving gene pairs holistically (rather than discarding individual clusters within a pair) and let me know what you think.

bentsherman commented 4 years ago

Also, while we're talking about column names for the network file, two more questions:

  1. Should we change sc to something more general like Correlation or Similarity?
  2. Do we need the Interaction column? KINC just always sets it to co.

spficklin commented 4 years ago

Thanks @bentsherman. I don't think we will bring back the Num_Clusters column. So let's dump it. I agree we should rename sc to something like Similarity_Score (it may not always be correlation if we put MI back into KINC). And we do need the Interaction column as it is needed when loading the network into Cytoscape.

spficklin commented 4 years ago

I just had a major merge conflict in the extract.cpp file so I'll take care of these adjustments in that branch.

spficklin commented 4 years ago

Okay, I've made the additional changes in the cscm-work branch after I merged in what you had done on the master branch. So, once we merge that branch into master it all should be up to date as we've discussed.

Also, I forgot to respond to your question about saving gene pairs. Are you referring to the new CPM matrix you are working on? If so, yes, I think that's a great solution. If I didn't quite get what you meant, let me know, or close this out if you have nothing more to add.

bentsherman commented 4 years ago

@spficklin I was saying that if you want to preserve the original cluster number, you'll have to save the entire gene pair, even if just one cluster has a significant similarity score. When I tested on Yeast I observed a 60% increase in output file size but the runtime was about the same. So it's up to you, whether you think it's worth the cost in file size.

spficklin commented 4 years ago

No, let's not worry about it for now. I don't want to increase the file size that much. If it comes back up as a specific need then we can address it then.