Closed spficklin closed 4 years ago
The Cluster column is easy to fix, but I don't know about Num_Clusters. Clusters with correlation < 0.50 are not saved to the CCM / CMX files, so the Extract analytic has no way to determine how many clusters there were originally in a gene pair.
I think the easiest way to do what you want -- preserve the original number of clusters found by GMM -- is to implement #73, which will add the GMM parameters to the CMX file. With that information we will know the original number of clusters.
Well, actually that issue won't make it any easier to resolve this one.
I think what we'd have to do is to save or discard gene pairs holistically. That is, a gene pair should be either completely saved (if any of its correlations are above 0.5) or completely discarded (if all of its correlations are below 0.5). That will allow us to preserve the original cluster number, and it feels like a more elegant way to do things anyway. I will write some code to see how much this change would increase the size of the CCM and CMX files.
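To make the two policies concrete, here is a minimal Python sketch (not KINC code). The function names are hypothetical, and comparing the absolute value of the correlation against the threshold is an assumption; the thread only says that clusters below 0.50 are dropped.

```python
from typing import List

THRESHOLD = 0.5  # correlation cutoff mentioned in this thread


def filter_per_cluster(correlations: List[float], threshold: float = THRESHOLD) -> List[float]:
    """Current behaviour as described above: each cluster is kept or dropped on its own.

    ASSUMPTION: the magnitude of the correlation is compared to the threshold.
    """
    return [c for c in correlations if abs(c) >= threshold]


def filter_whole_pair(correlations: List[float], threshold: float = THRESHOLD) -> List[float]:
    """Proposed behaviour: keep every cluster of the pair if any one passes, else keep none."""
    if any(abs(c) >= threshold for c in correlations):
        return list(correlations)
    return []


# A hypothetical gene pair with three clusters, only one of which is significant.
pair = [0.82, 0.31, 0.12]
print(len(filter_per_cluster(pair)))  # 1 cluster survives; the original count of 3 is lost
print(len(filter_whole_pair(pair)))   # 3 clusters survive; the original count is preserved
```

The second policy is what lets Num_Clusters keep its original meaning, at the cost of storing clusters that are not significant.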
What if we just add a new value to each pair that holds the original number of clusters? That should require less space, as we wouldn't need to store unnecessary pair clusters. If we make it optional, then previous files will still be backwards compatible.
That thought occurred to me as a solution for #73 but I don't think that would be feasible with the way that PairwiseMatrix (base class of CCM and CMX) is designed. PairwiseMatrix uses the number and size of individual clusters, rather than whole pairs, to move through the file quickly. If we add a property that exists only for each pair then we lose that regularity. So I'm hesitant to go down that path.
I was able to try out my idea pretty quickly, here we go:
-rw-rw-r-- 1 eceftl3 eceftl3 4947302 Aug 1 09:45 data/Yeast-1000.master.ccm
-rw-rw-r-- 1 eceftl3 eceftl3 672108 Aug 1 09:45 data/Yeast-1000.master.cmx
-rw-rw-r-- 1 eceftl3 eceftl3 7948441 Aug 1 09:47 data/Yeast-1000.preserve-num-cluster.ccm
-rw-rw-r-- 1 eceftl3 eceftl3 1050917 Aug 1 09:47 data/Yeast-1000.preserve-num-cluster.cmx
So about a 60% increase in size... yikes. The runtime was about the same though.
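For reference, the increase can be worked out from the byte counts in the listing above. This is just a quick arithmetic check in Python; the helper name is hypothetical.

```python
# File sizes in bytes, copied from the listing above: (master, preserve-num-cluster).
sizes = {
    "ccm": (4947302, 7948441),
    "cmx": (672108, 1050917),
}


def percent_increase(before: int, after: int) -> float:
    """Relative growth of `after` over `before`, in percent."""
    return (after - before) / before * 100.0


for name, (before, after) in sizes.items():
    print(f"{name}: {percent_increase(before, after):.1f}% larger")
# ccm: 60.7% larger
# cmx: 56.4% larger
```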
For a KINC release this month, how about we remove the Num_Clusters column from the output? This should be an easy fix. The only problem will be with our downstream tools if they expect that column. I can fix KINC.R fairly easily.
I'm thinking we should take out all of these columns:
Num_Clusters
Missing_Samples
Pair_Outliers
Too_Low
Each of those is specific to the gene pair and not to the edge. For example, the missing samples is a bit confusing because a cluster (i.e. edge) effectively has no missing values. The missing values are outside of the cluster. Also, we're effectively storing duplicate information repeatedly and bloating our network files.
Aside from the number of clusters we can determine all that information using the sample string for the edge.
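As a rough illustration of recovering those per-pair counts from an edge's sample string, here is a Python sketch. The digit codes below are an assumption about the sample-string encoding (they are not stated in this thread) and should be checked against the KINC documentation; the function name is hypothetical.

```python
from collections import Counter
from typing import Dict

# ASSUMED one-character-per-sample codes; verify against the KINC docs before relying on them.
IN_CLUSTER = "1"  # sample belongs to this cluster (edge)
TOO_LOW = "6"     # sample removed for low expression
OUTLIER = "7"     # sample removed as a pairwise outlier
MISSING = "9"     # sample has a missing value in one of the genes


def summarize(sample_string: str) -> Dict[str, int]:
    """Count how many samples fall into each category for one edge."""
    counts = Counter(sample_string)
    return {
        "Cluster_Size": counts[IN_CLUSTER],
        "Too_Low": counts[TOO_LOW],
        "Pair_Outliers": counts[OUTLIER],
        "Missing_Samples": counts[MISSING],
    }


# A made-up sample string for ten samples.
print(summarize("1101907161"))
# {'Cluster_Size': 5, 'Too_Low': 1, 'Pair_Outliers': 1, 'Missing_Samples': 1}
```

Num_Clusters is the one value that cannot be rebuilt this way, since the string describes a single cluster and says nothing about the clusters that were discarded.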
I think we should also rename Cluster_Samples to Cluster_Size.
@spficklin I have renamed Cluster_Samples to Cluster_Size and I removed the summary statistics you mentioned, with the exception of Num_Clusters. I don't want to remove it as a quick fix if it's going to be added back in later. Please refer to my previous comments about saving gene pairs holistically (rather than discarding individual clusters within a pair) and let me know what you think.
Also, while we're talking about column names for the network file, two more questions:
Can we rename sc to something more general like Correlation or Similarity?
Do we need the Interaction column? KINC just always sets it to co.
Thanks @bentsherman. I don't think we will bring back the Num_Clusters column, so let's dump it. I agree we should rename sc to something like Similarity_Score (it may not always be correlation if we put MI back into KINC). And we do need the Interaction column, as it is required when loading the network into Cytoscape.
I just had a major merge conflict in the extract.cpp file so I'll take care of these adjustments in that branch.
Okay, I've made the additional changes in the cscm-work branch after I merged in what you had done on the master branch. So, once we merge that branch into master it all should be up to date as we've discussed.
Also, I forgot to respond to your question about saving gene pairs. Are you referring to the new CPM matrix you are working on? If so, yes, I think that's a great solution. If I didn't quite get what you meant, let me know, or close this out if you have nothing more to add.
@spficklin I was saying that if you want to preserve the original cluster number, you'll have to save the entire gene pair, even if just one cluster has a significant similarity score. When I tested on Yeast I observed a 60% increase in output file size but the runtime was about the same. So it's up to you, whether you think it's worth the cost in file size.
No, let's not worry about it for now. I don't want to increase the file size that much. If it comes back up as a specific need then we can address it then.
I have noticed that KINC v3 has some problems reporting the Cluster and Num_Clusters columns. The cluster index is now zero-based, so it starts at 0 instead of 1. With KINC v1, cluster indexes in the report always started at 1. That should be a simple fix to put back.
Also, the Num_Clusters value isn't the original number of clusters that were found in the pair. I think it got changed to the number of significant clusters. We need to verify and make sure that it still has its original meaning. It would be nice to have it put back to the original number of clusters so that we can make some reports about how many multi-modal clusters were found... not just those that are significantly correlated.
FYI... when I wrote the corrpower analytic I maintained the Tripal v3 way of reporting the Cluster and Num_Clusters columns for consistency.