bentsherman closed 6 years ago
Alright, so I duplicated your steps and I see the same thing. From what I can tell, there is no gene pair with more than one cluster.
Also... why are you saving any clusters if there is just one? Doesn't that mean there are no clusters and you just take all the samples (minus NaN) from each? I hope so, because otherwise the size of this file is way beyond the max. Given that the GEM created a 12 GB file, other data sets Stephen has given me would probably make a CCM file greater than 10 TERABYTES :O
Also I noticed the majority of gene pairs have data in them because you are adding just a single cluster, which completely defeats the purpose of a sparse matrix. Am I missing something?
Oh wait... I see you have an input for minimum and the default is 1... maybe it should be 2?
I will be running a more detailed debugging test now and see if your analytic actually calls CCMatrix::Pair::addCluster(int amount) with amount greater than 1.
My Yeast CCM is only 4.5 GB, not sure what the discrepancy is?
If there is only one cluster then I still save the sample mask because it contains information about NaN samples... now that I think about it, that's not really necessary because an analytic reading the CCM could just scan the EMX for NaN samples. I think I'll go ahead and implement that, it would probably save a lot of space!
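To illustrate the idea of recovering the NaN information from the expression data instead of storing it in the CCM, here is a minimal sketch. The function name and the use of plain float vectors are assumptions for illustration only; the real EMX interface will differ.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of KINC): given the expression values of two
// genes across all samples, flag each sample as usable for the pair only if
// neither gene has a NaN value for it.
std::vector<bool> validSamples(const std::vector<float>& geneX,
                               const std::vector<float>& geneY)
{
   // both genes must cover the same set of samples
   assert(geneX.size() == geneY.size());

   std::vector<bool> valid(geneX.size());
   for (std::size_t i = 0; i < geneX.size(); ++i)
   {
      valid[i] = !std::isnan(geneX[i]) && !std::isnan(geneY[i]);
   }
   return valid;
}
```

With something like this, a pair that has only one cluster would not need a stored sample mask at all, since a reader could rebuild the mask from the two expression rows.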
Also I just ran K-means again and I didn't see any pairs with more than 1 cluster... so I guess whatever I was seeing before was just due to buggy behavior. There is definitely still some code review to be had with both clustering analytics because they are hard to evaluate, so there may still be subtle bugs in the code.
Oh, and with regard to the default for minimum clusters, the default behavior from KINCv1 is to go from 1 to 5, which I think makes sense. If you don't consider the 1-cluster model then KINC will always assume a model with multiple clusters and then the CCM will definitely be too large!
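As a rough sketch of what "go from 1 to 5" means for model selection: fit one model per cluster count and keep the one with the best score. The scoring criterion itself is not shown and is an assumption here (lower is better), as is the function name.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch: scores[i] is a model-selection score for the model
// with K = minK + i clusters, where a lower score is assumed to be better.
// Ties go to the smallest K because the comparison is strict.
int selectClusterCount(const std::vector<double>& scores, int minK)
{
   assert(!scores.empty());

   std::size_t best = 0;
   for (std::size_t i = 1; i < scores.size(); ++i)
   {
      if (scores[i] < scores[best])
      {
         best = i;
      }
   }
   return minK + static_cast<int>(best);
}
```

With minK set to 1 the single-cluster model can win. With minK set to 2 the function can never return 1, so every pair would be saved with at least two clusters.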
ok that makes sense, and yes the NaN data is already in the expression matrix so leaving those out would save tons of space.
I am 70% through running kmeans with this added debug code:
    void CCMatrix::Pair::addCluster(int amount) const
    {
       if ( amount > 1 )
       {
          int yolo {1}; //<-- BREAKPOINT HERE
          yolo -= 2;
       }

       // keep adding a new list of sample masks for given amount
       while ( amount-- > 0 )
       {
          _sampleMasks.push_back({});
          for (int i = 0; i < _cMatrix->_sampleSize ;++i)
          {
             _sampleMasks.back().push_back(0);
          }
       }
    }
It has not hit that breakpoint yet so kmeans has not written anything higher than one cluster yet.
Okay, so I made the change and it reduced K-means runtime to 1 minute on my machine. The downside is that K-means never finds multiple clusters. :( I'll investigate later to see why the 1-cluster model always does better.
I will also see how this affects the GMM analytic momentarily.
Alright. I just finished my above debugging test and can confirm kmeans never adds more than a single cluster to its output CCM.
GMM also did not ever find more than one cluster with Yeast. I will investigate both analytics to see why.
How many samples are in the yeast dataset you're using?
The Yeast dataset has 188 samples.
I see okay. I don't think K-means is a priority. I tried it years ago when I was looking for a good clustering method and I wasn't happy with the clusters it identified. So, if it's not working perfectly I think we could put it on the back burner.... But, I'm open to look at it again if you think it's working well.
Well I mainly just use it to help me implement the GMM. But in this case since k-means and GMM are both not finding multiple clusters, maybe it's a piece of code that they have in common that is wrong.
Or could it be the data? Would you expect to find clusters in a dataset like Yeast based on your experience?
I see... it's both methods not finding clusters. Yeah, I would expect that in a sample size that large there would be clusters. We can run it against KINC v1.0 to provide a baseline and see.
It turns out that the serial implementations of both clustering analytics are finding multiple clusters for many gene pairs. It's only the OpenCL implementations that aren't.
I think I have found the root cause, which is my OpenCL implementation of rand(). I use the same implementation in kmeans.cl and gmm.cl, which is why this problem occurs in both analytics, but only in the OpenCL implementations. Both analytics use RNG to seed the component means from the data, but rand() just returns 0 every time, so the means are always initialized to the same value, so every model ends up the same, so the first model (K=1) is always selected for each gene pair.
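A small sketch of the failure mode described here, written as host-side C++ rather than OpenCL. The seedMeans function is hypothetical and only stands in for whatever the kernels do; it picks each initial mean from a data point chosen by the RNG.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical sketch: seed K component means by picking K data points at
// indices drawn from the supplied random number generator.
std::vector<float> seedMeans(const std::vector<float>& data, int K,
                             const std::function<int()>& rng)
{
   assert(!data.empty() && K > 0);

   std::vector<float> means;
   for (int k = 0; k < K; ++k)
   {
      // map the random value onto a valid sample index
      std::size_t i = static_cast<std::size_t>(rng()) % data.size();
      means.push_back(data[i]);
   }
   return means;
}
```

If rng always returns 0, every mean is a copy of data[0] regardless of K, so all the components start out identical.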
The point is, I need to make sure that my rand() implementation works. Also, it could be that rand() works but the state variable, which I initialize with get_global_id, is wrong (see kmeans.cl or gmm.cl to see what I'm talking about). Will investigate further.
It was indeed the way that I was seeding the RNG. Something about OpenCL semantics, but every kernel was seeding rand() with 0, which just produced a sequence of 0's. Not very random.
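For anyone wondering how a zero seed can produce nothing but zeros: some generators have 0 as a fixed point. The sketch below uses Marsaglia's xorshift32 purely as an illustration; it is an assumption, not necessarily the generator that was in kmeans.cl or gmm.cl. The seedFromId helper is also hypothetical and shows one way to avoid handing any work item a zero state.

```cpp
#include <cassert>
#include <cstdint>

// xorshift32 step (illustrative only). A state of 0 maps to 0 forever,
// because shifting and XOR-ing zero always yields zero.
std::uint32_t xorshift32(std::uint32_t& state)
{
   state ^= state << 13;
   state ^= state >> 17;
   state ^= state << 5;
   return state;
}

// Hypothetical guard: derive a per-work-item seed from its ID and make sure
// the result is never zero.
std::uint32_t seedFromId(std::uint32_t globalId)
{
   std::uint32_t s = globalId * 2654435761u + 1u;
   return s != 0u ? s : 1u;
}
```

A generator with an additive constant does not have this fixed point, which is one reason the choice of generator and the choice of seed have to be checked together.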
Anyway, I just replaced what I had with a POSIX example implementation of rand(). We don't need a cryptographically secure RNG, we just need something with a decent period, so we should have that now. We can review this rand() implementation later if we need to.
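For reference, the example rand() given in the POSIX specification is a linear congruential generator. Below is a sketch of it in C++ with the state passed in explicitly, on the assumption that each work item keeps its own state variable; the function name posixRand is made up for this example and the actual kernel code may differ.

```cpp
#include <cassert>
#include <cstdint>

// Sketch of the POSIX example rand(), adapted to take the state by reference
// so that every caller owns its own sequence. Returns a value in [0, 32767].
int posixRand(std::uint32_t& state)
{
   // linear congruential step; unsigned arithmetic wraps modulo 2^32
   state = state * 1103515245u + 12345u;

   // discard the low 16 bits and keep the next 15 bits
   return static_cast<int>((state / 65536u) % 32768u);
}
```

Because of the additive constant 12345, a state of 0 does not stay at 0, so even a zero seed moves on to other values after the first call.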
I should preface by saying that I could be doing something wrong here, and I'm not an expert on the CCMatrix source code, but as far as I can tell, everything looks right in the clustering analytics when writing to the cluster matrix. I ran the Yeast GEM through K-means, and I know that some gene pairs produced multiple clusters, but when I view the cluster matrix I never see more than one sample mask per pair. I think either the clustering analytic isn't saving the clusters properly or the CCMatrix itself isn't displaying the clusters properly. I'd like to see if anyone else can produce the same cluster matrix as me and see for themselves:
1. Import the Yeast GEM to emx
The relevant code can be seen in KMeans::savePair() and KMeans::runReadBlock(). The GMM analytic uses the same code but K-means is much quicker to test.