drisso / mbkmeans

K-means clustering for large single-cell datasets
MIT License
9 stars 3 forks source link

Inconsistent behaviour across Bioc versions #18

Open drisso opened 2 years ago

drisso commented 2 years ago

This was reported by @csoneson on the Bioc Slack. Copying it here.

Initial example:

set.seed(123)
x <- matrix(rnorm(10000 * 50), nrow = 10000)
v <- do.call(rbind, lapply(1:50, function(i) {
  cls <- bluster::clusterRows(x, BLUSPARAM = mbkmeans::MbkmeansParam(centers = 10))
  data.frame(min.size = min(table(cls)),
             max.size = max(table(cls)))
}))
summary(v$min.size) 

Gives:

> summary(v$min.size)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  913.0   932.2   949.5   946.3   960.0   970.0

in Bioc 3.12 and

> summary(v$min.size)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    1.0     1.0   782.5   456.8   859.8   924.0

in Bioc 3.13. The results of 3.12 coincide with bluster::KmeansParam() (which gives consistent results across versions), but the results of 3.13 coincide with mbkmeans::mbkmeans() (which gives consistent results across versions).

There might be something going on with mbkmeans::MbkmeansParam

drisso commented 2 years ago

See here for the original report by Charlotte.

drisso commented 2 years ago

I should note that in all Bioc versions >= 3.13 the results are consistent between bluster::MbkmeansParam() and mbkmeans::MbkmeansParam() and equal to the second output above, which is inconsistent with kmeans, but I don't think that there are any guarantees that it should.

drisso commented 2 years ago

Quick update.

I can reproduce the problem with Bioc docker images.

If I run the following code:

# mbkmeans
set.seed(123)
cls <- bluster::clusterRows(x, BLUSPARAM = mbkmeans::MbkmeansParam(centers = 10))
set.seed(123)
mbk <- mbkmeans(t(x), clusters=10)
set.seed(123)
mb <- mini_batch(x, clusters=10, batch_size = 500, max_iters=100, init_fraction = 500/nrow(x))
set.seed(123)
km <- bluster::clusterRows(x, BLUSPARAM = bluster::KmeansParam(centers = 10))

In Bioc 3.12, I get:

> all(mbk$Clusters == cls)
[1] FALSE
> all(mbk$Clusters == mb$Clusters)
[1] TRUE
> all(km == cls)
[1] TRUE

In Bioc 3.13, I get:

> all(mbk$Clusters == cls)
[1] TRUE
> all(mbk$Clusters == mb$Clusters)
[1] TRUE
> all(km == cls)
[1] FALSE

My suspicion is that somehow mbkmeans::MbkmeansParam is actually running kmeans rather than mbkmeans... since mbkmeans::MbkmeansParam "contains" bluster::KmeansParam it seems plausible that somehow bluster's rowCluster function does not recognize the MbkmeansParam class and reverts back to KmeansParam.

Needs further investigation.

drisso commented 2 years ago

I think this may indeed be the case since in Bioc 3.12 using mbkmeans::clusterRows instead of bluster::clusterRows fixes the issue.

So overall I think that everything works as expected in Bioc >=3.13. There is a bug in 3.12, if one uses bluster::clusterRows in conjunction with mbkeams::MbkmeansParam.

I'm not sure that it's worth pushing a bug fix in Bioc 3.12 at this point, but if we want to I'm still not sure how to fix it.

csoneson commented 2 years ago

@drisso Thanks for looking into this! I think your explanation is plausible (and I also don't think there is a way to fix anything in Bioc 3.12 at this point). For the record, I can also confirm that with Bioc 3.14 I can get identical results with kmeans and mbkmeans if I explicitly set the initial centroids for both, run mbkmeans with a single mini-batch, change the algorithm in kmeans to "Lloyd" and increase the default number of iterations there (I don't think I can specify the initial centroids in the bluster::*Param() functions, wasn't sure if there's a way to make the results identical there):

set.seed(123)
x <- matrix(rnorm(1000 * 50), nrow = 1000)
v <- do.call(rbind, lapply(1:50, function(i) {
    centroids <- x[sample(1:nrow(x), 10), ]
    cls <- mbkmeans::mbkmeans(t(x), clusters = 10, batch_size = nrow(x), 
                              CENTROIDS = centroids)
    cls2 <- stats::kmeans(x, centers = centroids, 
                          algorithm = "Lloyd", iter.max = 100)
    data.frame(min.size.mbkmeans = min(table(cls$Clusters)),
               min.size.kmeans = min(table(cls2$cluster)),
               ARI = mclust::adjustedRandIndex(cls$Clusters, cls2$cluster))
}))

> summary(v$min.size.mbkmeans)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1.00   42.25   61.00   53.98   70.00   83.00
> summary(v$min.size.kmeans)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1.00   42.25   61.00   53.98   70.00   83.00
> summary(v$ARI)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      1       1       1       1       1       1