Open drisso opened 2 years ago
I should note that in all Bioc versions >= 3.13 the results are consistent between bluster::MbkmeansParam()
and mbkmeans::MbkmeansParam()
and equal to the second output above, which is inconsistent with kmeans, but I don't think that there are any guarantees that it should.
Quick update.
I can reproduce the problem with Bioc docker images.
If I run the following code:
# mbkmeans
set.seed(123)
cls <- bluster::clusterRows(x, BLUSPARAM = mbkmeans::MbkmeansParam(centers = 10))
set.seed(123)
mbk <- mbkmeans(t(x), clusters=10)
set.seed(123)
mb <- mini_batch(x, clusters=10, batch_size = 500, max_iters=100, init_fraction = 500/nrow(x))
set.seed(123)
km <- bluster::clusterRows(x, BLUSPARAM = bluster::KmeansParam(centers = 10))
In Bioc 3.12, I get:
> all(mbk$Clusters == cls)
[1] FALSE
> all(mbk$Clusters == mb$Clusters)
[1] TRUE
> all(km == cls)
[1] TRUE
In Bioc 3.13, I get:
> all(mbk$Clusters == cls)
[1] TRUE
> all(mbk$Clusters == mb$Clusters)
[1] TRUE
> all(km == cls)
[1] FALSE
My suspicion is that somehow mbkmeans::MbkmeansParam
is actually running kmeans rather than mbkmeans...
since mbkmeans::MbkmeansParam
"contains" bluster::KmeansParam
it seems plausible that somehow bluster
's rowCluster
function does not recognize the MbkmeansParam
class and reverts back to KmeansParam
.
Needs further investigation.
I think this may indeed be the case since in Bioc 3.12 using mbkmeans::clusterRows
instead of bluster::clusterRows
fixes the issue.
So overall I think that everything works as expected in Bioc >=3.13. There is a bug in 3.12, if one uses bluster::clusterRows
in conjunction with mbkeams::MbkmeansParam
.
I'm not sure that it's worth pushing a bug fix in Bioc 3.12 at this point, but if we want to I'm still not sure how to fix it.
@drisso Thanks for looking into this! I think your explanation is plausible (and I also don't think there is a way to fix anything in Bioc 3.12 at this point).
For the record, I can also confirm that with Bioc 3.14 I can get identical results with kmeans
and mbkmeans
if I explicitly set the initial centroids for both, run mbkmeans
with a single mini-batch, change the algorithm in kmeans
to "Lloyd"
and increase the default number of iterations there (I don't think I can specify the initial centroids in the bluster::*Param()
functions, wasn't sure if there's a way to make the results identical there):
set.seed(123)
x <- matrix(rnorm(1000 * 50), nrow = 1000)
v <- do.call(rbind, lapply(1:50, function(i) {
centroids <- x[sample(1:nrow(x), 10), ]
cls <- mbkmeans::mbkmeans(t(x), clusters = 10, batch_size = nrow(x),
CENTROIDS = centroids)
cls2 <- stats::kmeans(x, centers = centroids,
algorithm = "Lloyd", iter.max = 100)
data.frame(min.size.mbkmeans = min(table(cls$Clusters)),
min.size.kmeans = min(table(cls2$cluster)),
ARI = mclust::adjustedRandIndex(cls$Clusters, cls2$cluster))
}))
> summary(v$min.size.mbkmeans)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.00 42.25 61.00 53.98 70.00 83.00
> summary(v$min.size.kmeans)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.00 42.25 61.00 53.98 70.00 83.00
> summary(v$ARI)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1 1 1 1 1 1
This was reported by @csoneson on the Bioc Slack. Copying it here.
Initial example:
Gives:
in Bioc 3.12 and
in Bioc 3.13. The results of 3.12 coincide with
bluster::KmeansParam()
(which gives consistent results across versions), but the results of 3.13 coincide withmbkmeans::mbkmeans()
(which gives consistent results across versions).There might be something going on with
mbkmeans::MbkmeansParam