When running the benchmark tests, I included batch size = 80% (an extreme case) as a parameter passed to `mini_batch()` for both the mbkmeans and mbkmeans-hdf5 methods. When the data size increased to 1,000,000 cells and 1,000 genes, I got an error from the mbkmeans-hdf5 method:
At batch size = 80%, the regular mbkmeans method ran on the 1,000,000 cells by 1,000 genes data without throwing an error. I also tested the mbkmeans-hdf5 method with 500,000 cells and 1,000 genes at batch size = 80%, and it did not throw an error either.
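For context on how large an 80% batch is in the two cases above, here is a rough back-of-the-envelope sketch. It assumes each batch is held in memory as a dense matrix of 8-byte doubles, which is my assumption and not something confirmed about mbkmeans internals. The helper name `batch_footprint_gb` is hypothetical, and the sketch does not diagnose the error; it only shows the arithmetic.

```python
# Assumption: a batch is stored as a dense matrix of 8-byte doubles.
BYTES_PER_DOUBLE = 8


def batch_footprint_gb(n_cells, n_genes, batch_fraction):
    """Return (cells per batch, approximate size in GB) for one dense batch."""
    batch_cells = round(n_cells * batch_fraction)
    size_gb = batch_cells * n_genes * BYTES_PER_DOUBLE / 1e9
    return batch_cells, size_gb


# The two data sizes mentioned above, both at batch size = 80%.
for n_cells in (500_000, 1_000_000):
    batch_cells, size_gb = batch_footprint_gb(n_cells, 1000, 0.8)
    print(f"{n_cells:,} cells: batch of {batch_cells:,} cells, about {size_gb:.1f} GB")
```

Under that assumption, a single batch is roughly 3.2 GB for the 500,000-cell data and 6.4 GB for the 1,000,000-cell data.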
I prepared an R script that reproduces the error: https://github.com/stephaniehicks/benchmark-hdf5-clustering/blob/Ruoxi/ongoing_analysis/hdf5_80per_batch.R (note: 20 GB of RAM is needed to run the script). Please let me know if more information is needed!