drisso / mbkmeans

K-means clustering for large single-cell datasets
MIT License

mbkmeans-hdf5 could not handle large (1 million cells) data at 80% batch size #13

Closed: ruoxi430 closed this issue 4 years ago

ruoxi430 commented 5 years ago

In my benchmark tests I included batch size = 80% (an extreme case) as a parameter passed to mini_batch(), for both the mbkmeans and mbkmeans-hdf5 methods. When the data size increased to 1,000,000 cells and 1000 genes, the mbkmeans-hdf5 method threw an error:

```
Error in mbkmeans::mini_batch(sim_data_hdf5, clusters = 3, batch_size = 0.8 * :
  std::bad_alloc
Execution halted
```

At batch size = 80%, the regular (in-memory) mbkmeans method runs on the 1,000,000 cells x 1000 genes data without error. I also tested the mbkmeans-hdf5 method with 500,000 cells and 1000 genes at batch size = 80%, and it did not throw an error either.
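For scale only (this is a back-of-the-envelope assumption about where a large allocation could come from, not a diagnosis of the bug): if an 80% mini-batch of the 1,000,000 x 1000 matrix were realized as a single dense double-precision block in memory, that block alone would be about 6.4 GB:

```r
## Rough size of one 80% batch, assuming it is fully materialized
## in RAM as a dense double matrix (8 bytes per value)
n_cells <- 1e6
n_genes <- 1000
batch_frac <- 0.8
batch_bytes <- batch_frac * n_cells * n_genes * 8
batch_bytes / 1024^3   # about 5.96 GiB (~6.4 GB)
```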

I prepared an R script that reproduces the error: https://github.com/stephaniehicks/benchmark-hdf5-clustering/blob/Ruoxi/ongoing_analysis/hdf5_80per_batch.R (note: about 20 GB of RAM is needed to run the script). Please let me know if more information is needed! A minimal sketch of the setup follows.
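The sketch below is an assumption-laden reduction of the setup, not the original benchmark script (the linked hdf5_80per_batch.R is authoritative). It assumes simulated Gaussian data, the object names sim_data and sim_data_hdf5, cells in rows for mini_batch(), batch_size given as a count of cells, and writeHDF5Array() to create the HDF5-backed copy.

```r
library(mbkmeans)
library(HDF5Array)

n_cells <- 1e6    # 1,000,000 cells
n_genes <- 1000   # 1000 genes

set.seed(1234)
## Dense simulated matrix, cells in rows, genes in columns
## (roughly 8 GB; the full script reportedly needs ~20 GB of RAM)
sim_data <- matrix(rnorm(n_cells * n_genes), nrow = n_cells, ncol = n_genes)

## In-memory call: reported to complete at an 80% batch size
res_mem <- mini_batch(sim_data, clusters = 3,
                      batch_size = as.integer(0.8 * n_cells),
                      max_iters = 100)

## HDF5-backed call: the one reported to fail with std::bad_alloc
sim_data_hdf5 <- writeHDF5Array(sim_data)
res_hdf5 <- mini_batch(sim_data_hdf5, clusters = 3,
                       batch_size = as.integer(0.8 * n_cells),
                       max_iters = 100)
```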

drisso commented 4 years ago

Is this resolved?

stephaniehicks commented 4 years ago

I'm not sure, but maybe we close it for now?