memory issue for large dataset

Minhui-Chen commented 4 months ago

Hi,

Nice paper and thanks for sharing the scripts!

I am trying to use your code to transform (acosh, pearson, dino, ...) a large dataset (~1M cells), but I get an out-of-memory error even with 1 TB of memory. Do you have any suggestions for how to deal with this memory issue? Thanks a lot for your help!

const-ae commented 4 months ago

Hi Minhui,

I would expect that some methods have no problem handling such a large dataset (e.g., log, acosh). In contrast, others perform much more demanding computations and will need larger amounts of memory (for example, Sanity Distance computes all pairwise distances and thus has quadratic memory complexity). Below is the complexity of the runtime (Fig. 3b), and I expect somewhat similar patterns for the memory (even though I did not measure them):

[Figure: runtime complexity of the transformations (Fig. 3b)]

Do you need to run all transformations? If yes, I would suggest that you subsample your cells to 10 000, 100 000, and 300 000 cells and monitor the peak memory (for details, see the Memory chapter of the Advanced R book). This should allow you to identify the culprit and extrapolate to what you should expect for your 1 million cells.
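A minimal sketch of the subsample-and-monitor idea, using only base R. The toy count matrix and the `transform_fn` stand-in are assumptions for illustration; swap in your own data and the transformation you want to profile. The peak is read from the "max used" statistics of `gc()`, so it includes objects that are already in memory; compare the values across subsample sizes rather than reading them as absolute costs.

```r
# Peak memory (in MB) reported by gc() while evaluating `expr`.
peak_mem_mb <- function(expr) {
  gc(reset = TRUE)   # reset the "max used" statistics
  force(expr)        # the argument is evaluated lazily, i.e. only here
  mem <- gc()
  # The "(Mb)" column sits right after the "max used" column
  mb_col <- which(colnames(mem) == "max used") + 1
  sum(mem[, mb_col])
}

set.seed(1)
# Toy stand-in for a genes x cells count matrix (assumption)
counts <- matrix(rpois(200 * 3000, lambda = 0.5), nrow = 200)

# Hypothetical stand-in for the transformation under test
transform_fn <- function(x) log1p(x)

sizes <- c(500, 1000, 3000)
peak <- sapply(sizes, function(n) {
  idx <- sample(ncol(counts), n)   # subsample cells (columns)
  peak_mem_mb(transform_fn(counts[, idx, drop = FALSE]))
})
print(data.frame(n_cells = sizes, peak_mb = peak))
```

Plotting `peak_mb` against `n_cells` should indicate whether a method scales roughly linearly or quadratically before you commit to the full dataset.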

Minhui-Chen commented 4 months ago

Thanks, Constantin!

As you expected, I don't have any problem with log transformation. But, weirdly, acosh crashed because of out-of-memory. Do you have any insight specifically on acosh? No worries if not, I can try your subsample idea. (Thanks for the figure, I like it)

I don't need to run all transformations, but ideally want to try at least one representative method for each of the four transformation approaches. I wonder if you have any idea to circumvent the memory issue, like splitting the large dataset into multiple smaller ones and doing transformations for each of them.

const-ae commented 4 months ago

> As you expected, I don't have any problem with log transformation. But, weirdly, acosh crashed because of out-of-memory. Do you have any insight specifically on acosh?

That is surprising to me, because the acosh and the shifted_log are implemented very similarly. Can you load the data in an interactive session, call debugonce(acosh_transform), call the transformation, and then step through the function and tell me where exactly it crashes?
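For readers unfamiliar with `debugonce`, here is a small sketch of the workflow on a toy function. `toy_transform` is hypothetical and only stands in for `acosh_transform`; the call that opens the browser is left commented out because it needs an interactive session.

```r
# Hypothetical toy function standing in for acosh_transform
toy_transform <- function(x, scale = 1) {
  y <- x * scale
  acosh(2 * y + 1)
}

# Flag the function: the next call (and only that one) opens the browser
debugonce(toy_transform)

# In an interactive session, run:
#   toy_transform(c(0, 1, 2))
# At the Browse[2]> prompt, type `n` (or press Enter) to execute the next
# line, `ls()` to list the local variables, and `Q` to quit the browser.
```

The last `debug:` line printed before a crash is the statement that was running when the process died.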

> I wonder if you have any idea to circumvent the memory issue, like splitting the large dataset into multiple smaller ones and doing transformations for each of them.

You are probably already aware, but some tools handle large data better if it is stored as a sparse dgCMatrix.
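As a small illustration of the sparse-storage point, the sketch below converts a dense toy matrix into a sparse one with the Matrix package and compares the object sizes. The matrix dimensions and the Poisson rate are arbitrary assumptions chosen to mimic mostly-zero count data.

```r
library(Matrix)

set.seed(1)
# Toy count matrix with mostly zeros (assumption: ~5% non-zero entries)
dense <- matrix(rpois(1000 * 500, lambda = 0.05), nrow = 1000)

# Compressed sparse column storage; only non-zero entries are kept
sparse <- Matrix(dense, sparse = TRUE)

print(class(sparse))
print(object.size(dense), units = "Kb")
print(object.size(sparse), units = "Kb")
```

The saving only holds as long as every downstream step keeps the matrix sparse; a single step that converts it to dense brings back the full memory cost.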

Your idea to split the data, do the transformations separately, and then merge the results could work for some of the methods, but might require a lot of care to not accidentally introduce bias into the results.
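One way such a bias can arise is sketched below in base R. This is an illustration, not the transformGamPoi implementation: it assumes the acosh transformation has the form acosh(2 * alpha * y / s + 1) / sqrt(alpha) with a fixed overdispersion `alpha`, and it assumes size factors `s` defined as column sums scaled to mean 1. With size factors computed once on the full data, the chunked result matches the full result; with size factors recomputed inside each chunk, it does not.

```r
# Illustrative acosh transformation with fixed overdispersion `alpha`
# (assumed form; NOT the transformGamPoi implementation)
acosh_fixed <- function(counts, size_factors, alpha = 0.05) {
  scaled <- sweep(counts, 2, size_factors, "/")  # divide each cell by its size factor
  acosh(2 * alpha * scaled + 1) / sqrt(alpha)
}

set.seed(1)
counts <- matrix(rpois(50 * 40, lambda = 2), nrow = 50)  # toy genes x cells

# Assumed size-factor definition: column sums scaled to mean 1 over ALL cells
sf_global <- colSums(counts) / mean(colSums(counts))
full <- acosh_fixed(counts, sf_global)

# Split the 40 cells into 4 chunks of 10 columns each
chunks <- unname(split(seq_len(ncol(counts)), rep(1:4, each = 10)))

# (a) Chunked transformation that reuses the global size factors
chunked_global <- do.call(cbind, lapply(chunks, function(i) {
  acosh_fixed(counts[, i, drop = FALSE], sf_global[i])
}))

# (b) Chunked transformation that recomputes size factors within each chunk
chunked_local <- do.call(cbind, lapply(chunks, function(i) {
  sub <- counts[, i, drop = FALSE]
  acosh_fixed(sub, colSums(sub) / mean(colSums(sub)))
}))

print(isTRUE(all.equal(full, chunked_global)))  # does (a) reproduce the full result?
print(isTRUE(all.equal(full, chunked_local)))   # does (b) reproduce the full result?
```

Methods that estimate parameters from the data (for example, a per-gene overdispersion) would need those estimates shared across chunks in the same way.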

Alternatively, you could also look into the DelayedArray data structures. These allow you to store the data on disk and only load small chunks into memory. The transformGamPoi package supports DelayedArrays in the residual_transform, shifted_log_transform and acosh_transform.
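A hedged sketch of the on-disk route, assuming the HDF5Array and transformGamPoi packages are installed (the block is skipped otherwise). The toy matrix, the temporary file, and the dataset name `"counts"` are assumptions for illustration; with real data you would point to your own HDF5 file.

```r
if (requireNamespace("HDF5Array", quietly = TRUE) &&
    requireNamespace("transformGamPoi", quietly = TRUE)) {
  set.seed(1)
  # Toy genes x cells count matrix (assumption)
  counts <- matrix(rpois(100 * 50, lambda = 1), nrow = 100)

  # Write the matrix to a temporary HDF5 file; the returned object is
  # disk-backed and loads chunks into memory only when needed
  h5_file <- tempfile(fileext = ".h5")
  counts_on_disk <- HDF5Array::writeHDF5Array(counts, filepath = h5_file,
                                              name = "counts")

  # Fixed overdispersion, so no GLM fit is required
  res <- transformGamPoi::acosh_transform(counts_on_disk, overdispersion = 0.05)
  print(class(res))
  print(dim(res))
}
```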

Minhui-Chen commented 4 months ago

(Thanks! I am not an R person; give me some time to try and I will update you.)

Minhui-Chen commented 4 months ago

I have tried the acosh transformation with debugonce. It stopped at

    Browse[2]>
    debug: if (HDF5Array::is_sparse(counts)) {
        counts <- .handle_data_parameter(data, on_disk, allow_sparse = FALSE)
    }
    Browse[2]>
    debug: dots <- list(...)
    Browse[2]>
    debug: overdispersion_shrinkage <- if ("overdispersion_shrinkage" %in% names(dots)) {
        dots[["overdispersion_shrinkage"]]
    } else {
        TRUE
    }
    Browse[2]>
    debug: [1] TRUE
    Browse[2]>
    debug: fit <- glmGamPoi::glm_gp(counts, design = ~1, size_factors = size_factors,
        overdispersion = overdispersion, overdispersion_shrinkage = overdispersion_shrinkage,
        verbose = verbose)
    Browse[2]>
    Killed

Hope this can be helpful to you.

const-ae commented 4 months ago

Hi, this is helpful (I think :))

You seem to be calling acosh_transform with overdispersion = TRUE. This means that transformGamPoi tries to estimate the overdispersion by calling glmGamPoi, which fits a GLM (and probably tries to convert your sparse matrix to a dense one). If you keep the default of overdispersion = 0.05, you shouldn't run into problems with acosh_transform.
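The difference between the two calls can be sketched as follows, assuming transformGamPoi is installed (the block is skipped otherwise). The toy matrix is an assumption; the estimating call is left commented out because it is the expensive one on large data.

```r
if (requireNamespace("transformGamPoi", quietly = TRUE)) {
  set.seed(1)
  # Toy genes x cells count matrix (assumption)
  counts <- matrix(rpois(100 * 50, lambda = 1), nrow = 100)

  # Fixed overdispersion (the default value): no model fit is needed
  res_fixed <- transformGamPoi::acosh_transform(counts, overdispersion = 0.05)
  print(dim(res_fixed))

  # Estimating the overdispersion fits a GLM via glmGamPoi::glm_gp first,
  # which is where the session in this thread was killed:
  # res_est <- transformGamPoi::acosh_transform(counts, overdispersion = TRUE)
}
```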

Minhui-Chen commented 4 months ago

You are right (of course)! Setting alpha to 0.05 solves the memory problem with acosh. Thanks!

const-ae / transformGamPoi-Paper

memory issue for large dataset #9