const-ae / lemur

Latent Embedding Multivariate Regression
https://www.bioconductor.org/packages/lemur/

Memory footprint #12

Closed joschif closed 6 months ago

joschif commented 6 months ago

Hi Constantin,

First of all: very cool work! I really like the idea and the implementation, and so far it has been working well for us.

The only thing that has made using the package a bit difficult in practice is the memory footprint and the general size of the computed results. We are running this on perturbation screen datasets with >200k cells and some tens of conditions. In these cases, the (dense) DE matrix and the neighborhood DE data frame together exceed 70 GB per condition, which puts us into multi-TB territory for all conditions of one experiment. It would be a lot easier to run the package at scale if there were a way to circumvent or reduce the required memory and storage.
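
For a rough sense of scale (assuming ~30k genes, which is purely an illustrative number), the dense double-precision DE matrix alone accounts for most of that:

# Back-of-the-envelope size of a dense double-precision DE matrix;
# the gene count is an assumption, not our exact number
n_cells <- 200e3
n_genes <- 30e3
n_cells * n_genes * 8 / 1e9
#> [1] 48
# i.e. ~48 GB per condition before the neighborhood data frame is added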

It would be great to get your general advice on how to run lemur in these scenarios.

As a small suggestion, the neighborhood column in the DE result data frame could store cell indices rather than names; this alone cut down its size by 10-20x for us. Also, do you think there is a way to 'sparsify' the DE matrix somehow? (In many cases, most genes show no actual DE in most cells anyway.)

Thanks and cheers, Jonas

const-ae commented 6 months ago

Hey,

Thanks for the kind feedback, always happy to hear :)

The only thing that has made using the package a bit difficult in practice is the memory footprint and the general size of the computed results

I explored improving the memory efficiency by keeping the input data sparse and avoiding steps that make the data dense (https://github.com/const-ae/lemur/compare/devel...improve_memory_efficiency). The biggest blocker to merging this work is that rsvd::rpca can return surprisingly imprecise results and I felt unsure how much of this is tolerable. I am sure that the problem is manageable if one chooses the hyperparameters appropriately, but as I couldn't find any guidance on this question, this requires some careful benchmarking, for which I haven't had time yet.
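
To give an idea of the kind of benchmarking I mean, here is a minimal sketch (the toy matrix, the rank k, and the p/q settings are all arbitrary; p = 10 oversamples and q = 2 power iterations are rsvd's defaults):

set.seed(1)
# A small dense toy matrix standing in for log-transformed counts
mat <- matrix(rnorm(2000 * 100), nrow = 2000, ncol = 100)

exact  <- prcomp(mat, rank. = 10)
approx <- rsvd::rpca(mat, k = 10, scale = FALSE, p = 10, q = 2)

# How far the approximate component standard deviations are from the exact
# ones; this discrepancy shrinks as p and q are increased
max(abs(exact$sdev[1:10] - approx$sdev[1:10]))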

Also, do you think there is a way to 'sparsify' the DE matrix somehow? (in many cases most genes anyway have no actual DE in most cells)

Hmm, interesting suggestion. I am not sure there is an easy way to implement that, as the DE matrix comes from subtracting two calls to predict. The only way I see would be to set small values to 0 post hoc, but you would still need to instantiate the large dense DE matrix first.
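
If losing the small effects is acceptable, the post hoc step itself would look roughly like this (a sketch; `de_mat` stands for the dense DE matrix, however you extract it from the fit, and the cutoff is arbitrary):

library(Matrix)

# Post hoc sparsification: the dense DE matrix still has to exist in memory
# once, but the object you keep and save on disk can be much smaller.
# The cutoff of 0.1 is arbitrary and would need to be chosen per dataset.
sparsify_de <- function(de_mat, cutoff = 0.1) {
  de_mat[abs(de_mat) < cutoff] <- 0
  as(de_mat, "CsparseMatrix")
}

So this would mostly be a storage optimization, not a reduction of peak memory.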

As a small suggestion, the neighborhood column in the DE result dataframe could store cell indices rather than names, this already cut down it's size by 10-20x for us

I am surprised this makes such a big difference because R is clever about storing strings in vectors. Even though the strings are fairly long, the size of the character vector is only twice the size of an int vector. (For more details see the chapter on string pools in Hadley's Advanced R book.)

# Ten unique random "cell names", each 200 characters long
make_random_word <- function(length) paste0(sample(letters, size = length, replace = TRUE), collapse = "")
words <- replicate(n = 10, make_random_word(length = 200))

# A vector of one million integer indices into those names
number_vec <- sample.int(10, size = 1e6, replace = TRUE)
pryr::object_size(number_vec)
#> 4.00 MB

# The corresponding character vector stores one pointer into R's global
# string pool per element, so it is only twice as large
string_vec <- words[number_vec]
pryr::object_size(string_vec)
#> 8.00 MB

Created on 2024-03-12 with reprex v2.1.0

But if it really makes such a big difference for you, you can make find_de_neighborhoods return the neighborhood column with integer indices by removing the cell names from the fit (colnames(fit) <- NULL).
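
In code that would be (the group_by variables are placeholders for your own design):

# Without cell names on the fit, the neighborhood column falls back to
# integer indices instead of character vectors of cell names
colnames(fit) <- NULL
nei <- find_de_neighborhoods(fit, group_by = vars(sample_id, condition))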


As I am not sure how quickly I will get around to finishing the work on a LEMUR version that respects the input sparsity, and as you will get a dense DE matrix at some point anyway, I see a few options:

  1. You mention that you are running LEMUR on tens of conditions simultaneously. Instead, consider running LEMUR multiple times with only the reference condition and one perturbation. This means you won't need to carry around the DE predictions for cells that come from unrelated perturbations. The downside is, of course, that you have less power to identify the relevant latent space (but it won't make a difference for the pseudo-bulked DE test!).
  2. Subsampling the cells can be very effective at cutting down memory.
  3. If you can subset to the genes that are most likely to be affected (e.g., genes expressed highly enough that there is even a chance to detect DE), you can also save space. A minimal sketch of points 2 and 3 follows below.
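
For points 2 and 3, I mean something as simple as this (a sketch; `sce` is your SingleCellExperiment, `condition` stands for the relevant covariate, and the 25% cell fraction and the 5% expression cutoff are arbitrary):

library(SingleCellExperiment)
set.seed(1)

# Point 2: a random subsample of the cells
keep_cells <- sample(ncol(sce), size = round(0.25 * ncol(sce)))

# Point 3: keep only genes expressed in enough cells to plausibly show DE
keep_genes <- rowSums(counts(sce) > 0) >= 0.05 * ncol(sce)

sce_small <- sce[keep_genes, keep_cells]
fit <- lemur::lemur(sce_small, design = ~ condition, n_embedding = 15)
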
joschif commented 6 months ago

Thanks a bunch for the suggestions, that's quite helpful!

It's great that you are working on improving the memory efficiency; I'll keep an eye out for updates there. Reducing peak memory usage would definitely be super helpful, but for me the larger issue was actually storing the cumulative results and the time required to read them back in :D. But this can be worked around, so I will probably just resort to "minifying" the output myself.

Closing this now, but keep me posted on updates ;)