Bioconductor / DelayedArray

A unified framework for working transparently with on-disk and in-memory array-like datasets
https://bioconductor.org/packages/DelayedArray
24 stars 9 forks source link

Alternative to assigning attributes in blockApply? #69

Closed LTLA closed 3 years ago

LTLA commented 3 years ago

It seems that the assignment of attributes within the parallelized function of blockApply() triggers a copy to the data structure. I normally wouldn't worry about this, but because it happens inside the parallel section, the copy happens within each worker and this really adds up. For example, if I had a block size of 500 MB and 20 workers, the extra copy would use up an extra 10 GB of memory across all workers.

One solution might be to have blockApply accept an extra argument, say, passGrid=. If this is set to 0, then no attributes are set. If this is set to 1, grid information is stored as attributes in the object as they currently are; this can be the default to preserve backwards compatibility. If this is set to 2, then grid information is explicitly passed as additional arguments to FUN (which would also spare me from having to pull them out of the attributes manually). 0 and 2 would give developers the capability to skip a copy.

For some more context: the cgroups setup on our cluster is killing a lot of our BiocParallel-enabled jobs due to out-of-memory issues. This is most pernicious in the parallel sections where relatively modest inefficiencies are effectively multiplied by the number of workers. Of course, DelayedArray is not the biggest offender here - that honor goes to the forking process, which seems to copy everything in the entire session.

hpages commented 3 years ago

You mean this happens with sparse blocks right?

library(XVector)
a <- array(1:60, 5:3)
XVector:::address(a)
# [1] "0x8e38400"
attr(a, "foo") <- "bar"
XVector:::address(a)
# [1] "0x8e38400"

library(DelayedArray)
sas <- SparseArraySeed(5:3)
XVector:::address(sas)
# [1] "0xf1e36b8"
attr(sas, "foo") <- "bar"
XVector:::address(sas)
# [1] "0xf272580"             # oops!

Darn! I never really liked sticking those attributes on the blocks. This is just how I expedited the problem with the very early versions of blockApply() when blocks were guaranteed to be ordinary arrays. These were the good ol' times! But now I like it even less. I kind of vaguely remember reading somewhere that sticking attributes on S4 objects was no good practice but I didn't really know why. Now I know.

Anyway, I got rid of them: https://github.com/Bioconductor/DelayedArray/commit/a7568941c6a17e51a604c9199bada23b8edc624b

LTLA commented 3 years ago

Thanks Herve. I didn't even know about the effectiveGrid family of functions... better go back through my newly beachmat v3'd code to put that in.

I was pretty sure that the attributes were triggering a copy even for an ordinary matrix. To be more precise, it was triggering an allocation somewhere, of magnitude proportional to what I set in the setAutoBlockSize() - I came to this conclusion by hammering ps on the command line while my R session was running. Not sure what's happening there if the addresses aren't changing, but in any case, your commit fixes it.