DeppLearning opened 6 years ago
There might be some way to use address(DT)? And/or attr(DT, '.internal.selfref')?
Thank you for the suggestion. I made similar attempts, but after reading up I believe you simply can't do this directly, since these are externalptr objects, which are of no use outside their originating R session.
File-backed data.tables (#1336) might help (?), though I imagine they would have to be locked for editing/writing (like .SD is) under some conditions.
There is only one way to share memory across processes: one of the processes has to allocate a new shared memory region using mmap and give it a unique id. This id then has to be communicated to the other process through traditional means (such as pipes). The other process can then mmap the same memory region via the id that it received. If you think of these ids as file names, then the process is exactly equivalent to one process creating a temporary file and the other process reading from that file, except that the file will be located in memory. Actually, it doesn't even have to be -- you can open and share a regular file instead.
Now, the difficult part is to have a data.table object living in a file. This is something that requires intricate knowledge of R's internal API. How do you place a single R vector at a given memory address? How do you ensure that R will not attempt to resize or reclaim this vector? What do you do with string columns, where each string is a reference into the global string cache (and the global string cache cannot be shared)?
These are all quite hard questions, and I don't know the answers. However, it is possible in principle. At least Python datatable solved exactly this problem successfully: a Frame can be saved into a .jay file in one process, then memory-map opened in any number of other processes, and copy-on-write semantics ensure that no changes can be accidentally leaked.
Moreover, how could process B know whether the memory is still valid (rather than garbage), since process A may already have been killed?
@st-pasha and @shrektan, the bigmemory R package that @weltherrschaf references solves this problem for matrix objects (i.e. vectors with dimensions attached), so it can be done. Presumably the bigmemory internals could be extended to store multiple vectors (because a data.table is a data.frame, and a data.frame is really a list() of vectors).
There is one caveat to be aware of with this type of file-backed shared-memory object: some (all?) HPC clusters with multiple nodes hate these. If you create a file-backed shared-memory object and try to access it from multiple R sessions, the object essentially locks up those processes due to the consistency checks made by the HPC filesystem (because those processes might be spread over multiple nodes, even if you explicitly ask for the same node). This is something I had the (dis)pleasure of learning when trying to publish my NetRep R package during my PhD: after the paper was accepted and passed software review, I discovered this problem and ended up having to rip out the internals and quickly learn C++ so I could parallelise the code (by casting to a C++ Armadillo matrix and writing multithreaded code that operated on those C++ objects in shared memory).
Actually, we could probably just do something simple with the bigmemory package: write a function that converts each column to a big.matrix object, and another that loads those matrix-columns and wraps them in a data.table in your new R session. I might play with this over the weekend.
I used bigmemory for a while and it's not ideal. Attaching a big.matrix can take quite a while. Additionally, the package tends to accumulate temp files that you might have to clean up yourself once in a while.
I think I'd prefer something like feather (https://github.com/wesm/feather), which apparently uses dplyr, is file-backed via the Apache Arrow format (https://arrow.apache.org/), and can be used directly from Python and a bunch of other languages. feather with data.table instead of dplyr would be great.
Since you mentioned feather, you may want to have a look at the fst package (https://github.com/fstpackage/fst) if you haven't already. The roadmap is promising: https://github.com/fstpackage/fst/issues/117
fst is great; I used it in my last project and never had any issues with it. I didn't know about their roadmap though. Feather/Apache Arrow is interesting due to its promise of sharing data by reference within quite a rich ecosystem of languages and services.
It is my understanding that both fst and feather provide fast serialization/deserialization to/from file, but do not allow a data.frame to actually exist in a file. This may be a reasonable alternative, although it is not true sharing of the data.
At the same time, arrow attempts to implement a true file-backed data frame; however, this will not be an R data.frame (even for primitive types such as integer or numeric, Arrow's format is different from R's). As such, they'd probably need to re-implement all data.frame functionality from scratch...
Follow here for the R implementation of arrow.
@st-pasha
How do you place a single R vector at a given memory address?
The C API offers allocVector3() for this purpose.
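For context, a rough, untested C sketch of how allocVector3() could back an R vector with file-mapped memory. The allocator struct is from R_ext/Rallocators.h (available since R 3.1.0); the path handling, error checks, and deallocation bookkeeping are simplified assumptions, not a working implementation:

```c
/* Sketch only: compile against R headers; not standalone. */
#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>
#include <Rinternals.h>
#include <R_ext/Rallocators.h>

static void *mmap_alloc(R_allocator_t *allocator, size_t size) {
    /* allocator->data carries the backing file path (must outlive the vector) */
    int fd = open((const char *) allocator->data, O_RDWR | O_CREAT, 0600);
    ftruncate(fd, (off_t) size);   /* size includes R's vector header */
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);
    return p == MAP_FAILED ? NULL : p;
}

static void mmap_free(R_allocator_t *allocator, void *ptr) {
    /* a real implementation must track the mapped size to munmap() here */
    (void) allocator; (void) ptr;
}

/* Allocate a numeric vector whose storage lives in a memory-mapped file.
   R calls mmap_free when the vector is garbage collected. */
SEXP alloc_mapped_numeric(R_xlen_t n, const char *path) {
    R_allocator_t alloc = { mmap_alloc, mmap_free, NULL, (void *) path };
    return allocVector3(REALSXP, n, &alloc);
}
```

Note this only answers the "place a vector at a given address" question; the string-cache and cross-process validity problems raised above remain.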
How do you ensure that R will not attempt to resize or reclaim this vector?
Do you feel this is a problem? Properly PROTECTed objects shouldn't be gc'd or anything, I don't think. Do you have a specific problem in mind here? Or is this more a FUD-type statement?
What to do with string columns ... ?
This might be more of a headache. Maybe ALTREPs? Maybe this also yields speedups for other areas, such as sort(), unique(), and is.na() for character vectors?
I'm mostly name-checking here. Does anyone with intimate familiarity with the data.table internals have an opinion on the feasibility of shared-memory data.tables?
@nbenn Thanks, this hits the mark. If R has a custom memory allocator mechanism, then it will certainly know to call the user-provided custom de-allocator when the time comes.
@sritchie73 can you shed more light on your experiences with file-backed shared-memory objects in HPC environments? I gather you had problems with file-backed bigmemory::big.matrix objects, not with shared memory in general? I just did a quick test (small scale: 1-3 cores) with non-file-backed bigmemory::big.matrix objects, and the HPC results (LSF-managed cluster, CentOS 7, BL460c Gen9 nodes) are consistent with results obtained locally on a MacBook Pro.
I would not expect the file system to interfere with management of shared memory. Furthermore, for example when applying a function to a data.table in parallel group-by fashion, locking mechanisms are not necessary at all, as it is guaranteed that writes are never to the same location.
Furthermore, for example when applying a function to a data.table in parallel group-by fashion, locking mechanisms are not necessary at all, as it is guaranteed that writes are never to the same location.
I guess this depends on i being empty or having no dupes, since you can do these group-by operations with overlapping groups:
library(data.table)
DT = data.table(id = 1:3)
mDT = data.table(id = c(1L, 2L, 2L, 3L), g = rep(1:2, each=2))
# writes to row 2 twice in a join
DT[mDT, on=.(id), g := i.g, by=.EACHI]
# writes to row 2 twice with row number subset
DT[mDT$id, g := .BY[[1]], by=.(mDT$g)]
@nbenn digging into my old emails, the file system was GPFS; it was something to do with a conflict between the way the Boost headers used by bigmemory implement the file-backed shared memory objects (docs here) and the way GPFS handles calls to mmap(). See also this thread on the R-sig-hpc mailing list: first email here, remaining emails in this thread - note the replies from Jay Emerson, one of the authors and maintainers of the bigmemory package.
From my limited understanding and experience, it seemed like the filesystem would lock I/O access to the file-backed shared memory objects if multiple processes were trying to access them. My understanding is this was the filesystem's way of ensuring consistency of files across multiple physical nodes. This problem was present whether or not you actually created a backing file on disk or let bigmemory store that temporary file purely in memory. Explicitly requesting a single node from the SLURM scheduler also did not alleviate the issue.
The way I got around this was to move all my parallel code from R into C++. I wrote a multithreaded procedure where each thread gained access to my large matrices via a pointer to each matrix passed to that thread. Use of shared memory in this way worked fine. However, this was a completely different problem to sharing objects across R sessions.
What about using disk.frame?
https://github.com/xiaodaigh/disk.frame
It supports most dplyr verbs and data.table syntax.
I'm looking into ways of sharing a data.table among several R processes on the same machine by reference; is there already one that I missed? I'm looking for functionality analogous to this:
https://www.rdocumentation.org/packages/bigmemory/versions/3.12/topics/describe%2C%20attach.big.matrix
Thank you for the great work on this package.