DeppLearning opened 6 years ago
There might be some way to use address(DT)? And/or attr(DT, '.internal.selfref')?
Thank you for the suggestion. I made similar attempts, but after reading up I believe you simply can't do this directly, since these are externalptr objects, which are of no use outside their originating R session.
File-backed data.tables (#1336) might help (?), though I imagine they would have to be locked for editing/writing (like .SD is) under some conditions.
There is only one way to share memory across processes: one of the processes has to allocate a new shared memory region using mmap and give it a unique id. This id then has to be communicated to the other process through traditional means (such as pipes). The other process can then mmap the same memory region via the id that it received. If you think of these ids as file names, then the process is exactly equivalent to one process creating a temporary file and the other process reading from that file, except that the file will be located in memory. Actually, it doesn't even have to be -- you can open and share a regular file instead.
Now, the difficult part is to have a data.table object living in a file. This is something that requires intricate knowledge of R's internal API. How do you place a single R vector at a given memory address? How do you ensure that R will not attempt to resize or reclaim this vector? What do you do with string columns, where each string is a reference into the global string cache (and the global string cache cannot be shared)?
These are all quite hard questions, and I don't know the answers. However, it is possible in principle. At least Python datatable solved exactly this problem successfully: a Frame can be saved into a .jay file in one process, then memory-map opened in any number of other processes, and copy-on-write semantics ensure that no changes can be accidentally leaked.
Moreover, how could process B know whether the memory is still valid (rather than garbage), since process A may already have been killed?
@st-pasha and @shrektan, the bigmemory R package that @weltherrschaf references solves this problem for matrix objects (i.e. vectors with dimensions attached), so it can be done. Presumably the bigmemory internals could be extended to store multiple vectors (because a data.table is a data.frame, and a data.frame is really a list() of vectors).
There is one caveat to be aware of with this type of file-backed shared-memory object: some (all?) HPC clusters with multiple nodes hate these. If you create a file-backed shared-memory object and try to access it from multiple R sessions, the object essentially locks up those processes due to the consistency checks made by the HPC filesystem (because those processes might be spread over multiple nodes, even if you explicitly ask for the same node). This is something I had the (dis)pleasure of learning when trying to publish my NetRep R package during my PhD: after the paper was accepted and passed software review, I discovered this problem and ended up having to rip out the internals and quickly learn C++ so I could parallelise the code (by casting to a C++ Armadillo matrix and writing multithreaded code that operated on those C++ objects in shared memory).
Actually, we could probably just do something simple with the bigmemory package: write a function that converts each column to a big.matrix object, and another that loads those matrix-columns and wraps them in a data.table in your new R session. I might play with this over the weekend.
I used bigmemory for a while and it's not ideal. Attaching a big.matrix can take quite a while. Additionally, the package tends to accumulate temp files that you might have to clean up yourself once in a while.
I think I'd prefer something like feather (https://github.com/wesm/feather), which apparently uses dplyr, is file-backed via the Apache Arrow format (https://arrow.apache.org/), and can be used directly from Python and a bunch of other languages. feather with data.table instead of dplyr would be great.
Since you mentioned feather, you may want to have a look at the fst package (https://github.com/fstpackage/fst) if you haven't already. The roadmap is promising: https://github.com/fstpackage/fst/issues/117
fst is great; I used it in my last project and never had any issues with it. I didn't know about their roadmap though. Feather/Apache Arrow is interesting due to its promise of sharing data by reference within quite a rich ecosystem of languages and services.
It is my understanding that both fst and feather provide fast serialization/deserialization to/from file, but do not allow a data.frame to actually exist in a file. This may be a reasonable alternative, although it is not true sharing of the data.
At the same time, arrow attempts to implement a true file-backed data frame; however, this will not be an R data.frame (even for primitive types such as integer or numeric, Arrow's format is different from R's). As such, they'd probably need to re-implement all data.frame functionality from scratch...
Follow here for the R implementation of arrow.
@st-pasha
How do you place a single R vector at a given memory address?
The C API offers allocVector3() for this purpose.
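For context, a rough, untested C sketch of how allocVector3() could back an R vector with file-mapped memory. The allocator struct is from R_ext/Rallocators.h (available since R 3.1.0); the path handling, error checks, and deallocation bookkeeping are simplified assumptions, not a working implementation:

```c
/* Sketch only: compile against R headers; not standalone. */
#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>
#include <Rinternals.h>
#include <R_ext/Rallocators.h>

static void *mmap_alloc(R_allocator_t *allocator, size_t size) {
    /* allocator->data carries the backing file path (must outlive the vector) */
    int fd = open((const char *) allocator->data, O_RDWR | O_CREAT, 0600);
    ftruncate(fd, (off_t) size);   /* size includes R's vector header */
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);
    return p == MAP_FAILED ? NULL : p;
}

static void mmap_free(R_allocator_t *allocator, void *ptr) {
    /* a real implementation must track the mapped size to munmap() here */
    (void) allocator; (void) ptr;
}

/* Allocate a numeric vector whose storage lives in a memory-mapped file.
   R calls mmap_free when the vector is garbage collected. */
SEXP alloc_mapped_numeric(R_xlen_t n, const char *path) {
    R_allocator_t alloc = { mmap_alloc, mmap_free, NULL, (void *) path };
    return allocVector3(REALSXP, n, &alloc);
}
```

Note this only answers the "place a vector at a given address" question; the string-cache and cross-process validity problems raised above remain.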
How do you ensure that R will not attempt to resize or reclaim this vector?
Do you feel this is a problem? Properly PROTECTed objects shouldn't be gc'd or anything, I don't think. Do you have a specific problem in mind here? Or is this more a FUD-type statement?
What to do with string columns ... ?
This might be more of a headache. Maybe ALTREPs? Maybe this also yields speedups for other areas, such as sort(), unique(), and is.na() for character vectors?
I'm mostly name-checking here. Does anyone with intimate familiarity with the data.table internals have an opinion on the feasibility of shared-memory data.tables?
@nbenn Thanks, this hits the mark. If R has a custom memory allocator mechanism, then it will certainly know to call the user-provided custom de-allocator when the time comes.
@sritchie73 can you shed more light on your experiences with file-backed shared-memory objects in HPC environments? I gather you had problems with file-backed bigmemory::big.matrix objects, not with shared memory in general? I just did a quick test (small scale: 1-3 cores) with non-file-backed bigmemory::big.matrix objects, and the HPC results (LSF-managed cluster, CentOS 7, BL460c Gen9 nodes) are consistent with results obtained locally on a MacBook Pro.
I would not expect the file system to interfere with management of shared memory. Furthermore, for example when applying a function to a data.table in parallel group-by fashion, locking mechanisms are not necessary at all, as it is guaranteed that writes are never to the same location.
Furthermore, for example when applying a function to a data.table in parallel group-by fashion, locking mechanisms are not necessary at all, as it is guaranteed that writes are never to the same location.
I guess this depends on i being empty or having no dupes, since you can do these group-by operations with overlapping groups:
library(data.table)
DT = data.table(id = 1:3)
mDT = data.table(id = c(1L, 2L, 2L, 3L), g = rep(1:2, each=2))
# writes to row 2 twice in a join
DT[mDT, on=.(id), g := i.g, by=.EACHI]
# writes to row 2 twice with row number subset
DT[mDT$id, g := .BY[[1]], by=.(mDT$g)]
@nbenn digging into my old emails, the file system was GPFS; it was something to do with a conflict between the way the Boost headers used by bigmemory implement the file-backed shared memory objects (docs here) and the way GPFS handles calls to mmap(). See also this thread on the R-sig-hpc mailing list: first email here, remaining emails in this thread - note the replies from Jay Emerson, one of the authors and maintainers of the bigmemory package.
From my limited understanding and experience, it seemed like the filesystem would lock I/O access to the file-backed shared memory objects if multiple processes were trying to access them. My understanding is this was the filesystem's way of ensuring consistency of files across multiple physical nodes. This problem was present whether or not you actually created a backing file on disk or let bigmemory store that temporary file purely in memory. Explicitly requesting a single node from the SLURM scheduler also did not alleviate the issue.
The way I got around this was to move all my parallel code from R into C++. I wrote a multithreaded procedure where each thread gained access to my large matrices via a pointer to each matrix passed to that thread. Use of shared memory in this way worked fine. However, this was a completely different problem to sharing objects across R sessions.
What about using disk.frame?
https://github.com/xiaodaigh/disk.frame
It supports most dplyr verbs and data.table syntax.
I'm looking into ways of sharing a data.table among several R processes on the same machine by reference; is there already one that I missed? I'm looking for functionality analogous to this:
https://www.rdocumentation.org/packages/bigmemory/versions/3.12/topics/describe%2C%20attach.big.matrix
Thank you for the great work on this package.