JuliaInterop / JuliaCall

Embed Julia in R
https://non-contradiction.github.io/JuliaCall/index.html

Question: is it efficient to pass a large dataframe between Julia and R? #114

Closed: zjpi closed this issue 5 years ago

zjpi commented 5 years ago

I can't seem to figure this out. The mechanism for passing data between Julia and R is not that efficient, right? So I shouldn't write applications that pass a big data frame to Julia, process it, and then pass it back to R, right?

PallHaraldsson commented 5 years ago

EDIT (this should apply to JuliaCall too): https://discourse.julialang.org/t/how-does-rcall-transfer-data-between-r-and-julia/12378/2?u=palli

In RCall every data from R to Julia will invoke function rcopy, which will create a copy of the original R data in Julia (most of the time). [..] So it is neither instantaneous nor as slow as writing and reading a file on a disk. Or it can be seen as “instantaneous” if the size of data is not big. As to the dataframe, I think things will become a little more complicated than a vector. Still, RCall will create a copy in Julia with the same content, and since the dataframe is big, don’t expect it to be “instantaneous”.

And what is the purpose? If you want to read a dataframe in Julia, it should be better to directly read it into Julia than read it using RCall and then copy it to Julia.

It should work (unless you run out of memory because of the implicit copying). As for whether it's efficient, I didn't think so at first, but now I think I may have had the wrong idea and it should be fast. You could benchmark, or just try. If it doesn't get disproportionately slower with ever larger objects (i.e. dataframes), then there is likely no problem.
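A minimal sketch of the kind of scaling check suggested above. It is a Python analogy using only the standard library, not JuliaCall or RCall code: the dict of `array` columns is a hypothetical stand-in for a dataframe, and `time_copy` is an illustrative name. The idea is to time an in-memory copy at growing sizes and see whether the cost per row stays roughly flat.

```python
import time
from array import array


def time_copy(n_rows, n_cols=5, repeats=3):
    """Return the best wall-clock time (seconds) to copy a column-oriented table."""
    # Hypothetical stand-in for a dataframe: a dict of float64 columns.
    table = {f"c{i}": array("d", range(n_rows)) for i in range(n_cols)}
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        # Copy every column into a fresh buffer, as an in-memory transfer would.
        copied = {name: array("d", col) for name, col in table.items()}
        best = min(best, time.perf_counter() - start)
    assert copied == table  # sanity check: the copy has the same content
    return best


if __name__ == "__main__":
    for n in (10_000, 100_000, 1_000_000):
        t = time_copy(n)
        print(f"{n:>9} rows: {t:.6f} s  ({t / n * 1e9:.1f} ns per row)")
```

If the time per row stays about constant as the table grows, the copy is scaling linearly and is probably not the bottleneck; the same experiment would have to be repeated with the real R-to-Julia transfer to say anything about JuliaCall itself.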

It's a good question, but I think a forum might be a better place to ask than GitHub (at least Julia's GitHub isn't for Julia questions). It need not be an R forum: I have in mind discourse.julialang.org, which is for general Julia discussion, possibly for RCall.jl, and thus for this project too.

I haven't looked too closely, nor have I used this project, but if it works like PyCall.jl does (which I assumed it DIDN'T, and am no longer sure of), then it doesn't need to do any copying. In contrast, JavaCall.jl would probably need copying when calling Java from Julia (they work in different processes/address spaces and do not share a garbage collector).

JuliaCall is built on RCall.jl (which is for calling in the other direction), so the same potential limitations should apply when calling in either direction.

RCall.initEmbeddedR (and related) should have the answer:

https://github.com/JuliaInterop/RCall.jl/blob/2d9bb75908d4e1be73766c7375953bd2a2659055/src/setup.jl#L57

juliainterop.github.io/RCall.jl/v0.5.0/public/

Issues 12, 13 and https://github.com/Non-Contradiction/JuliaCall/issues/16 (all now closed, however) might have good info: "And the conversion of JuliaArray between R and Julia is slow and take too much memory (related to last problem)"

PallHaraldsson commented 5 years ago

See my edit to previous comment with "it is neither instantaneous nor as slow as writing and reading a file on a disk."

Non-Contradiction commented 5 years ago

Just as Pall said, you need to try this yourself. In general, passing a dataframe between Julia and R involves copying in memory, and the performance depends on several factors.

Below is a sketch of the whole R-to-Julia copying process; the Julia-to-R process is somewhat similar. From R to Julia: each column in the R dataframe is copied to a Julia vector, and then the Julia vectors are used to create a Julia dataframe. The key factors here are the number of columns, the datatype of each column, and the performance of DataFrames.jl when creating dataframes from vectors. So if you have enough memory, not too many columns, and column datatypes that are well handled by Julia, then the overall performance should be good. Otherwise you had better do some benchmarking.
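The two steps described above (copy each column, then assemble a new table from the copies) can be modeled roughly as follows. This is a Python analogy under stated assumptions, not the RCall or JuliaCall implementation: the dict of columns stands in for a dataframe, and `copy_frame` is a hypothetical name.

```python
from array import array


def copy_frame(source):
    """Copy a column-oriented table one column at a time.

    `source` maps column names to sequences and stands in for the R dataframe;
    the returned dict stands in for the Julia dataframe built from the copies.
    """
    copied_columns = {}
    for name, column in source.items():
        # Step 1: copy each column into a fresh, independent vector.
        if isinstance(column, array):
            # Typed numeric column: one bulk copy of the underlying buffer.
            copied_columns[name] = array(column.typecode, column)
        else:
            # Generic column (e.g. strings): a new list holding the same elements.
            copied_columns[name] = list(column)
    # Step 2: assemble the new table from the copied vectors.
    return copied_columns


r_like = {"x": array("d", [1.0, 2.0, 3.0]), "label": ["a", "b", "c"]}
julia_like = copy_frame(r_like)

julia_like["x"][0] = 99.0   # mutate the copy only
print(r_like["x"][0])       # the original column is unchanged
```

The work grows with both the number of columns (one pass of the loop each) and the cost of copying each column's datatype, which matches the factors listed above.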

Note: although JuliaCall now provides a wrapper mechanism to suppress the copying, copying is still needed in transitions from a pure R dataframe to a pure Julia dataframe and vice versa, so expect at least one or two copies for each dataframe.

xiaodaigh commented 5 years ago

OK, I have the answer now: it copies the data in memory. I wonder if a no-copy approach is possible in the future, especially now that arrow is on CRAN.

zjpi commented 5 years ago

The answer is no then. I would only consider an Arrow-style no-copy pass as fast.

Non-Contradiction commented 5 years ago

The arrow project looks very interesting and promising. I think the data in arrow form can be accessed directly from Julia with a little trick and help from some arrow Julia packages.

But I'm afraid that pure R data structures still need to be copied to Julia in most cases, in order to get around R's C API and make use of Julia's performance advantage. Or maybe someone could come up with a clever way to interpret R's data structures as Julia's data structures, and thus a satisfying no-copy solution. And maybe the Arrow R package can be a good starting point. And maybe we can have R -- Arrow -- Julia as an option for JuliaCall?
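To make the difference between copying and "interpreting one side's data structure as the other's" concrete, here is a small Python analogy (standard library only; it does not use R, Julia, or Arrow). A `memoryview` reinterprets an existing buffer without copying it, whereas constructing a new `array` copies it.

```python
from array import array

column = array("d", [1.0, 2.0, 3.0, 4.0])  # owner of the underlying buffer

copied = array("d", column)   # copy: a new, independent buffer
view = memoryview(column)     # no copy: a second interpretation of the same memory

column[0] = 42.0              # write through the owner

# The copy keeps the old value; the view sees the new one.
seen_by_copy, seen_by_view = copied[0], view[0]
print(seen_by_copy, seen_by_view)

# The owner cannot be resized while a view is exported, so release it when done.
view.release()
```

The last line hints at why a no-copy approach is harder across two runtimes: whoever holds the view depends on the owner keeping the memory alive and in place, which has to be coordinated between the two sides.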

xiaodaigh commented 5 years ago

Well, I think the idea is an R data.frame-like object that is composed purely of Arrow vectors? But R -- Arrow -- Julia might be interesting. I don't understand the details or the pros and cons.

Non-Contradiction commented 5 years ago

The hypothetical R--Arrow--Julia way will add some overhead to the whole transfer process, so it won't do any good for small objects but can be very helpful for big ones. The good news is that the current RCall+JuliaCall mechanism makes it very easy to mark certain R objects so that they follow a particular transfer procedure whenever a transfer happens. So the hypothetical R--Arrow--Julia method can be used only for the big R objects that carry a special mark.
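The "special mark" idea can be sketched as a simple dispatch on a wrapper type. This is a Python illustration only: `ViaInterchange` and `transfer` are hypothetical names that do not exist in JuliaCall or RCall, and a `memoryview` stands in for the hypothetical Arrow-based route.

```python
from array import array


class ViaInterchange:
    """Hypothetical marker: wrap an object to request the shared-buffer route."""

    def __init__(self, payload):
        self.payload = payload


def transfer(obj):
    """Return (path_name, data) for the transfer path chosen for `obj`."""
    if isinstance(obj, ViaInterchange):
        # Marked objects: hand over a view of the existing buffer
        # (some fixed setup cost, but no per-element copy).
        return "interchange", memoryview(obj.payload)
    # Unmarked objects: the default copy path, which is cheapest for small data.
    return "copy", array(obj.typecode, obj)


small = array("d", [1.0, 2.0])
big = array("d", range(1_000_000))

print(transfer(small)[0])
print(transfer(ViaInterchange(big))[0])
```

Because the mark is explicit, small objects keep the simple copy path and only the objects a user has wrapped pay the extra setup cost of the alternative route.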