C++ interop in same address space

mtanski commented 6 years ago

I'm interested in passing tabular data between C++ and Julia. Is it possible to do this in the same address space using the Julia and C++ Arrow libraries?

xhochy commented 6 years ago

This should be possible to implement. For an example of how we do this between Python/C++ and Java, see: https://github.com/apache/arrow/pull/2062

ExpandingMan commented 6 years ago

In principle yes, but in practice I don't know how many extra steps are required on top of what's already here to implement something like that.

In theory I think all that's needed would be for you to dump some data into Arrow format (e.g. by calling arrowformat on some arrays and using writepadded to write them into a buffer) and then use the appropriate protocol to communicate between the two programs (admittedly I don't know what this part looks like yet).
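
To make the "write padded into a buffer" idea concrete, here is a minimal C++ sketch. It is not Arrow.jl's writepadded or any Arrow library call: write_padded and kPadding are hypothetical names, and the 8-byte padding boundary is an assumption chosen for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>
#include <vector>

// Assumed padding boundary (hypothetical constant, chosen for illustration).
constexpr std::size_t kPadding = 8;

// Hypothetical helper: appends `count` raw values to `out`, zero-pads the
// written region to the next multiple of kPadding, and returns the byte
// offset at which the values start.
template <typename T>
std::size_t write_padded(std::vector<std::uint8_t>& out, const T* values, std::size_t count) {
    static_assert(std::is_trivially_copyable<T>::value, "raw byte copy needs a trivially copyable type");
    const std::size_t offset = out.size();
    const std::size_t nbytes = count * sizeof(T);
    const std::size_t padded = (nbytes + kPadding - 1) / kPadding * kPadding;
    out.resize(offset + padded, 0);  // new bytes, including padding, are zero
    if (nbytes > 0) {
        std::memcpy(out.data() + offset, values, nbytes);
    }
    return offset;
}
```

The returned offsets are what a reader would later need in order to locate each array inside the buffer, which is the layout question discussed further down in this thread.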

Help is welcome. Otherwise, stay tuned: there's likely to be some movement on this package in the coming weeks as we try to get it compliant with the main Arrow repo.

mtanski commented 6 years ago

How about zero copy in the same address space? That is, transfer (or better yet, borrow) a C++ Arrow table and use it in Julia. My use case is sharing data between C++ and Julia, where the Julia code would either be called in a callback (the borrow case) or use the result of the operation (the consume case, still zero copy).
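
The two cases can be sketched on the C++ side without any Arrow dependency. Everything here is a hypothetical stand-in: Column plays the role of one column of an Arrow table, and BorrowedColumn, with_borrowed and consume are illustrative names, not part of either Arrow library.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for one column of a table: a single contiguous buffer.
struct Column {
    std::vector<std::int64_t> values;
};

// Non-owning view: a pointer and a length, valid only while the owner lives.
struct BorrowedColumn {
    const std::int64_t* data;
    std::size_t length;
};

// Borrow case: the callback reads the owner's memory directly and must not
// keep the pointer after it returns.
std::int64_t with_borrowed(const Column& owner,
                           const std::function<std::int64_t(BorrowedColumn)>& callback) {
    return callback(BorrowedColumn{owner.values.data(), owner.values.size()});
}

// Consume case: ownership moves to the receiver. Moving the vector hands over
// its heap buffer, so the element data is not copied.
std::shared_ptr<const Column> consume(Column&& produced) {
    return std::make_shared<Column>(std::move(produced));
}
```

If the callback were implemented in Julia, the pointer and length could in principle be wrapped there with Base's unsafe_wrap, but how that would connect to Arrow.jl is exactly the open question in this issue.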

ExpandingMan commented 6 years ago

My understanding of how that would work is basically the following:

1. You'd have to use something (perhaps an API call from another package or a C++ wrapper?) to give you a data buffer in Julia, which is basically just a Vector{UInt8} (it could be an IOBuffer that contains one). It would be up to whatever you use to get that array to make sure this is zero copy. Unfortunately, at the moment I'm totally ignorant about what would be used to perform this initial step, but hopefully it's something simple.
2. You could then create the various Julia Arrow objects which refer to the appropriate parts of the buffer. In general, how this is done depends on the layout of the buffer. I've tried to streamline the layout specification as much as possible with the Locate interface (see README). There may be some sort of standard format and metadata for IPC (in fact, I think there was at least some of that); if so, it still needs to be implemented in Arrow.jl. In any case, creating the ArrowVector objects will not do any copying.
3. You will then have some ArrowVector objects which you can read from however you want. The semantics are the same as Array. So, if you have an ArrowVector v and do v[idx], this will create a copy for the indices idx. If you do view(v, idx) or @view v[idx], this will create a view, so there is no copying.
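
The steps above can be illustrated with a small C++ analogue. This is a sketch under stated assumptions, not Arrow.jl's implementation: Int64View is a hypothetical, simplified counterpart of an ArrowVector of 64-bit integers with no nulls, and the byte offset passed to it stands in for whatever the layout description (e.g. the Locate interface) would provide.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical typed window over part of a shared byte buffer. Creating one
// copies nothing; it only records where its values start.
class Int64View {
public:
    Int64View(const std::uint8_t* buffer, std::size_t byte_offset, std::size_t length)
        : start_(buffer + byte_offset), length_(length) {}

    std::size_t size() const { return length_; }
    const std::uint8_t* bytes() const { return start_; }

    // Reads one element straight from the buffer. memcpy sidesteps alignment
    // and aliasing issues; only sizeof(int64_t) bytes are touched.
    std::int64_t operator[](std::size_t i) const {
        assert(i < length_);
        std::int64_t value;
        std::memcpy(&value, start_ + i * sizeof(std::int64_t), sizeof value);
        return value;
    }

    // Analogue of view(v, idx): a narrower window over the same bytes.
    Int64View slice(std::size_t first, std::size_t count) const {
        assert(first + count <= length_);
        return Int64View(start_, first * sizeof(std::int64_t), count);
    }

    // Analogue of v[idx] with a range: an independent copy of the elements.
    std::vector<std::int64_t> copy(std::size_t first, std::size_t count) const {
        assert(first + count <= length_);
        std::vector<std::int64_t> out(count);
        if (count > 0) {
            std::memcpy(out.data(), start_ + first * sizeof(std::int64_t),
                        count * sizeof(std::int64_t));
        }
        return out;
    }

private:
    const std::uint8_t* start_;
    std::size_t length_;
};
```

The practical difference shows up when the underlying buffer changes: a slice observes the new bytes, while a copy keeps the values it had when it was made. The view is only valid for as long as the buffer it points into stays alive.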

Sorry I can't be of more help. This is certainly not enough to get something really polished, but perhaps it's enough for a rough implementation? Again, having something really polished depends to a large degree on the standardization of data layouts, since the Arrow format is quite general. (I need to go back and review the IPC stuff though; there's probably something.)

ExpandingMan / Arrow.jl

C++ interop in same address space #29