[Go] ipc.Writer Option to skip appending data buffers

apache / arrow-go

[Go] ipc.Writer Option to skip appending data buffers #76

Open asfimport opened 5 years ago

asfimport commented 5 years ago

For cases where we have a known shared memory region, it would be great if the ipc.Writer (and by extension ipc.Reader?) had the ability to write out everything but the actual buffers holding the data. That way we can still utilize the ipc mechanisms to communicate without having to serialize all the underlying data across the wire.

This seems like it should be possible since the RecordBatch flatbuffers only contain the metadata and the underlying data buffers are appended later. We just need to skip appending the underlying data buffers.
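To make the idea concrete, here is a minimal sketch of the framing, assuming a simplified layout loosely modelled on the IPC encapsulated message (continuation marker, metadata length, 8-byte padded metadata, then body). The metadata is treated as opaque bytes rather than a real flatbuffer, and `writeMessage`, `messageSize`, and the `skipBody` flag are hypothetical names for illustration, not part of the existing ipc.Writer API.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// paddedLen rounds n up to the next multiple of 8.
func paddedLen(n int) int { return (n + 7) &^ 7 }

// writeMessage frames one message: a continuation marker, the padded metadata
// length, the metadata itself plus zero padding, and then the body.
// skipBody is the hypothetical option from this issue: when set, the body
// bytes are left out and only the metadata goes over the wire.
func writeMessage(w *bytes.Buffer, metadata, body []byte, skipBody bool) {
	var hdr [8]byte
	binary.LittleEndian.PutUint32(hdr[0:4], 0xFFFFFFFF)
	binary.LittleEndian.PutUint32(hdr[4:8], uint32(paddedLen(len(metadata))))
	w.Write(hdr[:])
	w.Write(metadata)
	w.Write(make([]byte, paddedLen(len(metadata))-len(metadata)))
	if !skipBody {
		w.Write(body)
	}
}

// messageSize reports how many bytes writeMessage emits for the given input.
func messageSize(metadata, body []byte, skipBody bool) int {
	var buf bytes.Buffer
	writeMessage(&buf, metadata, body, skipBody)
	return buf.Len()
}

func main() {
	metadata := []byte("record-batch-metadata") // stand-in for the flatbuffer
	body := make([]byte, 1024)                  // stand-in for the data buffers

	fmt.Println("with body:   ", messageSize(metadata, body, false))
	fmt.Println("without body:", messageSize(metadata, body, true))
}
```

With the body skipped, the size of the message depends only on the metadata, which is the property the shared-memory use case relies on.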

@sbinet thoughts?

Reporter: Nick Poorman / @nickpoorman

Note: This issue was originally created as ARROW-6107. Please see the migration documentation for further details.

asfimport commented 5 years ago

Sebastien Binet / @sbinet: not saying it wouldn't be advisable or doable, but if the data is already in a shmem region, why not just use that directly?

(and I guess it's kind of implementing: https://issues.apache.org/jira/browse/ARROW-4852)

asfimport commented 5 years ago

Nick Poorman / @nickpoorman: https://issues.apache.org/jira/browse/ARROW-4852 is the same use case I'm thinking of.

If you have an Arrow Table in C (or Python) and you want to access the data in Go, you can pass a pointer back from C to the underlying data buffers. However, you still have to collect all the metadata before you can use those buffers. CGO calls are slow, so passing just two pointers, one to the data buffers and one to the serialized metadata, would keep the cost of crossing the language boundary roughly constant instead of growing with the amount of metadata.
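A rough sketch of the receiving side, assuming the serialized metadata boils down to a list of (offset, length) pairs into the shared region. The 16-byte encoding and the names `bufferSpec`, `encodeSpecs`, and `viewBuffers` are made up for illustration; the real metadata would be the RecordBatch flatbuffer. The point is that the buffers are rebuilt as views over the region without copying anything.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// bufferSpec locates one data buffer inside the shared region.
type bufferSpec struct{ Offset, Length int64 }

// encodeSpecs serializes the specs as little-endian (offset, length) pairs.
// This is a hypothetical stand-in for the real flatbuffer metadata.
func encodeSpecs(specs []bufferSpec) []byte {
	out := make([]byte, 16*len(specs))
	for i, s := range specs {
		binary.LittleEndian.PutUint64(out[16*i:], uint64(s.Offset))
		binary.LittleEndian.PutUint64(out[16*i+8:], uint64(s.Length))
	}
	return out
}

// viewBuffers decodes the metadata and returns one slice per buffer.
// Each slice aliases region, so no data is copied.
func viewBuffers(meta, region []byte) ([][]byte, error) {
	if len(meta)%16 != 0 {
		return nil, errors.New("metadata length is not a multiple of 16")
	}
	views := make([][]byte, 0, len(meta)/16)
	for i := 0; i < len(meta); i += 16 {
		off := int64(binary.LittleEndian.Uint64(meta[i:]))
		n := int64(binary.LittleEndian.Uint64(meta[i+8:]))
		if off < 0 || n < 0 || off > int64(len(region)) || n > int64(len(region))-off {
			return nil, fmt.Errorf("buffer %d is out of range", i/16)
		}
		// Cap the view so an append cannot spill into the next buffer.
		views = append(views, region[off:off+n:off+n])
	}
	return views, nil
}

func main() {
	region := make([]byte, 64) // stands in for the shared memory region
	meta := encodeSpecs([]bufferSpec{{0, 8}, {8, 32}})

	views, err := viewBuffers(meta, region)
	if err != nil {
		panic(err)
	}
	region[8] = 42 // a write by the producer is visible through the view
	fmt.Println(len(views), views[1][0])
}
```

In this shape only two values need to cross the boundary (the metadata bytes and the region), regardless of how many buffers the table has.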

I did a simple PoC to demonstrate what it would take to collect all the information from Python and re-materialize it in Go (https://github.com/nickpoorman/go-py-arrow-bridge). The bottleneck is the number of CGO calls required to fetch all the metadata.

asfimport commented 5 years ago

Sebastien Binet / @sbinet: ok.

(just nit-picking but to really assess the CGo overhead, one should directly call C, not C++-via-python :P. that said, it's a nice PoC.)

SGTM.