capnproto / capnp-ocaml

OCaml code generator plugin for the Cap'n Proto serialization framework

Using Bigarray as message storage #49

Open andreas opened 6 years ago

andreas commented 6 years ago

The README mentions using Bigarray as message storage, but I haven't been able to find any examples in this repo or elsewhere. I've implemented a module using Bigstring that satisfies Capnp.MessageSig.S, but it's still not clear to me how to serialize/deserialize in a zero-copy fashion, e.g. using Writer and Reader from Async. If you can point to any examples of using Capnp with Bigarray, I would appreciate it.

Thanks 🙏

pelzlpj commented 6 years ago

I'm not aware of any examples using Bigarray or Bigstring. But if you've implemented a module that satisfies Capnp.MessageSig.S, you're just about done. Examples from the benchmark might be helpful: https://github.com/capnproto/capnp-ocaml/blob/master/src/benchmark/capnpCarsales.ml The first line instantiates the Carsales.Make functor on BytesMessage; if you instantiate on your Bigstring-based module instead, in theory that should give you zero-copy semantics for most of the struct field accessors. ("Most" because string fields require a copy for reasons of API practicality.)

Of course, if you want to send your message across some channel, the I/O is going to look different because you're not using Bytes-backed storage. The benchmark is based on Unix read and write (https://github.com/capnproto/capnp-ocaml/blob/master/src/benchmark/methods.ml) and I guess you would need to replace that with something that knows about Bigstring.

talex5 commented 6 years ago

I had a brief look at a Cstruct-backed version once. As I recall, the main problem was that https://github.com/capnproto/capnp-ocaml/blob/master/src/runtime/codecs.mli only works on BytesMessage (but that wouldn't be too hard to fix).

pelzlpj commented 6 years ago

Note that if you are actually trying to do message passing via mapped memory, you'll have some extra work to do.

When sending messages across a channel, Cap'n Proto specifies a standardized message framing format as well as a compression scheme. Messages get a small header prepended so that the receiver knows what's coming (how many segments in the message, and how long the segments are). This logic is captured in codecs.mli, and it's not generalized beyond BytesMessage because it wasn't clear whether it makes sense for other message storage formats.

If you're using a shared memory transport, Cap'n Proto does not (yet) specify a format for the message framing information. The process which builds the message has to somehow communicate to the reader process some of the metadata about the message: where are the message segments located within your mapped buffer, and how big are they? You would have to decide on a convention for passing this information, and you would also have to ensure that the builder and reader appropriately synchronize their accesses to the buffer (e.g. with semaphores).

andreas commented 6 years ago

Thanks for the input! My use case is efficiently folding over a large file containing many small messages (current implementation uses bin_prot and suggests there's time to be saved on deserialization).

If I understand correctly, I'll have to handle framing myself as described in the spec:

- (4 bytes) The number of segments, minus one (since there is always at least one segment).
- (N * 4 bytes) The size of each segment, in words.
- (0 or 4 bytes) Padding up to the next word boundary.
- The content of each segment, in order.
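
For illustration, the framing layout quoted above could be parsed from a Bigarray-backed buffer roughly as follows. This is a minimal sketch, not capnp-ocaml code: the names `bigstring`, `get_u32_le` and `parse_frame` are hypothetical, the message is assumed to be unpacked (no compression), integers are little-endian as in the Cap'n Proto encoding, and a 64-bit OCaml build is assumed. Segments are returned as `Bigarray.Array1.sub` views, so no segment data is copied.

```ocaml
(* Sketch only: parse the stream-framing header from a Bigarray-backed
   buffer and return zero-copy views of each segment.
   All names here are hypothetical, not part of capnp-ocaml. *)
open Bigarray

type bigstring = (char, int8_unsigned_elt, c_layout) Array1.t

(* Read a little-endian unsigned 32-bit integer at byte offset [off].
   Assumes a 64-bit OCaml build so the result fits in an [int]. *)
let get_u32_le (buf : bigstring) off =
  let b i = Char.code (Array1.get buf (off + i)) in
  b 0 lor (b 1 lsl 8) lor (b 2 lsl 16) lor (b 3 lsl 24)

(* A Cap'n Proto word is 8 bytes. *)
let word_size = 8

(* Parse one framed message starting at byte offset [off].
   Returns the segments as sub-views of [buf], plus the offset of the
   first byte after the message. Out-of-range sizes make [Array1.sub]
   raise [Invalid_argument]. *)
let parse_frame (buf : bigstring) off =
  (* The first field stores the segment count minus one. *)
  let nsegs = get_u32_le buf off + 1 in
  let sizes = List.init nsegs (fun i -> get_u32_le buf (off + 4 + 4 * i)) in
  (* The header is 4 + 4 * nsegs bytes, padded up to a multiple of 8. *)
  let header_len = (4 + 4 * nsegs + 7) land (lnot 7) in
  let segments, next =
    List.fold_left
      (fun (acc, pos) words ->
        let len = words * word_size in
        (Array1.sub buf pos len :: acc, pos + len))
      ([], off + header_len) sizes
  in
  (List.rev segments, next)
```

Calling `parse_frame buf 0` and then feeding the returned offset back in walks through consecutive messages in one buffer. Turning the resulting segment views into a message for a Bigstring-based storage module is left out here, since that depends on the module's own interface.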

That seems fairly simple. Feel free to close the issue -- I'll report back if anything meaningful comes out of it.

pelzlpj commented 6 years ago

Hard to know without trying it, but I suspect that Bigstring storage isn't going to help much for that use case. mmap() tricks generally won't outperform read() if you're just walking through a file sequentially. Under that assumption, you might find that IO.create_read_context_for_channel is close to optimal for decoding messages.
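
To make the sequential-read approach concrete, here is a minimal sketch of folding over a file of framed messages using only buffered channel reads from the standard library. It does not use the capnp-ocaml API: `read_exact`, `u32` and `fold_frames` are hypothetical helper names, the messages are assumed to be unpacked, and a 64-bit OCaml build (4.08 or later, for `Bytes.get_int32_le`) is assumed. The callback receives the raw segments of each message; decoding them is out of scope for this sketch.

```ocaml
(* Sketch only: fold over every framed, unpacked message in a channel
   using plain buffered reads. These helpers are hypothetical and are
   not part of capnp-ocaml. *)

(* Read exactly [n] bytes from [ic]; raises [End_of_file] if fewer remain. *)
let read_exact ic n =
  let b = Bytes.create n in
  really_input ic b 0 n;
  b

(* Little-endian unsigned 32-bit read (assumes 63-bit OCaml ints). *)
let u32 b off = Int32.to_int (Bytes.get_int32_le b off) land 0xFFFFFFFF

(* Call [f acc segments] once per message, in file order.
   A clean end of file between messages ends the fold; a file truncated
   in the middle of a message lets [End_of_file] escape. *)
let fold_frames ic ~init ~f =
  let rec loop acc =
    match read_exact ic 4 with
    | exception End_of_file -> acc
    | count ->
      (* The first field stores the segment count minus one. *)
      let nsegs = u32 count 0 + 1 in
      (* Rest of the header: size table plus padding to an 8-byte boundary. *)
      let rest = ((4 + 4 * nsegs + 7) land (lnot 7)) - 4 in
      let table = read_exact ic rest in
      (* Segment sizes are given in 8-byte words. *)
      let segments =
        List.init nsegs (fun i -> read_exact ic (8 * u32 table (4 * i)))
      in
      loop (f acc segments)
  in
  loop init
```

For example, `fold_frames ic ~init:0 ~f:(fun n _ -> n + 1)` counts the messages in a channel opened with `open_in_bin`.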