Get rid of ctypes, use polymorphic interface

madroach commented 5 years ago

This is a huge change.

The biggest change is getting rid of ctypes (now 2x faster).
Use only polymorphic interface, but backed by first-class modules
Use the RESERVE interface when possible
Use GET_MULTIPLE when possible
finer-grained phantom types to encode read/write permissions and support for duplicates
implement more movements for cursor_get
implement iter and map iterators
add appropriate tests
add / adapt documentation
switch from oasis to dune
several small fixes

Drup commented 5 years ago

I really appreciate the enthusiasm, but this is a huge mix of changes, some of which I agree and some of which I really don't, concatened in a single commit with a 4K line diff, how do you expect me to possibly look at that ... ?

madroach commented 5 years ago

Admittedly it is more of a rewrite than a change. Which changes would you like as separate commits?

braibant commented 5 years ago

I happen to also have written some tentative wrapper around lmdb recently (the reason was that historically, @Drup bindings were not on opam and I failed to discover this repository) https://github.com/braibant/ocaml-lmdb

I am really curious about the changes proposed in this pull request:

what's the cost of ctypes that's being removed here? I would have thought that using ctypes was non-controversial when writing bindings (and writing bindings using ctypes does not contain any design decision (almost, proviso apply to string passing): design decisions will be taken and implemented in whatever wrapper is written in ocaml).
do you see a use in having fine grained typing for transactions (read-write, vs read-only)?
one of my goals in the bindings is to reduce the amount of string copies to use C bindings, using similar techniques to: https://github.com/janestreet/zstandard/blob/fc430f5f53c8b5067f1ca9ee242a865ae930c2d1/src/zstandard.mli#L51

madroach commented 5 years ago

what's the cost of ctypes that's being removed here?

The code with custom C-stubs is about 120x faster. I suspect the most costly part of the ctypes bindings is the pointer arithmetics and structure initialisation (mallocs!!) done from the OCaml side. When allocating small buffers (MDB_val) from the C side, they can be allocated on the stack, which is much cheaper. About 44% of the slowdown where caused by 6a37d7f4, which fixes a use-after-free bug.

This bug was well hidden in OCaml code. This is where ctypes can be dangerous, because it sometimes keeps implicit references for the garbage collector and sometimes doesn't. You don't have explicit malloc and free.

In my (limited) experience you won't gain much by the type-safety added by ctypes.

writing bindings using ctypes does not contain any design decision

I agree. Converting to custom C stubs took me about half a day.

do you see a use in having fine grained typing for transactions (read-write, vs read-only)?

The use of phantom types is to trigger errors at compile time instead of runtime and triggering errors reliably even if they are only very rarely triggered at runtime.

one of my goals in the bindings is to reduce the amount of string copies to use C bindings, using similar techniques to zstandard

My C stubs are zero-copy and they pass around only bigstrings. The only memcpy happens in the conversion from OCaml strings to mmapped!! bigstrings. Implementing zero-copy for OCaml strings would only move the memcpy into lmdb, then doing a memcpy from the OCaml heap into the mmap. This is only true for values, not for keys.

madroach commented 5 years ago

@Drup: I can split the diff into smaller changes. But maybe we could first talk about the changes to lmdb.mli and get consensus on the interface?

Drup commented 5 years ago

I would at the very least like the ctypes removal and the rest separated. As we discussed previously, I think you are overdoing it quite a lot with the permission, but I wanted to see the polymorphic API before trying to put my own grain of salt, so bundling those part is fine.

@braibant One of my goal when writing this library was 1) to learn about ctypes 2) to obtain a safe API, and prevent mixing of flags that don't make sense, in particular 3) to obtain an API that was very "Hashtable"-like, which means you can customize both keys and values to arbitrary types, as long as you provide the appropriate serializations.

So, in a sense, it's higher level. I must admit that I'm not an expert on C stubs at all. @madroach The problem with moving this library to raw C stubs is that I stop being able to maintain it. ctypes is reasonably easy for me. Ideally, you could try to pinpoints exactly the places that need finer control, and improve ctypes to support that.

madroach commented 5 years ago

The problem with moving this library to raw C stubs is that I stop being able to maintain it.

Obviously I dealt a lot with your lmdb bindings. So I can offer to (co-)maintain it.

ctypes is reasonably easy for me.

Debugging the use-after-free bug took me several days (!!) reading lmdb C code, ctypes C and OCaml code, running gdb. That's how I learned Ctypes. In contrast, implementing the whole of the custom C stubs took me half a day. So the ctypes interface may be simple, but the (partly implicit) memory management makes it a real pain to debug. It is a huge and complex library.

Ideally, you could try to pinpoints exactly the places that need finer control, and improve ctypes to support that.

there are several problems I see with ctypes:

the implicit memory management. See also https://github.com/ocamllabs/ocaml-ctypes/issues/571. This may be addressed by changes in ctypes.
manipulating C data structures from OCaml is cumbersome. You have to understand firstly the structure layout and semantics of the C structure you are manipulating, then you have to understand how to accomplish this using Ctypes, finally you always need to keep an eye for the garbage collector (see 1.). This is by design and won't change.
Ctypes has no ability to allocate on the C stack and therefore needs to allocate even small transient structures on the heap by calls to malloc. In the lmdb case this has a huge performance impact. This is mostly by design, too and very probably won't change.

Drup commented 5 years ago

Obviously I dealt a lot with your lmdb bindings. So I can offer to (co-)maintain it.

Absolutely, and I was definitely going to propose it. ;) My remarks were not an opposition at all, I just want to be able to evaluate all that a bit separately from the API changes, which are significantly easier for me to look at.

braibant commented 5 years ago

Is the issue with allocating MDB_val boiling down to the fact that you end up reallocating them, or is there something deep going on where you end up allocating small payloads on the stack? If it's the former, I really think that this should be addressed post bindings with ctypes.

Generally, my intuition about ctypes bindings is that they have three layers:

binding the types
binding the functions and generating stubs (the wiggle room often seeming related to "what kind of view should I use when passing strings to C functions")
wrapping the stubs: this is the only "interesting" part, where one will decide how to expose some interface, with different APIS, and where one might see use after free bugs.

In that regard, it could make sense to have multiple APIs exposed in OCaml, using the same backend bindings (the bindings being a commodity). What do you think?

madroach commented 5 years ago

Is the issue with allocating MDB_val boiling down to the fact that you end up reallocating them [...] If it's the former, I really think that this should be addressed post bindings with ctypes.

I'm not sure what you think when you write "reallocate". The point is that malloc gets called very ofter for MDB_vals, which is costly. Of course, you could work around this post-ctypes by implementing a pool of MDB_vals. But how far do you want to go working around problems with ctypes in your OCaml code?

In that regard, it could make sense to have multiple APIs exposed in OCaml, using the same backend bindings (the bindings being a commodity). What do you think?

Yes. That may make sense. I experimented with two apis. One functorial like Drup's interface and one polymorphic. So very much like the stdlib Hashtbl module. When benchmarking them I found the benefit of the functorial interface over the polymorphic one to be negligible. So I propose to drop the functorial interface. A "raw" zero-copy interface can be obtained from the polymorphic (or functorial) interface using a Bigstring converter, which simple returns / accepts bigstrings pointing directly into the lmdb mmap. What other interfaces are you proposing?

braibant commented 5 years ago

To the extreme: do you need to allocate more than 2 MDB_val?

madroach commented 5 years ago

No, you don't need to allocate more than 2 MDB_val. You think about statically allocating them? That would be a good workaround, but it won't work for multithreading.

madroach commented 5 years ago

This is how much I can split up the changes into separate commits without going through too much pain.

For your convenience you may view the api documentation for 88f583c online.

As I wrote above, the failing tests on Linux are due to a bug in lmdb, fixed by f8ce8a82 since lmdb 0.9.21. The OSX travis build uses the current lmdb 0.9.23 and passes with no failing tests.

madroach commented 5 years ago

Hi, I'd like to remind you of this PR. When you have an idea which changes you'd like to merge and which to drop or postpone I would be glad to prepare the appropriate commits. as before, here is the latest api documentation.

Drup commented 5 years ago

Sorry, I'm a bit overworked at the moment, I'll try to take the time to look at it in the upcoming week. While it may take a bit of time, my general intention is to merge this and make you co-maintainer, I just want to properly look at it before. ;)

Drup commented 5 years ago

Ok, you seem to have a good vision on how to push things forward and I think I like most of the changes, so let's take a leap of faith! I'll merge and give you commit bit.

Before we do a new release, I would really like to try to make an effort to simplify the API. The current version of the permission system is, imho, too involved. Be mindful of the classes of bug that really happen in practice. In particular, I do not think permissions on envs are actually useful. Additionally, a personal rule of thumb, is that you should minimize the number of phantom type parameters, and absolutely avoid having more than 1.

If we can't simplify it, so be it, but let's try. :p

Drup / ocaml-lmdb

Get rid of ctypes, use polymorphic interface #16