alepharchives / emdb

EMDB is a NIF library for the Memory-Mapped Database (MDB).

Is it really safe to read pointers from get after the txn has finished? #2

Open · msackman opened 11 years ago

msackman commented 11 years ago

Just been reading the code. In your c_src drv, in get, you create a txn, do the get, and then abort the txn. The pointers to the data returned from the get are then copied into the ErlNifBinary, which is then sent back to Erlang-land.

It's not clear from the mdb docs, but I'd be surprised if those pointers are safe after the txn has been aborted. Having read various bits about how mdb works: once the txn has finished, there's nothing to stop someone else coming in and modifying those locations, and given how mdb reuses pages, those locations could well end up holding utterly different key-value pairs.

I think you really have to expose the whole txn API within Erlang, and if the values need to exist outside the scope of the txn then you're going to have to copy. I think, anyway; I could be wrong though.
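
For concreteness, a minimal sketch in C of the copy-before-abort pattern (the globals g_env/g_dbi, the atom names, and the trimmed error handling are all illustrative, not emdb's actual code):

```c
#include <string.h>
#include "erl_nif.h"
#include "lmdb.h"

/* Illustrative globals; emdb's real code is organised differently. */
static MDB_env *g_env;
static MDB_dbi g_dbi;

static ERL_NIF_TERM emdb_get(ErlNifEnv *env, int argc, const ERL_NIF_TERM argv[])
{
    MDB_txn *txn;
    MDB_val key, val;
    ErlNifBinary keybin, out;

    if (argc != 1 || !enif_inspect_binary(env, argv[0], &keybin))
        return enif_make_badarg(env);

    if (mdb_txn_begin(g_env, NULL, MDB_RDONLY, &txn) != 0)
        return enif_make_atom(env, "error");

    key.mv_size = keybin.size;
    key.mv_data = keybin.data;

    if (mdb_get(txn, g_dbi, &key, &val) != 0) {
        mdb_txn_abort(txn);
        return enif_make_atom(env, "not_found");
    }

    /* Copy out of the mmap while the txn is still live: val.mv_data
     * is only guaranteed valid until the txn ends. */
    enif_alloc_binary(val.mv_size, &out);
    memcpy(out.data, val.mv_data, val.mv_size);

    mdb_txn_abort(txn);  /* safe now that the data has been copied */

    return enif_make_binary(env, &out);
}
```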

I also have some concerns about the memory management of those binaries. The value locations should not be freed by Erlang when the binary is GC'd, since they're pointers straight into the mmap used by mdb; instead they should just be forgotten about. I think you'd have to use enif_make_resource_binary for that, so you can specify a no-op dtor. Again, I could be wrong; I'm curious whether you've considered these issues and found the code safe as written.
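
And a sketch of the zero-copy alternative, combining the resource-binary idea with leaving the txn open (all names here are illustrative; TXN_RES_TYPE would be registered with enif_open_resource_type() in the NIF's load callback):

```c
/* Illustrative sketch, not emdb code. */
static ErlNifResourceType *TXN_RES_TYPE;

typedef struct { MDB_txn *txn; } txn_res;

/* Runs when the last reference -- including binaries created with
 * enif_make_resource_binary() -- has been garbage collected. */
static void txn_res_dtor(ErlNifEnv *env, void *obj)
{
    txn_res *r = obj;
    if (r->txn)
        mdb_txn_abort(r->txn);
}

/* Called after a successful mdb_get(txn, dbi, &key, &val). */
static ERL_NIF_TERM make_value_binary(ErlNifEnv *env, MDB_txn *txn, MDB_val *val)
{
    txn_res *r = enif_alloc_resource(TXN_RES_TYPE, sizeof(txn_res));
    r->txn = txn;  /* txn stays open; the dtor aborts it on GC */

    /* Zero-copy: the binary points straight into the mmap, and the VM
     * never tries to free that memory itself. */
    ERL_NIF_TERM bin = enif_make_resource_binary(env, r, val->mv_data, val->mv_size);
    enif_release_resource(r);  /* the binary term now holds the reference */
    return bin;
}
```

The catch is that the read txn, and the old pages it pins, then stays open until Erlang's GC collects the binary, which is a hazard of its own.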

hyc commented 11 years ago

I don't know Erlang, but it sounds to me like you're right. The pointers returned from MDB are only valid while their transaction is open. Note that in practice MDB doesn't reuse pages until at least 2 transactions later, but in a multithreaded app with heavy write activity it's still possible for the pages to be reused before the calling reader runs. As for the garbage collection: well, if you're forced to copy the data anyway it's a non-issue. But if you can arrange for the caller to use the data directly (by leaving the txn open until later), then you'd certainly want to prevent any built-in memory cleanup from touching those pointers.

msackman commented 11 years ago

In fact, having thought about this further, I don't think exposing the txn API over NIFs will work either; it'll have to be a fully linked-in driver.

The reason is that between NIF calls there's nothing to stop the Erlang VM changing which OS thread you're running on. I'd have thought this would be a common issue for all languages that use green threads of some kind (which is most high-level languages). There seem to be two solutions:

  1. You have a fully linked-in driver. The driver would then be responsible for spinning up threads and maintaining the mapping between an Erlang PID and a driver/txn thread. This will hurt performance, or add substantial code complexity if you go down the route of pools of worker threads etc. And even then, just the overhead of communicating between threads will likely hurt performance fairly substantially, especially for small transactions that should be very fast.
  2. MDB gets changed so that all you care about is the txn ID and you no longer care about the thread ID. It seems to me that this is the better solution, because all high-level languages would then be able to take advantage of it. The accounting table that currently maps thread IDs to database versions would change to map txn IDs to database versions. Presumably the reason for doing the former was so you can spot when a thread dies and do cleanup, but the consequence is substantial inflexibility in use, which is going to hurt a lot of language bindings.

What do you think?

hyc commented 11 years ago

That introduces quite a lot of new problems. E.g., right now we store the table slot in thread-local storage so that we can use it in future requests without taking any locks. If you break this txn-to-thread association, you'll no longer be able to do lockless readers. And of course you'll be unable to detect dead threads/abandoned txns. You'll also be unable to prevent API misuse, e.g. a single thread opening multiple txns.
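
A simplified illustration of that pattern (not LMDB's actual code): each reader thread claims a table slot once, under a lock, caches the pointer in thread-local storage, and thereafter publishes its snapshot with no locking at all. Key the table by txn ID instead and this fast path disappears:

```c
#include <pthread.h>

#define MAX_READERS 126

typedef struct {
    volatile unsigned long txn_id;  /* snapshot this reader pins */
    volatile int in_use;
} reader_slot;

static reader_slot reader_table[MAX_READERS];  /* shared table */
static pthread_mutex_t table_lock = PTHREAD_MUTEX_INITIALIZER;
static __thread reader_slot *my_slot;          /* per-thread cache */

static reader_slot *get_reader_slot(void)
{
    int i;
    if (my_slot)
        return my_slot;  /* fast path: no lock after the first call */

    pthread_mutex_lock(&table_lock);  /* slow path: first txn only */
    for (i = 0; i < MAX_READERS; i++) {
        if (!reader_table[i].in_use) {
            reader_table[i].in_use = 1;
            my_slot = &reader_table[i];
            break;
        }
    }
    pthread_mutex_unlock(&table_lock);
    return my_slot;  /* NULL if the table is full */
}
```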

msackman commented 11 years ago

Yeah, I wondered if there were such considerations.

I've long wrestled with the tradeoff between writing libraries that try to stop programmers doing dumb things and libraries that are as flexible as possible (and give the programmer more rope). These days I err on the side of giving out more rope and just documenting how things shouldn't be abused. To my mind, the dead-threads/abandoned-txns issue is a programmer error, and MDB shouldn't be trying to cope with such mistakes on the user's part. However, obviously you're welcome to disagree!

Obviously it's more work, but could you not have a slightly separate API for lockless read-only txns?

I thought a single thread could open multiple txns in the nested-txn case? Or is that a slightly different thing?

hyc commented 11 years ago

Yes, nested txns are just a convenience: only the bottom-most txn is actually active; it allows you to abort/roll back a sub-operation without affecting the parent. SQLite uses this feature.
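
A sketch with the real LMDB API (error handling omitted):

```c
/* Nested write txns: the child is begun with the parent as its
 * second argument and is the only active txn until it ends. */
static void nested_txn_demo(MDB_env *env)
{
    MDB_txn *parent, *child;

    mdb_txn_begin(env, NULL, 0, &parent);   /* top-level write txn */
    /* ... some writes in the parent ... */

    mdb_txn_begin(env, parent, 0, &child);  /* nested (child) write txn */
    /* ... a sub-operation ... */
    mdb_txn_abort(child);                   /* rolls back just the sub-operation */

    mdb_txn_commit(parent);                 /* the parent's own work still commits */
}
```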

hyc commented 11 years ago

In fact the current MDB code doesn't protect against either issue. But certainly as a future TODO item, we want to be able to flush dead reader locks in a running system. (Particularly important since MDB is also multi-process; you don't want to have to stop the world just because one poorly-written app crashed.)

hyc commented 11 years ago

It's late in the game to be changing the MDB API's default behavior, but I might be able to add an env flag to make the txn table behave as you suggest. In that case all read txns will have to lock the table to acquire a slot, so throughput will be lower and scalability across multiple CPUs will be poorer. Not sure if it's worth the cost.

msackman commented 11 years ago

Indeed, it's a tricky one. The only reason I'm pushing this is that otherwise, as I say, every language with a separation between logical user threads and OS threads will need a driver that maintains a mapping in order to work correctly with MDB, and that mapping will impose a performance hit.

Now, that mapping can't be thread-local, so at best it carries the same hit as doing it inside MDB would; worse, if it's done in the driver you end up with a non-thread-local mapping in the driver plus a thread-local mapping in MDB too. Better to have one mapping done right than several mappings implemented independently in each language driver.

That's a lot of languages: even some Java VMs with green threads will need this, not to mention Erlang, Haskell, and Lisps (probably including Clojure, though I'm not sure), and probably Ruby and Python too (I might be wrong on those last two; I haven't checked). That's a lot of duplicated effort in the drivers, and it means the driver itself ends up responsible for more of the performance characteristics than you'd probably like: I've seen several projects (including ones I've worked on) rubbished because people ran benchmarks against poorly written drivers. So allowing drivers to be kept simple seems to me like a good idea.

Obviously though, I appreciate that the main driving use case for MDB is the C implementation of OpenLDAP, and that has to take priority.

One point of confusion I currently have is with the multi-process support you mention. Presumably that's "it's opened once and then stays open for all forks", rather than completely unrelated processes opening the same db file and it somehow figuring out concurrent access? I can't see how the latter could be done, but then again I'm not a brilliant C programmer!

hyc commented 11 years ago

No; for multi-process, any number of processes can open the same DB file and work on it concurrently. This is required to allow you to run a hot backup while other things are operating. It also lets us do cool things like run multiple slapds on the same DB, each listening on different network interfaces (and with different security configs, etc.).
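
So, as a minimal sketch (error handling omitted, path illustrative), each process simply opens the environment independently, and LMDB coordinates them through the lock file it keeps alongside the data file:

```c
#include "lmdb.h"

/* Run this in any number of unrelated processes against the same
 * path; their read/write txns proceed concurrently. */
static MDB_env *open_shared_env(void)
{
    MDB_env *env;
    mdb_env_create(&env);
    mdb_env_open(env, "/var/db/mydb", 0, 0664);
    return env;
}
```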

hyc commented 11 years ago

I hear you, re: driver simplicity. We'll probably need to do this.

hyc commented 11 years ago

Hmm. This seems to be an age-old problem with green threads; OpenGL also uses thread-local storage. This Haskell paper goes through the issues: http://www.haskell.org/~simonmar/papers/conc-ffi.pdf

msackman commented 11 years ago

Anecdotally, I've used the Haskell OpenGL bindings extensively and have never had any issues with them; they're a really solid piece of work, to the extent that I hadn't even realised they used thread-local storage!

hyc commented 11 years ago

Added an MDB_NOTLS env option. Untested; it's in the notls branch on Gitorious. Please give it a whirl and leave feedback, thanks.
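
Usage would presumably look like this (a sketch based on the option's description: with MDB_NOTLS the reader slot is tied to the MDB_txn object rather than to the calling thread, so a green-threads runtime can resume a read txn on whatever OS thread it lands on, provided only one thread touches it at a time; error handling omitted, path illustrative):

```c
#include "lmdb.h"

static void notls_demo(void)
{
    MDB_env *env;
    MDB_txn *rtxn;

    mdb_env_create(&env);
    /* MDB_NOTLS: don't key reader table slots on thread-local storage */
    mdb_env_open(env, "/var/db/mydb", MDB_NOTLS, 0664);

    mdb_txn_begin(env, NULL, MDB_RDONLY, &rtxn);
    /* ... the txn may now continue on a different OS thread, as long
     * as only one thread uses it at a time ... */
    mdb_txn_abort(rtxn);
}
```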

hyc commented 11 years ago

The py-lmdb guys have tested the NOTLS option successfully, so it is now in mdb.master.