**Open** — mwotton opened this issue 10 years ago
Yeah, please send me the benchmark code. I've been away and doing a lot of conference stuff this summer so I've fallen off my regular commits, but this is definitely something that needs to be fixed.
Very curious that it occurs when a host has two daemons running; I'm wondering if it's a concurrency issue.
https://gist.github.com/mwotton/a589bdfd73b9e7d87c51 and the python https://gist.github.com/mwotton/ef20e2c3575a1f80504a
I'm manually using your ./scripts/start-hyperdex.sh script and just commenting out some of the hyperdex daemon lines. Run the actual benchmark with
```
REPS=100 ./hyhac-benchmarks hyperdex
```
you might want to comment out the cassandra and sqlite stuff unless you're keen on getting a comparison.
I do see exactly what you're saying though, that the estimated time goes from tens of seconds to thousands, as much as 3500+ when I start 5 daemons.
I think the difference between the Haskell and Python examples is that the Haskell one is building a very large list of lazy thunks that consumes a great deal of memory. For the Haskell library, I defaulted to async operations. In fact, in a way they're "doubly asynchronous": most of the operations return an `IO (IO a)`. The outer `IO` action sends the request, the inner one waits for its completion.
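A toy sketch of that shape (the names here are hypothetical stand-ins, not hyhac's actual API) shows both styles — deferring every inner action versus `join`ing it immediately:

```haskell
import Control.Monad (join)
import Control.Concurrent.MVar (newMVar, readMVar)

-- Hypothetical stand-in for an async operation: the outer IO "sends"
-- the request, the returned inner IO "waits" for its completion.
asyncPut :: String -> IO (IO Bool)
asyncPut _row = do
  done <- newMVar True        -- pretend the request completed instantly
  return (readMVar done)      -- inner action: wait for the result

main :: IO ()
main = do
  -- Async style: send 100 requests first, accumulating 100 pending
  -- inner actions (thunks) before waiting on any of them.
  pendings <- mapM asyncPut (replicate 100 "row")
  results  <- sequence pendings
  -- Sync style: join fuses send-and-wait into a single action.
  one <- join (asyncPut "row")
  print (length results, one)
```

In the benchmark, the async style is what builds up the long list of pending inner actions.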
When I change the code like so, I get much better performance:
```haskell
bench "hyperdex" $ finish
  (\(x,_,_) ->
    join $ -- <- this makes the put operation synchronous, joining the inner and outer IO
      H.put client "phonebook" x $!!
        [H.mkAttribute "content" lastname])
  (\actions -> do
    let failures = lefts actions
    when (failures /= []) $
      error ("failure in hyperdex: " ++ show failures)
    return ())
```
The result of that change is that the entire benchmark goes to a mean of 27ms, or 2.7 seconds for 100 iterations. As for why your original example causes severe performance issues when going to multiple daemons, I do not know. I'm going to experiment further.
I think I am a little closer to understanding the issue. I am not certain, and my debugging/performance-analysis-fu is not strong enough to confirm it, but it might be related to contention on the `HyperClient` objects stored internally when multiple daemons are used: there may be additional pauses, and the locks may be held longer, when there are multiple daemons. HyperDex requires that all access to the `HyperClient` be made thread-safe through locking, which I implemented with `MVar`s.
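A minimal sketch of that locking scheme (types simplified; `HyperClient` here is a dummy placeholder, not the real foreign pointer):

```haskell
import Control.Concurrent.MVar (MVar, newMVar, withMVar)

-- Dummy stand-in for the foreign HyperClient handle.
data HyperClient = HyperClient

-- One MVar per connection serializes every FFI call on that client.
newtype Client = Client (MVar HyperClient)

connectDummy :: IO Client
connectDummy = Client <$> newMVar HyperClient

-- Every operation funnels through this single lock, so a slow loop
-- iteration while the MVar is held makes all other callers queue up.
withClient :: Client -> (HyperClient -> IO a) -> IO a
withClient (Client m) = withMVar m

main :: IO ()
main = do
  c <- connectDummy
  withClient c $ \_ -> putStrLn "exclusive access to the client"
```

With several daemons, any extra time spent inside `withClient` translates directly into other callers waiting on the lock.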
When I change your benchmark code to use a connection pool of multiple clients, instead of making every `put` synchronous, I actually get even better performance than before for one to five daemons. The mean time drops to 16ms from 27ms in both cases.
I added a dependency on `resource-pool` and imported `Data.Pool`, then added this line in the setup part of your `main`:

```haskell
hyhacPool <- createPool (H.connect H.defaultConnectInfo) H.close 4 0.5 10
```
And then changed your benchmark code to:
```haskell
bench "hyperdex" $ finish
  (\(x,_,_) ->
    withResource hyhacPool $ \client ->
      H.put client "phonebook" x $!!
        [H.mkAttribute "content" lastname])
  (\actions -> do
    failures <- lefts <$> sequence actions
    when (failures /= []) $
      error ("failure in hyperdex: " ++ show failures)
    return ())
```
With that, no matter how many daemons, performance was superb.
_Edit_: Well, until I ran the benchmark again and hyhac reported a slew of failures in the form of `HyperclientGarbage` responses. :(
With the help of ThreadScope, I think I've narrowed it down to me being a victim of my own premature optimization. When the backoff setting is `BackoffExponential`, what I saw in ThreadScope was a very large number of 0.5-second thread delays. I had not realized that this blocks the entire GHC thread, which I'm guessing caused a great deal of contention/latency in handling requests. Since I'm not using any concurrency internally (no `forkIO` or `par`), it's possible the `threadDelay` was affecting more than just the Haskell code, and was preventing the HyperDex client from receiving information from a daemon.
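A reconstruction of the pathology (this is an illustrative poll loop, not hyhac's actual `Backoff` code): every missed poll doubles a `threadDelay`, so a reply that arrives mid-sleep waits out the remainder of the delay before anyone polls again.

```haskell
import Control.Concurrent (threadDelay)
import Data.IORef (newIORef, atomicModifyIORef')

-- Exponential-backoff polling: sleep between misses, doubling up to a
-- 0.5 s cap. The threadDelay parks this Haskell thread entirely, so a
-- result arriving just after the sleep starts is not seen until the
-- whole delay elapses.
pollWithBackoff :: IO (Maybe a) -> IO a
pollWithBackoff poll = go 1000                  -- start at 1 ms (in microseconds)
  where
    go delay = do
      r <- poll
      case r of
        Just a  -> return a
        Nothing -> do
          threadDelay delay
          go (min 500000 (delay * 2))           -- cap at 0.5 s

main :: IO ()
main = do
  n <- newIORef (0 :: Int)
  -- A fake event source that "completes" on the third poll.
  let poll = do k <- atomicModifyIORef' n (\k -> (k + 1, k + 1))
                return (if k >= 3 then Just k else Nothing)
  pollWithBackoff poll >>= print
```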
That is my working theory at least, and it seems to be borne out by the result that even the original benchmark no longer suffers from the performance issue you saw.
I cannot explain, at this time, why I saw 16ms average times on a few initial runs of the code using a resource pool, but since I am running the benchmarks in a virtual machine it could simply be variance by virtue of not running on bare metal.
Can you explain to me the idea behind the `Backoff`? `hyperdex_client_loop` blocks automatically for us until something new is available, does it not?
I believe `hyperdex_client_loop` is non-blocking, in the sense that if there is nothing in the loop queue then no result will be returned. `Backoff` is a premature optimization on my part to ensure that I am not hitting `hyperdex_client_loop` too hard when trying to synchronously interact with the client library. It's premature because blocking the thread causes more problems than it solves.
From http://hyperdex.org/doc/latest/CAPI/#sec:api:c:client:
> It’s always possible to use the library in a synchronous manner by immediately following every operation with a call to hyperdex_client_loop.
and
> HYPERDEX_CLIENT_TIMEOUT — The hyperdex_client_loop operation exceeded its timeout without completing an outstanding operation. This does not affect the status of any outstanding operation.
This looks to me like `hyperdex_client_loop` will block until some event happens or the timeout expires.
I always set timeout to 0 because I don't want the GHC runtime to be stuck spinning on some foreign function that it can't escape from.
> I always set timeout to 0 because I don't want the GHC runtime to be stuck spinning on some foreign function that it can't escape from.
I don't think this is necessary. Have a look at http://hyperdex.org/doc/latest/CAPI/#sec:api:c:client:signals:

> 10.1.12 Working with Signals
>
> The HyperDex client library provides a simple mechanism to cleanly integrate with applications that work with signals.
>
> Your application must mask all signals prior to making any calls into the library. The library will unmask the signals during blocking operations and return HYPERDEX_CLIENT_INTERRUPTED should any signals be received.

`hyperdex_client_loop` writes a `hyperdex_client_returncode*`, so we will get `HYPERDEX_CLIENT_INTERRUPTED` in case of a signal.

This means we can use `foreign import ccall interruptible`, and so keep the timeout while still being interruptible.
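Concretely, the suggestion amounts to an import like the following. This is a sketch: the Haskell-side names and the exact marshalled types are assumptions based on the C API (`int64_t hyperdex_client_loop(struct hyperdex_client*, int timeout, enum hyperdex_client_returncode*)`), and it only links against the real HyperDex client library.

```haskell
{-# LANGUAGE InterruptibleFFI #-}
module LoopFFI where

import Data.Int (Int64)
import Foreign.Ptr (Ptr)
import Foreign.C.Types (CInt (..))

-- Opaque handle for the C client struct (name assumed).
data HyperdexClient

-- `interruptible` behaves like `safe`, except that an asynchronous
-- exception (killThread, System.Timeout.timeout, Ctrl-C) interrupts
-- the blocked OS thread with a signal; hyperdex_client_loop then
-- reports HYPERDEX_CLIENT_INTERRUPTED via the returncode pointer and
-- control returns to Haskell promptly, even with a non-zero timeout.
foreign import ccall interruptible "hyperdex_client_loop"
  c_hyperdex_client_loop
    :: Ptr HyperdexClient   -- client
    -> CInt                 -- timeout in milliseconds
    -> Ptr CInt             -- out: hyperdex_client_returncode
    -> IO Int64             -- completed operation's id, or negative on error
```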
I don't think we're considering the same issue, and perhaps that's because right now every function import is marked `safe`. That may change soon, because nothing in the HyperDex API will call back into Haskell. Marking these functions as `unsafe` might improve performance, but it bears a cost: foreign functions marked `safe` do not block a capability, but have a higher fixed overhead; foreign functions marked `unsafe` block an entire capability, but have a lower fixed overhead. If I have a timeout of 0, then there are lots of yield points available to the RTS even with an `unsafe` call to `hyperdex_client_loop`. It also means the time slice given to the Haskell thread will be fairer.

Tuning the timeout parameter and the call parameters is something that will have to be done experimentally. It could be, for all I know, that `safe` calls and a non-zero timeout cause the least contention, because they spawn the fewest new OS threads and the cost of transferring Haskell threads to new capabilities is low enough. It could be that they cause the greatest contention, if the fixed overhead of the calls and the behind-the-scenes implementation of the HyperDex library cause contention between its thread and GHC's.
`safe` calls don't block the Haskell RTS.

Unless your DB call takes less than a microsecond (NEVER :) ), do not use `unsafe` FFI calls. Please. Don't. Please.

The overhead of a safe FFI call is ~250ns, which is negligible.
Please take all of the following with a grain of salt; I don't know this half as well as I should, so take the statements of fact that follow as merely what I believe is happening.

The DB call isn't actually a DB call, which is what makes this situation peculiar. The HyperDex library, as near as I can tell, never actually does anything synchronously. Instead, the library is a shim around an asynchronous queue that is polled by `hyperdex_client_loop`. The call to `hyperdex_client_loop` doesn't hit the network, only the local queue maintained by the library; I believe the whole thing wraps around BusyBee. Because of this, setting a timeout on `hyperdex_client_loop` amounts to a busy wait. During that busy wait, even with a `safe` FFI call, we are going to cause phenomenal contention for the CPU. Given this, I fully expect that `hyperdex_client_loop` will return in less than 1 microsecond for most calls.
Long answer coming :)
> I don't think we're considering the same issue and perhaps that's because right now every function import is marked safe.
What I meant was that if we use a timeout of 0, `hyperdex_client_loop` returns immediately and we get a spinning check. You address it with the `Backoff`, but that gives us this problem: if a HyperDex event appears in the middle of a backoff sleep, we will not wake up until the backoff is over (so we get latency). If we used `hyperdex_client_loop` in a blocking fashion, this problem would go away, because it would wake up immediately. Your concern that HyperDex might also use a spinning lock inside is valid, but it does not seem to: it uses `epoll` inside, so the lock will not spin using CPU (I clarified that on IRC; see the chat log further down).

What remains is your other concern that GHC can't escape from a blocking call in general: e.g. if you throw `killThread` to a thread that is in a foreign call, or use the `timeout` function on it, the thread will only die after the foreign call is done (which can be very long for a syscall that waits for some event), so that's undesirable. This problem only exists if we use `safe` or `unsafe` foreign imports; if we use `interruptible` (what I suggest), then `killThread`/`timeout`/any async exception will immediately terminate the foreign call. `interruptible` must only be used on foreign functions that implement proper signal handling by checking for `EINTR`, but `hyperdex_client_loop` does, so it fulfills this requirement.

By removing `Backoff` and marking `hyperdex_client_loop` `interruptible`, we can achieve all of these goals.
> The overhead of a safe ffi call is ~ 250ns, which is negligible.
On my system, the overhead is even less (3-year-old Core i5):

- safe call: 7ns
- unsafe call: 90ns

I made this gist to measure it.
> Because of this, setting a timeout on hyperdex_client_loop amounts to a busy wait
According to @rescrv, there is no busy wait (`epoll` is used):
```
nh2: I disconnected, in case anybody answered my question
rescrv: nh2: I was waiting for you to rejoin to answer.
rescrv: I'm reading the thread now.
rescrv: I'm not sure where Aaron is getting the busy wait information, but it sounds like he's using hyperdex_client_loop incorrectly.
rescrv: hyperdex_client_loop will wait within epoll_wait (or appropriate substitutes on non-linux platforms) and doesn't busy wait
rescrv: but because he's passing a timeout of 0, it's effectively polling and never blocking, leaving him to wait on everything. The only way I can figure he would do so is with a sleep.
rescrv: Your comment is correct, but you'll need to block all signals if the haskell runtime does indeed generate them. We'll unblock in every poll. Everywhere else, we have real work to do and can proceed uninterrupted until we return to your application.
rescrv: does that make sense?
rescrv: if need-be, we can add a "int hyperdex_client_poll(...)" call so that you can only call loop when hyperdex_client_loop(timeout=0) will do some work. That way you won't ever block, and can integrate it into other event loops
rescrv: blackdog: since it's your bug, you may want to see the above too
nh2: rescrv: yes, that makes a lot of sense
rescrv: we also can selectively mask signals so you get HYPERDEX_CLIENT_INTERRUPTED for only a subset of signals
rescrv: It all depends on what you need to work with the haskell runtime
rescrv: That's an area where I'll have to defer to others
nh2: rescrv: I don't think you need to add anything. When you compile with -threaded in GHC and fork a thread, that thread should not get GHC's periodic timer signals (afaik only the main thread gets it), so we can safely do the foreign call.
rescrv: are you using a background thread for all interaction (using a queue of some sort), or just for loop?
nh2: rescrv: I'm not sure if the current implementation does that, but it should be straightforward
rescrv: good to hear. getting the haskell bindings up to par with others would be good
rescrv: nh2 blackdog: Here's what I use to auto-generate tests for various operations: https://github.com/rescrv/HyperDex/blob/master/maint/generate-bindings-tests.py
rescrv: because of the structure of the code (e.g., put, atomic_add, etc use the same code underneath), it's easy to test a good portion of bindings with those test cases.
rescrv: It's kind of a hack, but I generate java, ruby, python all from the code at the end of that file
nh2: the other cases are those where we want to be able to interrupt hyperdex_client_loop (e.g. when we want to kill a Haskell thread with threadKill). That's what that "foreign ccall interruptible" I raised is about. If you use a normal Haskell foreign call, and you use threadKill or, say, Ctrl-C in the meantime, the handling of that will be deferred until the foreign call returns (which can lead to your Ctrl-C only taking effect much later than you would like). With an interruptible call, Haskell will immediately interrupt the called C code, but it requires that the C function can deal with being aborted and tell that it was
rescrv: It's not meant to test all functionality, but gets you 99% to a good binding.
nh2: I think this is given because hyperdex_client_loop can notify us that it was interrupted with HYPERDEX_CLIENT_INTERRUPTED
rescrv: nh2: just make sure to set the signal mask before going into loop
nh2: rescrv: ah, test generation, I was recently wondering if you have that. Great that you do!
rescrv: Based on testing of the different cases we had trouble with, it covers enough that I'm happy with it for now. If you find any other cases while working on HyHac, I'm happy to merge them in.
nh2: that's nice
rescrv: eventually I'd like to merge hyhac to bindings/haskell in the hyperdex repo if all contributors permit (it can still develop out of tree, but distributing the bindings with the code is a big plus).
nh2: good idea. Probably that even allows us to set up a combined Travis CI job
rescrv: that'd be welcome. We run a private buildbot, mainly because I haven't taken the time to figure out how to lock it down
nh2: do I guess that right that HYPERDEX_CLIENT_INTERRUPTED is mainly a forward of EINTR from epoll/select/whatever?
rescrv: that's the intent
rescrv: the logic for passing up an INTERRUPTED is just in place around epoll/whatever, so you need to mask everywhere else.
nh2: rescrv: I still don't exactly get what benefit you get from masking all signals before calling into hyperdex (or what problems you get if not). Can you explain a bit more?
rescrv: So I don't remember the rationale 100%, but it boiled down to reducing the complexity of the implementation. Too many calls could fail with EINTR, and IIRC there was ambiguity about how different platforms would return it.
rescrv: To reduce that complexity, I decided that either apps would treat signals as fatal (e.g., neither side does anything because it's assumed that signals are fatal), apps would not send signals to the HyperDex thread, or apps would mask signals and we would return EINTR.
rescrv: I think at least part of it was that there were cases where a signal would divide two syscalls that had to happen, and adding complexity to defer the second would have complicated things.
rescrv: I don't know that it's still the case, so I'm willing to reconsider our signal handling if we move to a truly better position.
rescrv: even if it's to just put the mask inside each call
blackdog: rescrv: I'd be happy to have the haskell bindings in the main repo.
rescrv: blackdog: me too. I just want to fix any perf issues. We merged the Go bindings a little early, and I don't want to make that mistake twice.
blackdog: yes, of course :) just giving permission for my code (though I think it's BSD licensed anyway)
blackdog: but yeah, i think ghc's runtime uses signals quite heavily
nh2: blackdog: this page summarizes it nicely if you are interested (and in case you don't know it already): https://ghc.haskell.org/trac/ghc/wiki/Commentary/Rts/Signals
blackdog: nh2: ah, that's a good resource. cheers.
nh2: rescrv: ok. Still a few questions:
nh2: how do you mean "Too many calls could fail with EINTR"? I mean this can only happen when a signal comes in - where should that signal come from?
rescrv: I meant that all the code behind hyperdex_client* is prone to using syscalls that could fail. At the time I made the decision to mask like we do, there were many cases that would have been complicated by checking EINTR and having to fast fail. The alternative would have been to hide future errors or drop eintr
nh2: rescrv: but they would only fail with EINTR if interrupted by a signal, right? In a normal C/C++ application, where should those signals come from?
rescrv: nh2: From the application using HyperDex. We register no signals and generate no signals. But it would be an imposition on applications to forbid signals.
rescrv: So we wrote code that can handle signals with a small amount of application effort for only those apps that need to handle them.
nh2: rescrv: ah, so you mean "handling signals inside the hyperdex lib code is difficult so we prefer if applications mask them before calling us so that the cases in which we don't handle them cannot arise"?
nh2: "and we have some code (like around epoll) where we do handle them, in those places we unmask them ourselves"?
rescrv: nh2: 100% correct
rescrv: and it made sense at the time, but I'm willing to reconsider making the whole thing signal-safe so that masking is unnecessary
rescrv: the original code when we made this decision was much more complex, and it made sense
rescrv: now I think it may not make sense.
nh2: rescrv: yes, I think that in the long term it makes sense to make the library signal-safe
nh2: rescrv: another question: I don't understand yet what you meant with "either apps would treat signals as fatal (e.g., neither side does anything because it's assumed that signals are fatal)". 1) for the calling application, it can still decide if signals are fatal or not (hyperdex has no real effect on that, or does it?), and 2) how can the hyperdex side assume they are "fatal" given that they are now blocked before it's called (so it will never see any signal it doesn't unmask). Or did that mean "If the side calling hyperdex lib code allowed a signal to come through to hyperdex, that would probably be fatal [for hyperdex]? (because it wouldn't handle the signal and would do something arbitrary or terminate)"
rescrv: I guess a better way to have said that was, "If you don't mask signals in the way we suggest, then you'll get what you get (including occasional errors)." I extrapolated that to mean that the only meaningful result would be to have the signals be fatal.
rescrv: This conversation has sparked me to look into making the library signal safe and leaving it at that
```
So it looks like we don't need `Backoff` and can safely use the timeout provided by `hyperdex_client_loop`.
> safe call: 7ns
> unsafe call: 90ns
Forgive me, but I am suspicious of unsafe calls having greater overhead than safe.
As I've said before, I don't think `Backoff` is necessary anymore no matter what I do. I intend to rigorously test a variety of things before definitively coming to that conclusion, however.
> Forgive me, but I am suspicious of unsafe calls having greater overhead than safe.
I'm sorry, I flipped the order. I meant:

- safe call: 90ns
- unsafe call: 7ns
Ah, I was wondering if GHC was being particularly tricky with your implementation and had done some optimization that made subsequent calls faster. One can never tell with GHC these days without looking at the assembly. :)
As an update, @nh2: one reason to have a `Backoff`¹ would be to prevent starving other threads trying to work with the same HyperDex connection. The `hyperdex_client_loop` function has to be synchronized. Any non-zero wait time means it will likely loop at least once using `epoll`, as you suggested, but we'll still be holding the `MVar` for the client.
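A toy demonstration of that starvation concern (dummy code, not hyhac's): while one thread holds the connection `MVar` through a blocking call, every other thread on the same connection queues behind it.

```haskell
import Control.Concurrent

main :: IO ()
main = do
  conn <- newMVar ()              -- stands in for the per-client lock
  done <- newEmptyMVar
  -- Thread A: holds the lock for 100 ms, as if hyperdex_client_loop
  -- were called with a non-zero timeout while the MVar is held.
  _ <- forkIO $ withMVar conn $ \_ -> threadDelay 100000
  -- Thread B: wants the same client, so it must wait out A's timeout.
  _ <- forkIO $ withMVar conn $ \_ ->
         putMVar done "second caller finally acquired the client"
  takeMVar done >>= putStrLn
```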
I'm currently creating a new `Internal.Connection.Core` and associated operations that will provide a single place to implement different concurrency schemes and centralize the implementation of the wrapper around the BusyBee-esque event loop of HyperDex. The Admin and Client modules will then just implement a type class and export specialized versions of connect/disconnect/etc. Everything will be appropriately marked `INLINABLE` or `INLINE` as necessary.
1 - Of course, not with its current implementation. As discussed above, the current implementation is less than ideal in its behavior.
I'm bumping this to milestone 0.13: there are non-performance-related issues that will be solved first. If I'm lucky, the performance right off the bat with `Internal.Connection.Core` will be stellar and nothing will need to change.
Pushing this back because it's blocked on milestone 0.12.
I have implemented a lock-free¹ design in the unified-backend commit 96476b03bd794905082575ce08a1968137ec12ec. You can see there that instead of an `MVar`, there is now a `hyhacLoop` that mirrors `loop`. There are a few reasons for this:
1. Resource management is strangely easier to reason about. At least, it is for me, though I might need to document more thoroughly how it works. The test version in `./sandbox` is a prototype that runs against an emulated HyperDex client library. The tl;dr:

   i. The HyperDex call begins with defining a setup (preamble), a C call that requires a pointer (and uses pointers allocated in the preamble), and a callback. The preamble uses `ResourceT` to perform allocations, and the callback does too. The callback's `peek` methods also live in `ResourceT`, and responsibly clean themselves up after each call.

   ii. This call is wrapped using `wrapDeferred` or `wrapIterator`, which specify how to handle the callback. This wrapper actually rewrites the call and generalizes it. It also establishes the communication channel between the callback function and the async result value or stream. It then pushes that `Wrapped` call onto a queue to process. The wrapper also registers finalizers in `ResourceT` so that if something blows up, a sane return value bubbles up in the client.

   iii. The `hyhacLoop` will read in the call and invert the `ResourceT`. This Ouroboros maneuver with `tryUnwrapResourceT` allows the loop to take ownership of resources allocated in the preamble of the original call. Those allocations are safe to use as long as the loop lives. When the HyperDex `loop` returns a handle, `hyhacLoop` will synchronously call the registered callback, which pushes data back to the client and determines if the handle expires.

2. … `hyperdex_client_loop()`. Also, the latest versions of HyperDex expose a file descriptor that can be `select`ed or `epoll`ed or `kqueue`ed, so eventually Hyhac will be event-driven instead of poll-driven. That should improve performance.

3. `Core` can be reused by `Client` and `Admin`. A minor victory, but since implementing it correctly once is hard enough, I'd like to get it right in one place and not two.

1 - Really, absolutely not true. Completely false. GHC uses plenty of locks, but the important thing is that only one thread in Hyhac will access a given pointer at a time.
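The queue-and-single-loop-thread idea above can be modeled in a few lines (a toy model with invented names; the real `hyhacLoop` wraps the C event loop and the `ResourceT` machinery described above):

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.Chan (Chan, newChan, readChan, writeChan)
import Control.Concurrent.MVar (MVar, newEmptyMVar, putMVar, takeMVar)
import Control.Monad (forever, join)

-- Submit a "wrapped call": the loop thread alone runs the call and its
-- callback, so no lock is needed around the (emulated) client state.
-- The caller gets back an async result box to wait on, mirroring the
-- outer-IO/inner-IO split of the hyhac API.
submit :: Chan (IO ()) -> IO a -> IO (MVar a)
submit q call = do
  box <- newEmptyMVar
  writeChan q (call >>= putMVar box)   -- call + "callback" as one unit
  return box

main :: IO ()
main = do
  q <- newChan
  -- The single loop thread: drains the queue, one wrapped call at a time.
  _ <- forkIO $ forever (join (readChan q))
  r1 <- submit q (return (41 :: Int))
  r2 <- submit q (return 1)
  a <- takeMVar r1
  b <- takeMVar r2
  print (a + b)
```

Because only the loop thread ever runs the submitted actions, callers never contend on a client lock; they only wait on their own result boxes.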
I have a benchmark using hyhac and I'm getting some weird performance bugs.
When I have one daemon on my laptop, it costs me 23ms to insert 100 11k documents, which is perfectly adequate to my needs. However, when I add a second daemon on the same box, performance drops by 500 times (criterion tells me it's going to take 1600s where with one daemon it takes 30s for the whole test).
I've run some python tests and it works out roughly the same (sometimes faster) with two daemons than with one. Any ideas? My bench code is pretty rough, but I can put it up if you need it.