AaronFriel / hyhac

A HyperDex Haskell Client
BSD 3-Clause "New" or "Revised" License

odd performance #6

Open mwotton opened 10 years ago

mwotton commented 10 years ago

I have a benchmark using hyhac and I'm getting some weird performance bugs.

When I have one daemon on my laptop, it costs me 23ms to insert 100 11k documents, which is perfectly adequate for my needs. However, when I add a second daemon on the same box, performance drops by 500 times (criterion tells me it's going to take 1600s, whereas with one daemon the whole test takes 30s).

I've run some python tests and it works out roughly the same (sometimes faster) with two daemons than with one. Any ideas? My bench code is pretty rough, but I can put it up if you need it.

AaronFriel commented 10 years ago

Yeah, please send me the benchmark code. I've been away and doing a lot of conference stuff this summer so I've fallen off my regular commits, but this is definitely something that needs to be fixed.

It's very curious that it only shows up when a host has two daemons running; I'm wondering if it's a concurrency issue.

mwotton commented 10 years ago

https://gist.github.com/mwotton/a589bdfd73b9e7d87c51 and the python https://gist.github.com/mwotton/ef20e2c3575a1f80504a

I'm manually using your ./scripts/start-hyperdex.sh script and just commenting out some of the hyperdex daemon lines. Run the actual benchmark with

REPS=100 ./hyhac-benchmarks hyperdex

You might want to comment out the cassandra and sqlite stuff unless you're keen on getting a comparison.

AaronFriel commented 10 years ago

I do see exactly what you're saying, though: the estimated time goes from tens of seconds to thousands, as much as 3500+ seconds when I start 5 daemons.

I think the difference between the Haskell and Python examples is that the Haskell one builds a very large list of lazy thunks that consumes a great deal of memory. For the Haskell library, I defaulted to async operations. In fact, in a way they're "doubly asynchronous": most of the operations return an IO (IO ...) value. The outer IO action sends the request; the inner one waits for its completion.
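For example, the two usage patterns look roughly like this (a sketch only: the Database.HyperDex module name and the ByteString key/value types are assumptions on my part; the H.put / H.mkAttribute / join usage mirrors the benchmark code quoted below):

  {-# LANGUAGE OverloadedStrings #-}
  import Control.Monad (join)
  import Data.ByteString (ByteString)
  import qualified Database.HyperDex as H

  example :: IO ()
  example = do
    client <- H.connect H.defaultConnectInfo
    let key, lastname :: ByteString
        key      = "some-key"
        lastname = "smith"
    -- Asynchronous: the outer action only sends the request ...
    wait <- H.put client "phonebook" key [H.mkAttribute "content" lastname]
    -- ... and the inner action waits for its completion.
    _result <- wait
    -- Synchronous: 'join' collapses send-and-wait into a single action.
    _result' <- join $ H.put client "phonebook" key [H.mkAttribute "content" lastname]
    H.close client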

When I change the code like so, I get much better performance:

               bench "hyperdex" $ finish (\(x,_,_) ->
                                           join $              -- <- this makes the put operation synchronous, joining the inner and outer IO
                                             H.put client "phonebook" x $!!
                                               [H.mkAttribute "content"  lastname])
                                         (\actions -> do
                                             let failures = lefts actions
                                             when (failures /= []) $
                                               error ("failure in hyperdex: " ++ show failures)
                                             return ())

The result of that change is that the entire benchmark drops to a mean of 27ms, or 2.7 seconds for 100 iterations. As for why your original example causes severe performance issues when going to multiple daemons, I do not know. I'm going to experiment further.

AaronFriel commented 10 years ago

I think I am a little closer to understanding the issue, though I am not certain and my debugging/performance-analysis-fu is not strong enough to confirm it. I think it might be related to contention on the HyperClient objects stored internally when multiple daemons are used: there may be additional pauses, and the locks may be held longer, when there are multiple daemons. HyperDex requires that all access to the HyperClient be made thread-safe through locking, which I implemented with MVars.

When I change your benchmark code to use a connection pool of multiple clients instead of making every put synchronous, like so, I actually get even better performance than before for one to five daemons: the mean time drops from 27ms to 16ms in both cases.

I added a dependency on resource-pool and imported Data.Pool, then added this line in the setup part of your main:

  hyhacPool <- createPool (H.connect H.defaultConnectInfo) H.close 4 0.5 10
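  -- (createPool arguments, per resource-pool's Data.Pool: the create and destroy
  --  actions, 4 stripes, a 0.5 second idle timeout, and at most 10 clients kept
  --  per stripe.)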

And then changed your benchmark code to:

               bench "hyperdex" $ finish (\(x,_,_) ->
                                           withResource hyhacPool $ \client ->
                                             H.put client "phonebook" x $!!
                                               [H.mkAttribute "content"  lastname])
                                         (\actions -> do
                                             failures <- lefts <$> sequence actions
                                             when (failures /= []) $
                                               error ("failure in hyperdex: " ++ show failures)
                                             return ())

With that, no matter how many daemons, performance was superb.

_Edit_: Well, until I ran the benchmark again and hyhac reported a slew of failures in the form of HyperclientGarbage responses. :(

AaronFriel commented 10 years ago

With the help of ThreadScope I think I've narrowed it down to me being a victim of my own premature optimization. With the backoff setting set to BackoffExponential, what I saw in ThreadScope was a very large number of 0.5-second thread delays. I had not realized that this blocked the entire GHC thread, which I'm guessing caused a great deal of contention/latency in handling requests. Since I'm not using any concurrency internally, no forkIO or par, it's possible the threadDelay was affecting more than just the Haskell code and was preventing the HyperDex client from receiving information from a daemon.

That is my working theory at least, and it seems to be borne out by the result that even the original benchmark no longer suffers from the performance issue you saw.

I cannot explain, at this time, why I saw 16ms average times on a few initial runs of the code using a resource pool, but since I am running the benchmarks in a virtual machine it could simply be variance by virtue of not running on bare metal.

nh2 commented 10 years ago

Can you explain to me what the idea behind the Backoff is? hyperdex_client_loop blocks automatically for us until something new is available, does it not?

AaronFriel commented 10 years ago

I believe hyperdex_client_loop is non-blocking, in the sense that if there is nothing in the loop queue then no result will be returned. Backoff is a premature optimization on my part to ensure that I am not hitting hyperdex_client_loop too hard when trying to synchronously interact with the client library. It's premature because blocking the thread causes more problems than it solves.
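For illustration, the backoff polling I'm describing looks roughly like this (a hypothetical sketch, not hyhac's actual code; loopOnce stands in for a timeout-0, non-blocking hyperdex_client_loop poll):

  import Control.Concurrent (threadDelay)

  pollWithBackoff :: IO (Maybe a) -> IO a
  pollWithBackoff loopOnce = go 1000                -- start at 1ms
    where
      cap = 500000                                  -- capped at 0.5s
      go delay = do
        result <- loopOnce                          -- returns immediately when the queue is empty
        case result of
          Just r  -> pure r
          Nothing -> do
            threadDelay delay                       -- blocks this Haskell thread outright
            go (min cap (delay * 2))                -- exponential backoff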

nh2 commented 10 years ago

From http://hyperdex.org/doc/latest/CAPI/#sec:api:c:client:

It’s always possible to use the library in a synchronous manner by immediately following every operation with a call to hyperdex_client_loop.

and

HYPERDEX_CLIENT_TIMEOUT The hyperdex_client_loop operation exceeded its timeout without completing an outstanding operation. This does not affect the status of any outstanding operation.

This looks to me like hyperdex_client_loop will block until some event happens or the timeout is over.

AaronFriel commented 10 years ago

I always set timeout to 0 because I don't want the GHC runtime to be stuck spinning on some foreign function that it can't escape from.

nh2 commented 10 years ago

I always set timeout to 0 because I don't want the GHC runtime to be stuck spinning on some foreign function that it can't escape from.

I don't think this is necessary. Have a look at http://hyperdex.org/doc/latest/CAPI/#sec:api:c:client:signals:

10.1.12 Working with Signals

The HyperDex client library provides a simple mechanism to cleanly integrate with applications that work with signals.

Your application must mask all signals prior to making any calls into the library. The library will unmask the signals during blocking operations and return HYPERDEX_CLIENT_INTERRUPTED should any signals be received.

hyperdex_client_loop writes a hyperdex_client_returncode*, so we will get HYPERDEX_CLIENT_INTERRUPTED in case of a signal.

This means we can use foreign import ccall interruptible, and so keep the timeout while still being interruptible.
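For concreteness, such an import might look like this (a sketch: the C prototype follows the documented int64_t hyperdex_client_loop(struct hyperdex_client*, int timeout, enum hyperdex_client_returncode*) signature, while the opaque client type and representing the returncode enum as a CInt are assumptions):

  {-# LANGUAGE InterruptibleFFI #-}
  module LoopFFI where

  import Data.Int (Int64)
  import Foreign.C.Types (CInt)
  import Foreign.Ptr (Ptr)

  -- Opaque stand-in for struct hyperdex_client.
  data HyperdexClient

  -- Interruptible: an async exception (killThread, timeout) signals the thread
  -- running the call, and the library reports HYPERDEX_CLIENT_INTERRUPTED.
  foreign import ccall interruptible "hyperdex_client_loop"
    c_hyperdex_client_loop
      :: Ptr HyperdexClient  -- client
      -> CInt                -- timeout in milliseconds
      -> Ptr CInt            -- out-parameter: hyperdex_client_returncode
      -> IO Int64            -- id of the completed operation (negative on error)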

AaronFriel commented 10 years ago

I don't think we're considering the same issue, and perhaps that's because right now every function import is marked safe. That may change soon, because nothing in the HyperDex API will call back into Haskell. Marking these functions as unsafe might improve performance, but it bears a cost: foreign functions marked safe do not block a capability but have a higher fixed overhead, while foreign functions marked unsafe block an entire capability but have a lower fixed overhead. With a timeout of 0, even unsafe calls to hyperdex_client_loop return immediately, so the RTS still gets plenty of yield points between calls. It also means that the time slice given to the Haskell thread will be more fair.

Tuning the timeout parameter and the call parameters is something that will have to be done experimentally. It could be, for all I know, that safe calls and a non-zero timeout cause the least amount of contention because they cause the fewest number of new OS threads to be spawned and the cost of transferring Haskell threads to new capabilities is low enough. It could be that they cause the greatest amount of contention, if the fixed overhead of the calls and the behind-the-scenes implementation of the HyperDex library causes contention between its thread and GHC's.

cartazio commented 10 years ago

Safe calls don't block the Haskell RTS.

Unless your DB call takes less than a microsecond (NEVER :) ), do not use unsafe FFI calls. Please. Don't. Please.

The overhead of a safe ffi call is ~ 250ns, which is negligible.

AaronFriel commented 10 years ago

Please take all of the following with a grain of salt; I don't know all of this half as well as I should. So take the statements of fact that follow as merely what I believe is happening.

The DB call isn't actually a DB call, which is what makes this situation peculiar. The HyperDex library, as near as I can tell, never actually synchronously does anything. Instead, the library is a shim around an asynchronous queue that is polled by hyperdex_client_loop. The call to hyperdex_client_loop doesn't hit the network, only the local queue which is maintained by the library. I believe the whole thing wraps around BusyBee. Because of this, setting a timeout on hyperdex_client_loop amounts to a busy wait. During that busy wait, even if we have a safe FFI call, we are going to be causing phenomenal contention for the CPU. Given this, I fully expect that hyperdex_client_loop will return in less than 1 microsecond for most calls.

nh2 commented 10 years ago

Long answer coming :)

I don't think we're considering the same issue and perhaps that's because right now every function import is marked safe.

What I meant was that if we use a timeout of 0, hyperdex_client_loop returns immediately and we get a spinning check. You address it with the Backoff, but that gives us this problem: if a hyperdex event appears in the middle of a backoff sleep, we will not wake up until the backoff is over (so we get latency). If we used hyperdex_client_loop in a blocking fashion, this problem would go away because it will wake up immediately. Your concern that hyperdex might also use a spinning lock inside is valid, but it does not seem so: it uses epoll inside, so the lock will not spin using CPU (I clarified that on IRC; see the chat log further down).

What remains is your other concern that GHC can't escape from a blocking call in general: e.g. if you throw killThread at a thread which is in a foreign call, or use the timeout function on it, the thread will only die after the foreign call is done (which can be very long for a syscall that waits for some event), so that's undesirable. This problem, though, only exists if we use safe or unsafe foreign imports - if we use interruptible (what I suggest), then killThread/timeout/any async exception will immediately terminate the foreign call. interruptible must only be used on foreign functions that implement proper signal handling by checking for EINTR, but hyperdex_client_loop does, so it fulfills this requirement.

By removing Backoff and marking hyperdex_client_loop interruptible, we can get all of these goals: no spinning, no added latency from backoff sleeps, and threads that can still be interrupted.


The overhead of a safe ffi call is ~ 250ns, which is negligible.

On my system, the overhead is even less (3-year-old Core i5):

safe   call:  7ns
unsafe call: 90ns

I made this gist to measure it.
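For reference, that kind of measurement can be re-created along these lines (a hypothetical sketch, not the actual gist): import the same trivial libm function both ways and benchmark it with criterion.

  {-# LANGUAGE ForeignFunctionInterface #-}
  import Criterion.Main

  foreign import ccall safe   "math.h sin" c_sin_safe   :: Double -> Double
  foreign import ccall unsafe "math.h sin" c_sin_unsafe :: Double -> Double

  main :: IO ()
  main = defaultMain
    [ bench "safe   call" $ whnf c_sin_safe   1.0
    , bench "unsafe call" $ whnf c_sin_unsafe 1.0
    ]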


Because of this, setting a timeout on hyperdex_client_loop amounts to a busy wait

According to @rescrv, there is no busy wait (epoll is used):

nh2: I disconnected, in case anybody answered my question
rescrv: nh2: I was waiting for you to rejoin to answer.
rescrv: I'm reading the thread now.
rescrv: I'm not sure where Aaron is getting the busy wait information, but it sounds like he's using hyperdex_client_loop incorrectly.
rescrv: hyperdex_client_loop will wait within epoll_wait (or appropriate substitutes on non-linux platforms) and doesn't busy wait
rescrv: but because he's passing a timeout of 0, it's effectively polling and never blocking, leaving him to wait on everything. The only way I can figure he would do so is with a sleep.
rescrv: Your comment is correct, but you'll need to block all signals if the haskell runtime does indeed generate them. We'll unblock in every poll. Everywhere else, we have real work to do and can proceed uninterrupted until we return to your application.
rescrv: does that make sense?
rescrv: if need-be, we can add a "int hyperdex_client_poll(...)" call so that you can only call loop when hyperdex_client_loop(timeout=0) will do some work. That way you won't ever block, and can integrate it into other event loops
rescrv: blackdog: since it's your bug, you may want to see the above too
nh2: rescrv: yes, that makes a lot of sense
rescrv: we also can selectively mask signals so you get HYPERDEX_CLIENT_INTERRUPTED for only a subset of signals
rescrv: It all depends on what you need to work with the haskell runtime
rescrv: That's an area where I'll have to defer to others
nh2: rescrv: I don't think you need to add anything. When you compile with -threaded in GHC and fork a thread, that thread should not get GHC's periodic timer signals (afaik only the main thread gets it), so we can safely do the foreign call.
rescrv: are you using a background thread for all interaction (using a queue of some sort), or just for loop?
nh2: rescrv: I'm not sure if the current implementation does that, but it should be straightforward
rescrv: good to hear. getting the haskell bindings up to par with others would be good
rescrv: nh2 blackdog: Here's what I use to auto-generate tests for various operations: https://github.com/rescrv/HyperDex/blob/master/maint/generate-bindings-tests.py
rescrv: because of the structure of the code (e.g., put, atomic_add, etc use the same code underneath), it's easy to test a good portion of bindings with those test cases.
rescrv: It's kind of a hack, but I generate java, ruby, python all from the code at the end of that file
nh2: the other cases are those where we want to be able to interrupt hyperdex_client_loop (e.g. when we want to kill a Haskell thread with threadKill). That's what that foreign ccall interruptible I raised is about. If you use a normal Haskell foreign call, and you use threadKill or, say, Ctrl-C in the meantime, the handling of that will be deferred until the foreign call returns (which can lead to your Ctrl-C only taking effect much later than you would like). With an interruptible call, Haskell will immediately interrupt the called C code, but it requires that the C function can deal with being aborted and tell that it was
rescrv: It's not meant to test all functionality, but gets you 99% to a good binding.
nh2: I think this is given because hyperdex_client_loop can notify us that it was interrupted with HYPERDEX_CLIENT_INTERRUPTED
rescrv: nh2: just make sure to set the signal mask before going into loop
nh2: rescrv: ah, test generation, I was recently wondering if you have that. Great that you do!
rescrv: Based on testing of the different cases we had trouble with, it covers enough that I'm happy with it for now. If you find any other cases while working on HyHac, I'm happy to merge them in.
nh2: that's nice
rescrv: eventually I'd like to merge hyhac to bindings/haskell in the hyperdex repo if all contributors permit (it can still develop out of tree, but distributing the bindings with the code is a big plus).
nh2: good idea. Probably that even allows us to set up a combined Travis CI job
rescrv: that'd be welcome. We run a private buildbot, mainly because I haven't taken the time to figure out how to lock it down
nh2: do I guess that right that HYPERDEX_CLIENT_INTERRUPTED is mainly a forward of EINTR from epoll/select/whatever?
rescrv: that's the intent
rescrv: the logic for passing up an INTERRUPTED is just in place around epoll/whatever, so you need to mask everywhere else.
nh2: rescrv: I still don't exactly get what benefit you get from masking all signals before calling into hyperdex (or what problems you get if not). Can you explain a bit more?
rescrv: So I don't remember the rationale 100%, but it boiled down to reducing the complexity of the implementation. Too many calls could fail with EINTR, and IIRC there was ambiguity about how different platforms would return it.
rescrv: To reduce that complexity, I decided that either apps would treat signals as fatal (e.g., neither side does anything because it's assumed that signals are fatal), apps would not send signals to the HyperDex thread, or apps would mask signals and we would return EINTR.
rescrv: I think at least part of it was that there were cases where a signal would divide two syscalls that had to happen, and adding complexity to defer the second would have complicated things.
rescrv: I don't know that it's still the case, so I'm willing to reconsider our signal handling if we move to a truly better position.
rescrv: even if it's to just put the mask inside each call
blackdog: rescrv: I'd be happy to have the haskell bindings in the main repo.
rescrv: blackdog: me too. I just want to fix any perf issues. We merged the Go bindings a little early, and I don't want to make that mistake twice.
blackdog: yes, of course :) just giving permission for my code (though I think it's BSD licensed anyway)
blackdog: but yeah, i think ghc's runtime uses signals quite heavily
nh2: blackdog: this page summarizes it nicely if you are interested (and in case you don't know it already): https://ghc.haskell.org/trac/ghc/wiki/Commentary/Rts/Signals
blackdog: nh2: ah, that's a good resource. cheers.
nh2: rescrv: ok. Still a few questions:
nh2: how do you mean "Too many calls could fail with EINTR"? I mean this can only happen when a signal comes in - where should that signal come from?
rescrv: I meant that all the code behind hyperdexclient* is prone to using syscalls that could fail. At the time I made the decision to mask like we do, there were many cases that would have been complicated by checking EINTR and having to fast fail. The alternative would have been to hide future errors or drop eintr
nh2: rescrv: but they would only fail with EINTR if interrupted by a signal, right? In a normal C/C++ application, where should those signals come from?
rescrv: nh2: From the application using HyperDex. We register no signals and generate no signals. But it would be an imposition on applications to forbid signals.
rescrv: So we wrote code that can handle signals with a small amount of application effort for only those apps that need to handle them.
nh2: rescrv: ah, so you mean "handling signals inside the hyperdex lib code is difficult so we prefer if applications mask them before calling us so that the cases in which we don't handle them cannot arise"?
nh2: "and we have some code (like around epoll) where we do handle them, in those places we unmask them ourselves"?
rescrv: nh2: 100% correct
rescrv: and it made sense at the time, but I'm willing to reconsider making the whole thing signal-safe so that masking is unnecessary
rescrv: the original code when we made this decision was much more complex, and it made sense
rescrv: now I think it may not make sense.
nh2: rescrv: yes, I think that in the long term it makes sense to make the library signal-safe
nh2: rescrv: another question: I don't understand yet what you meant with "either apps would treat signals as fatal (e.g., neither side does anything because it's assumed that signals are fatal)". 1) for the calling application, it can still decide if signals are fatal or not (hyperdex has no real effect on that, or does it?), and 2) how can the hyperdex side assume they are "fatal" given that they are now blocked before it's called (so it will never see any signal it doesn't unmask). Or did that mean "If the side calling hyperdex lib code allowed a signal to come through to hyperdex, that would probably be fatal [for hyperdex]? (because it wouldn't handle the signal and would do something arbitrary or terminate)"
rescrv: I guess a better way to have said that was, "If you don't mask signals in the way we suggest, then you'll get what you get (including occasional errors)." I extrapolated that to mean that the only meaningful result would be to have the signals be fatal.
rescrv: This conversation has sparked me to look into making the library signal safe and leaving it at that

So it looks like we don't need Backoff and can safely use the timeout provided by hyperdex_client_loop.

AaronFriel commented 10 years ago

safe   call:  7ns
unsafe call: 90ns

Forgive me, but I am suspicious of unsafe calls having greater overhead than safe.

As I've said before, I don't think Backoff is necessary anymore, no matter what I do. I intend to rigorously test a variety of things before coming definitively to that conclusion, however.

nh2 commented 10 years ago

Forgive me, but I am suspicious of unsafe calls having greater overhead than safe.

I'm sorry, I flipped the order. I meant

safe   call:  90ns
unsafe call:   7ns

AaronFriel commented 10 years ago

Ah, I was wondering if GHC was being particularly tricky with your implementation and had done some optimization that made subsequent calls faster. One can never tell with GHC these days without looking at the assembly. :)

AaronFriel commented 10 years ago

As an update, @nh2 - one reason to have a Backoff[1] would be to prevent starving other threads trying to work with the same HyperDex connection. The hyperdex_client_loop function has to be synchronized. Any non-zero wait time means that it will likely loop at least once using epoll, as you suggested, but we'll still be holding the MVar for the client.

I'm currently creating a new Internal.Connection.Core and associated operations that will provide a single place to implement different concurrency schemes and centralize the implementation of the wrapper around the BusyBee-esque event loop of HyperDex. The Admin and Client modules will then just implement a type class and export specialized versions of connect/disconnect/etc. Everything will be appropriately marked INLINABLE or INLINE as necessary.
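Roughly, the factoring I have in mind looks like this (a hypothetical sketch with illustrative names; the real module is Internal.Connection.Core):

  -- Shared backend owning the event-loop state; Client and Admin both delegate to it.
  data Core = Core  -- the wrapped-call queue, loop thread, etc. would live here

  class HasCore c where
    getCore :: c -> Core

  newtype Client = Client Core
  newtype Admin  = Admin  Core

  instance HasCore Client where getCore (Client c) = c
  instance HasCore Admin  where getCore (Admin  c) = c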

[1] Of course, not with its current implementation. As discussed above, the current implementation's behavior is less than ideal.

AaronFriel commented 10 years ago

I'm bumping this to milestone 0.13 - there are non-performance related issues that will be solved first. If I'm lucky, the performance right off the bat with Internal.Connection.Core will be stellar and nothing will need to change.

AaronFriel commented 10 years ago

Pushing this back because it's blocked on milestone 0.12.

AaronFriel commented 10 years ago

I have implemented a lock-free[1] design with unified-backend commit 96476b03bd794905082575ce08a1968137ec12ec. You can see there that, instead of an MVar, there is now a hyhacLoop that mirrors loop. There are a few reasons for this:

  1. Resource management is strangely easier to reason about. At least, it is for me - though I might need to document how it works more thoroughly. The test version in ./sandbox is a prototype that runs against an emulated HyperDex client library. The tl;dr (a simplified sketch of this shape follows at the end of this comment):

    i. The HyperDex call begins with defining a setup (preamble), a C call that requires a pointer (and uses pointers allocated in the preamble), and a callback. The preamble uses ResourceT to perform allocations, and the callback does too. The callback's peek methods also live in ResourceT, and responsibly clean themselves up after each call.

    ii. This call is wrapped using wrapDeferred or wrapIterator, which specify how to handle the callback. This wrapper actually rewrites the call and generalizes it. It also establishes the communication channel between the callback function and the async result value or stream. It then pushes that Wrapped call onto a queue to process. The wrapper also registers finalizers in ResourceT so that if something blows up, a sane return value bubbles up in the client.

    iii. The hyhacLoop will read in the call, and invert the ResourceT. This Ouroboros maneuver with tryUnwrapResourceT allows the loop to take ownership of resources allocated in the preamble of the original call. Those allocations are safe to use as long as the loop lives. When the HyperDex loop returns a handle, hyhacLoop will synchronously call the registered callback, which pushes data back to the client and determines if the handle expires.

  2. The event loop in the C client library matches up to an event loop in Haskell, and there is no more backoff, yield, or micro-managing of hyperdex_client_loop(). Also, the latest versions of HyperDex expose a file descriptor that can be watched with select, epoll, or kqueue, so eventually Hyhac will be event-driven instead of poll-driven. That should improve performance.
  3. The backend implemented in Core can be reused by Client and Admin. A minor victory, but since implementing it correctly once is hard enough, I'd like to get it right in one place and not two.

[1] Really, absolutely not true. Completely false. GHC uses plenty of locks - but the important thing is that only one thread in Hyhac will access a given pointer at a time.
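A greatly simplified sketch of the queue-plus-loop shape described in points 1 and 2 (illustrative names only, with an emulated client instead of HyperDex; this is not the actual hyhac code):

  import Control.Concurrent (forkIO)
  import Control.Concurrent.MVar
  import Control.Concurrent.STM

  data Wrapped = Wrapped
    { sendCall :: IO ()   -- stands in for the C call that enqueues the operation
    , complete :: IO ()   -- stands in for peeking result pointers and notifying the caller
    }

  -- Caller side: enqueue a wrapped call and hand back the action that waits for its result.
  submit :: TQueue Wrapped -> IO String -> IO (IO String)
  submit queue emulatedWork = do
    resultVar <- newEmptyMVar
    atomically $ writeTQueue queue Wrapped
      { sendCall = pure ()                            -- "sending" is a no-op in this emulation
      , complete = emulatedWork >>= putMVar resultVar -- deliver the result to the caller
      }
    pure (takeMVar resultVar)

  -- Loop side: the single thread that owns all interaction with the client.
  hyhacLoop :: TQueue Wrapped -> IO ()
  hyhacLoop queue = do
    w <- atomically (readTQueue queue)
    sendCall w
    -- In the real design, hyperdex_client_loop would run here and the handle it
    -- returns would be matched against the registered callbacks.
    complete w
    hyhacLoop queue

  main :: IO ()
  main = do
    queue <- newTQueueIO
    _ <- forkIO (hyhacLoop queue)
    wait <- submit queue (pure "ok")
    wait >>= putStrLn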