danielfm / pybreaker

Python implementation of the Circuit Breaker pattern.
BSD 3-Clause "New" or "Revised" License

Allow concurrent execution of wrapped method #41

Closed jshiell closed 4 years ago

jshiell commented 5 years ago

This is a revised version of #40.

Please note that I don't expect this to be accepted, as it substantially changes the concurrency guarantees around the execution of the wrapped method, which could break all sorts of things for current users. But I did think it might be of interest to people using this library under workloads similar to our own, and hence thought I should at least suggest the change.

This PR removes the locking around the wrapped method call, instead making the CircuitMemoryStorage thread-safe and adding another lock to ensure that the counter changes and failure/success behaviour are atomic. This has two major consequences: firstly, there is no longer any expectation of exclusive execution in the wrapped method; and secondly, listener calls may likewise now be concurrent.
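To illustrate the idea (this is a hypothetical sketch, not pybreaker's actual code: the class and method names below are made up), the storage can guard its own counter updates with a lock, so concurrent wrapped calls stay safe without being serialized:

```python
import threading

class ThreadSafeMemoryStorage:
    """Sketch of an in-memory storage whose counter updates are guarded
    by an internal lock. The wrapped call itself is NOT serialized; only
    the bookkeeping is atomic."""

    def __init__(self):
        self._lock = threading.Lock()
        self._fail_counter = 0

    def increment_counter(self):
        # Atomic read-modify-write; safe to call from many threads at once.
        with self._lock:
            self._fail_counter += 1
            return self._fail_counter

    def reset_counter(self):
        with self._lock:
            self._fail_counter = 0

    @property
    def counter(self):
        with self._lock:
            return self._fail_counter
```

The trade-off is exactly the one described above: the storage stays consistent under concurrency, but nothing prevents two wrapped calls from running at the same time.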

Why this madness, you may ask? Our use case is wrapping IO heavy tasks (proxying HTTP calls for assets, in particular) and hence the coarse-but-safe locking behaviour around the wrapped method leads to a major performance impact on our servers. In this case we're very happy to allow looser guarantees to substantially increase throughput.

I've also added try/except around listener method invocations, as previously a badly behaved listener could affect the flow of the circuit breaker.
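A minimal sketch of that guard (function and event names here are illustrative, not the PR's actual code): each listener callback is wrapped so an exception in one listener is logged rather than propagated into the breaker's state machine.

```python
import logging

logger = logging.getLogger(__name__)

def notify_listeners(listeners, event, *args):
    """Invoke the named callback on each listener, swallowing and logging
    any exception so a misbehaving listener cannot disturb the breaker."""
    for listener in listeners:
        try:
            getattr(listener, event)(*args)
        except Exception:
            # A broken listener should not affect circuit-breaker flow.
            logger.exception("listener %r raised during %s", listener, event)
```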

Finally, thanks very much for all of your work on this library - it's made our lives a lot easier, and is much appreciated!

jshiell commented 5 years ago

Hmm, that'll teach me to do test builds only on 3.7. I think the edge-case Travis has caught should be fixed by making the half-open test request atomic; I'll have a look at that tomorrow as I'm out of time.
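One way to make the half-open test request atomic (a sketch of the general technique, with hypothetical names, not the fix that eventually landed) is a non-blocking lock that admits exactly one probing caller at a time:

```python
import threading

class HalfOpenGate:
    """Sketch: while half-open, at most one caller may probe the backend;
    all other callers are turned away until the probe completes."""

    def __init__(self):
        self._probe_lock = threading.Lock()

    def try_acquire_probe(self):
        # Non-blocking: True only for the single caller allowed through.
        return self._probe_lock.acquire(blocking=False)

    def release_probe(self):
        # Called once the probe succeeds or fails.
        self._probe_lock.release()
```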

danielfm commented 5 years ago

Thanks for the contribution!

> Why this madness, you may ask? Our use case is wrapping IO heavy tasks (proxying HTTP calls for assets, in particular) and hence the coarse-but-safe locking behaviour around the wrapped method leads to a major performance impact on our servers. In this case we're very happy to allow looser guarantees to substantially increase throughput.

Just out of curiosity, are you running this patch on your servers? Can you give more detailed information on how much faster this implementation is compared to the unpatched version? Also, could you provide more information on your environment (e.g. are you storing state data in Redis, or using the in-memory storage)?

jshiell commented 5 years ago

Of course - we were running it in production. I pulled it after finding some threading edge cases, out of paranoia, and haven't been able to push it live again yet as I've been travelling for work. I'll hopefully be able to remedy that later this week.

The use case in particular is an image proxy - we have a Bottle server running via WSGI on CherryPy that uses Pillow to do on-the-fly image transformations (normally resizes, but could be blurs/crops/etc.) against a number of backends. Given that backends may be unreliable, and we don't wish to affect behaviour on the other backends when one is, the circuit breaking is applied around the reverse-proxied query to each backend.

We're currently handling about 60-70 req/s over 15 instances at ~30% load (we're in a shared Cloud Foundry environment, so we run at low load both to handle bursts and to avoid clobbering other apps in the same cell, since we're CPU-heavy and CF won't rebalance after deployment). But we found that the exclusive locking around the delegated tasks resulted in everything falling over - the HTTP server queues backed up and we started refusing requests. Which makes sense, given that a lot of our time is spent waiting for the backend to reply.

I fear I don't have any concrete numbers as Grafana is refusing to give me data for the period we were running it with the original implementation, bless it.

We're just using the in-memory storage - it's just a proxy, so we try to keep them stateless and entirely independent so we can scale them as needed.

danielfm commented 5 years ago

I'm still not sure about this one; I'm currently not running any Python system with pybreaker in which I could experiment with this change in a more realistic setting.

Thus, my suggestion would be to leave the CircuitBreaker class as it is and provide the implementation in this PR as a separate class, e.g. UnsafeCircuitBreaker or something like that, so that users can choose which guarantees make the most sense for them. Alternatively, keep just the one CircuitBreaker class, but somehow allow the user to choose which behaviour to use.
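The second option could look something like this sketch (the class, the `exclusive_call` flag, and the method body are all hypothetical, not pybreaker API): a single breaker class whose wrapped-call lock degrades to a no-op context manager when the user opts out of exclusive execution.

```python
import threading
from contextlib import nullcontext

class SketchBreaker:
    """Hypothetical switchable breaker: exclusive_call=True keeps the
    original coarse lock around the wrapped call; False replaces it with
    a no-op context manager, allowing concurrent wrapped calls."""

    def __init__(self, exclusive_call=True):
        self._call_lock = threading.RLock() if exclusive_call else nullcontext()

    def call(self, func, *args, **kwargs):
        with self._call_lock:
            # State checks and counter updates would happen here; storage
            # would need its own locking when exclusive_call is False.
            return func(*args, **kwargs)
```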

jshiell commented 5 years ago

That's entirely fair - as I said, I didn't expect it to be merged, as anyone relying on the current behaviour could be badly burnt.

I'll have a look at seeing if we can make the behaviour switchable when I next get a few minutes 👍

jshiell commented 4 years ago

I'm deeply ashamed to say I forgot about this. So this morning I picked it up to see what I could do with it. Only to find I'm an idiot 😄

It turns out that at some point someone (me) reverted the monkey-patched changes in our app and pointed the app at my fork. However, they (me) buggered it up and pointed it at master. So the TL;DR is that we've been running our app for the better part of a year against the master branch, and it's been fine. I don't know what invalidated this - whether it was something in the runtime (we're fairly aggressive about following the latest releases of Python) or whether I misdiagnosed the root cause and it was something else in our environment causing the contention issues. Either way, all's well that ends well, if a little embarrassing for me!

On that basis, and in the firm belief that simpler code is better code, I'm going to close this. Thanks as always for all your work, and apologies for leaving this hanging so long.