Shopify / semian

:monkey: Resiliency toolkit for Ruby for failing fast
MIT License

Fault injection #187

Open kirs opened 6 years ago

kirs commented 6 years ago

I've been reading about fuse, a mature circuit breaker library for Erlang (a platform known for "resiliency by default").

In the circuit breaker configuration, they have two fuse types (you can think of them as similar to toxics in Toxiproxy):

IMO, the idea of injecting faults through a circuit breaker is brilliant. Not every organization has adopted chaos engineering yet, but this could be a first step towards that, at least on the application level.

We should think about adopting this idea in Semian. The biggest concern would probably be development environment vs production: do we inject faults when it's running locally or on CI? If yes, how do we prevent test flakiness? Or should we do this only in production?
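To make the idea concrete, here is a minimal sketch of what fault injection at the circuit breaker level could look like in Ruby. It is not Semian's or fuse's API: `FaultInjectingBreaker`, `InjectedFault`, and the `rate`, `enabled`, and `random` parameters are all hypothetical names chosen for illustration. The `enabled` flag is one way to answer the development-vs-production question, and the injectable random source is one way to keep tests deterministic.

```ruby
# Hypothetical sketch: none of these names come from Semian or fuse.
class InjectedFault < StandardError; end

class FaultInjectingBreaker
  # rate:    probability (0.0..1.0) that a call fails artificially
  # enabled: gate for the feature, e.g. true only in production
  # random:  injectable random source so tests can stay deterministic
  def initialize(rate:, enabled:, random: Random.new)
    @rate = rate
    @enabled = enabled
    @random = random
  end

  # Runs the block, but may first raise InjectedFault, as if the
  # protected resource itself had failed.
  def acquire
    raise InjectedFault, "injected fault" if @enabled && @random.rand < @rate
    yield
  end
end

# Disabled (development/CI): the block always runs.
dev = FaultInjectingBreaker.new(rate: 1.0, enabled: false)
dev_result = dev.acquire { :ok }

# Enabled with rate 1.0: every call fails before reaching the resource,
# so the caller's fallback path is exercised.
prod = FaultInjectingBreaker.new(rate: 1.0, enabled: true)
prod_result =
  begin
    prod.acquire { :ok }
  rescue InjectedFault
    :fallback
  end
```

In a real integration the injected error would presumably need to be counted by the breaker like any other failure, which this sketch leaves out.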

kirs commented 6 years ago

@BoGs @sirupsen @jpittis @mac-adam-chaieb thoughts?

moechaieb commented 6 years ago

Neat idea. It could also be implemented as a Toxiproxy middleware in production, if we want to keep this feature outside of Semian.

From conversations in Slack, I gather that one of the biggest pain points of setting up Semian is coming up with the right configuration, which has to be done by hand. This would add more complexity to that. Do you think this would be a concern?

jacobbednarz commented 6 years ago

I like this a lot ❤

> IMO, the idea of injecting faults through a circuit breaker is brilliant. Not every organization has adopted chaos engineering yet, but this could be a first step towards that, at least on the application level.

I agree this is a great intermediate step between using Toxiproxy in CI and full-blown chaos engineering in production, which might make the transition to the latter a touch easier.

One thing to note is that in our setup, we instrument Semian quite a bit. How many times the circuit breaker has tripped and for how long are two (big) things that we're interested in watching, as that gives us insight into the underlying health of the systems we rely on and allows us to focus on particular subsystems should impacting trends emerge. If this were rolled in, we'd definitely need a way to flag the "blown fuse" as intentional and differentiate it from the real issues that have triggered the circuit breaker.
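As a rough illustration of that requirement, the sketch below tags each recorded failure with its cause so dashboards can filter out intentional ones. `TripStats`, `InjectedFault`, and the `:injected` / `:real` labels are assumptions made up for this example, not Semian instrumentation APIs.

```ruby
# Hypothetical sketch; the class and tag names are assumptions, not Semian APIs.
class InjectedFault < StandardError; end

class TripStats
  attr_reader :counts

  def initialize
    @counts = Hash.new(0)
  end

  # Classifies a failure by cause so injected faults can be told apart
  # from genuine problems in the underlying system.
  def record(error)
    cause = error.is_a?(InjectedFault) ? :injected : :real
    @counts[cause] += 1
  end
end

stats = TripStats.new
stats.record(InjectedFault.new("chaos"))
stats.record(IOError.new("connection reset"))
stats.record(IOError.new("timeout"))
# stats.counts now holds one :injected and two :real failures.
```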

> The biggest concern would probably be development environment vs production: do we inject faults when it's running locally or on CI? If yes, how do we prevent test flakiness?

Personally, I would advocate restricting this to production, as I think rolling this into the CI pipeline would cause quite a bit of frustration and developer pain. Instead, rely on Toxiproxy to test understood failure points in CI and then use the blown-fuse Semian functionality in production. You could use these two in conjunction, where new cases found in production are ported to Toxiproxy CI tests.

If having this functionality in CI was a must-have, an alternative would be to break it out into its own pipeline that is not on the developer path to production but still sets off warning lights for unhandled failures. This pipeline could be built to expect failure and perhaps allow N random blown fuses before it disables the functionality and allows the run to complete, aiming for a 100% green test rate at the end.
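The "allow N random blown fuses, then disable" behaviour could be sketched as a small budget object. This is only an illustration under assumed names (`FaultBudget`, `max_faults`, `rate`); nothing like it exists in Semian as far as this thread shows.

```ruby
# Hypothetical sketch of the "allow N blown fuses, then stop" idea.
class FaultBudget
  def initialize(max_faults:, rate:, random: Random.new)
    @remaining = max_faults
    @rate = rate
    @random = random
  end

  # Returns true when the caller should inject a fault, spending one
  # unit of budget. Once the budget is spent it always returns false.
  def inject?
    return false if exhausted?
    return false unless @random.rand < @rate

    @remaining -= 1
    true
  end

  def exhausted?
    @remaining <= 0
  end
end

# With rate 1.0 the first two calls inject, then injection switches off
# so the rest of the run can complete.
budget = FaultBudget.new(max_faults: 2, rate: 1.0)
decisions = Array.new(5) { budget.inject? }
```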

> From conversations in Slack, I gather that one of the biggest pain points of setting up Semian is coming up with the right configuration, which has to be done by hand.

Very interested to hear if there are (even hacky :P) scripts floating around that might help people get up and running with this configuration and lower the barrier to entry.

jpittis commented 6 years ago

I've always thought that fault injection in production is chaos engineering, not "an intermediate step".

Regularly injecting failure in production only has value if teams react to these failures by improving the resilience of their jobs and workflows.

I wonder if building a self-serve "shitlist" would let teams first attempt to make their logic resilient to failure and then toggle their logic on the shitlist to turn on fault injection.

Once we had a shitlist, prod-eng could begin to enforce certain fault-injection rates around certain sections of the application.
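One possible reading of this, sketched under assumptions: a registry where teams opt a section of code in at a chosen injection rate, and where a minimum rate can later be enforced from the outside. `FaultInjectionList`, `opt_in`, `enforce_minimum`, and the section names are all hypothetical.

```ruby
# Hypothetical sketch of a self-serve opt-in list with enforceable rates.
class FaultInjectionList
  def initialize
    @rates = {}
  end

  # A team opts a section in once they believe it tolerates failure.
  def opt_in(section, rate:)
    raise ArgumentError, "rate must be within 0.0..1.0" unless (0.0..1.0).cover?(rate)

    @rates[section] = rate
  end

  # Raises a section's rate to at least the given floor.
  def enforce_minimum(section, rate)
    @rates[section] = [@rates.fetch(section, 0.0), rate].max
  end

  # Sections that were never listed get no fault injection.
  def rate_for(section)
    @rates.fetch(section, 0.0)
  end
end

list = FaultInjectionList.new
list.opt_in(:checkout_emails, rate: 0.01)
list.enforce_minimum(:checkout_emails, 0.05) # floor is higher, so it wins
list.enforce_minimum(:search, 0.02)          # enforced without opt-in
```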

I'm 100% down to try this out and would be super excited to work on this over hack days.

IMO tests are not the right place to inject faults. Unless you're doing something like property testing, tests are better left deterministic.

jacobbednarz commented 6 years ago

You're right @jpittis - I did a poor job articulating that. My intention with that comment wasn't to say it isn't chaos engineering, just that it isn't a solution like Netflix's Chaos Monkey, whereby containers or entire instances are randomly terminated. Having fault injection via a circuit breaker would have less of an impact since it's already a partial gate to another system.

sirupsen commented 6 years ago

This is similar to what we did back in 2014 for Resiliency, albeit through mocks, not the application library. However, it assumes that the circuits are perfectly implemented and that the client does the right thing always. I.e., you're testing the application logic, not the client driver—which is very likely to have bugs, e.g. ActiveRecord had several we found with only Toxiproxy. I've said before that those mocks covered up about as many bugs as they found. This is a bit different because we now have a nicer abstraction level to do it at than a mock—but the foundational thing stands, since you're not testing the client and we can't trust them.

We found enough bugs at that layer that circumventing the client is more pure—but I think Toxiproxy is more pragmatic.