Define chaos engineering landscape

caniszczyk commented 6 years ago

Define a lanscape similar to s.cncf.io for chaos engineering

mattforni commented 6 years ago

Ping

caniszczyk commented 6 years ago

@mattforni starting working on this with @Lawouach a bit but need community input, we have a list of projects here: https://docs.google.com/document/d/1BeeJZIyReCFNLJQrZjwA4KMlUJelxFFEv3IwED16lHE/edit?ts=5ace0eab#heading=h.k8f5ndt8affu

The thing I'm struggling with is how to categorize the different solutions out there, what categories should there be? So far I have:

Security (ChaoSlinger)
Frameworks (chaos toolkit)
Hosted (i.e. gremlin)

Since I'm new to the space, it's a bit complicated. @Lawouach had some thoughts to organize things by how they are run... semi-autonomous (chaos monkey, chaos kube, gremlin etc)... agents that are started/stopped but not autonomous (like muxy). Agents that are simple orchestrators (like the chaostoolkit) but don't have the chaos knowledge themselves (they orchestrate agents).

Would love to hear your thoughts and the @chaoseng/maintainers

joaoasrosa commented 6 years ago

After giving some thought after the meeting, I like the idea from Julien Bisconti (https://github.com/veggiemonk/awesome-docker). Start defining per use case is one way to categorize the resources.

Also, is important to keep track of the project status (e. g., again using the idea in the awesome-docker repo with the icons), giving a clear indication to the community.

veggiemonk commented 6 years ago

Hi @caniszczyk,

By experience, here is what we noticed at https://github.com/veggiemonk/awesome-docker

First of all, price is usually the first thing people pay attention to. Making it very clear which are paying service and which are DIY, it is quite a defining factor.

Then, the hypothesis! What is the project helping us to test ? What does it do? Are we testing that kubernetes restarting pods when they are killed ? Are we testing the behavior of the application when the DB is unavailable ? Are we testing the alerting/monitoring system ? I feel this is the pain point because it defines the scope the project. If there is a monitoring working group it would be nice to ask them about how do they categorize projects because chaos engineering can only exist if there is monitoring and alerting.

Finally, the criteria like which cloud provider is supported, the language it's written in, are the documentation and requirements clearly outlined, the community around the project, does it run as a daemon on each node or not, and so on. Those are technical details and no decision is necessary in order to categorize projects.

In the end, with open source, technical details can change pretty quick.

I could go on and on but i'll stop here for now.

What are your views on this?

umamukkara commented 6 years ago

Hi @caniszczyk ,

From the landscape point of view, here are my observations, driven primarily by how litmus (https://github.com/openebs/litmus) came into being.

Frameworks are needed to introduce chaos into the stable system or a given work flow (CI pipelines for example) but chaos itself can be thought of multiple categories, such as application, & infrastructure (node, storage, networking). I have already seen frameworks as a landscape category, and I propose to have infrastructure chaos as another category and applications, networking and storage as sub-categories under infrastructure chaos.

veggiemonk commented 6 years ago

Hi @umamukkara,

That is a very interesting approach. Can you elaborate a bit more on what would you consider a framework as oppose to a tool ? Would a framework require some programming (code) instead of configuration (yaml/json) ?

gluckzhang commented 6 years ago

Hi @caniszczyk , Regarding the category, is it good to add "research topics" or "academic topics" as a category? Because I think nowadays we have lots of great practice in industry, there should also be some important but not urgent problems in this field to be investigated deeply. With this category, we can simply keep the current well-known frameworks, tools, methodologies in detailed category and let vague problems or research challenges go into this new category. It somehow also improves the vision of the proposal, showing that we not only pay attention to the current problems, but also interested in future challenges and innovative ideas. Hopefully the topics can also attract more academic teams on system resilience, fault injection, chaos engineering etc. They have more time for difficult problems and they must be happy to test their ideas in industry in the future. One of the best experiences for researches is starting from a new idea and finally applying it into real production :)

By the way I am a PhD student from KTH Royal Institute of Technology, my supervisor professor Martin Monperrus and I are quite interested in chaos engineering, self-healing software and anti-fragile systems. I am more than happy to contribute more to this wonderful working group!

caniszczyk commented 6 years ago

Google sheet to track things: https://docs.google.com/spreadsheets/d/1Ro7I8ckICpQfomM1AvYPDtS6RB33-L7-puhEp0Q3b9o/edit#gid=0

mattforni commented 6 years ago

Really enjoying the conversation here! Admittedly, I've been struggling to come up with a single set of classifications that encompass all of the tooling that is available today, as well as tooling I can see evolving in the coming years.

I'm a fan of @umamukkara's proposal of layering the landscape. In my opinion the top-level categorization is really the layer (or scope) at which the tool is attempting to inject failure. In perusing the list of available tooling I would suggest the following layer classifications:

cloud provider (chaos monkey, chaos-lambda)
os (gremlin)
orchestration (litmus, pod-reaper, kube-monkey)
container (blockade, pumba)
application (MockLab, Toxiproxy)

I think each of these classifications can be further defined by the type of failure modes it supports, which seem to boil down to three overarching themes:

resource (cpu, memory)
network (latency, packet loss)
system state (reboot, time)

Some of these categories or failure modes are likely not applicable depending on which layer the tool is targeting. For example, mucking with time at the container level is not particularly useful since system clock is a resource sourced at the OS level and shared across containers.

Additionally, I think @veggiemonk has a good point:

First of all, price is usually the first thing people pay attention to. Making it very clear which are paying service and which are DIY, it is quite a defining factor.

In general, pricing (read: open-source v hosted solution) often indicates the level of initial cost, time-wise, one should expect to invest as well as the level of support associated with the tool. That is, at least in the nascent days of many open-source projects, before they find a good home to take care of them.

In any case, that's just my $0.02 after giving defining the landscape a fair bit of thought. Happy to discuss during the next CNCF WG session, or async here.

umamukkara commented 6 years ago

@veggiemonk sorry for not providing more clarity. By framework I mean, users make modifications/extensions to suit their needs. Litmus is one such example. Users are expected to fork Litmus modify or add more tests and run in their CI pipelines.

@mattforni , thank you for elaborating. I really like "layer" wording. I can easily imagine Litmus being a Chaos Orchestrator with multiple test suits getting into it from various other layers. For example, one of the tests in Litmus execution flow can be a chaos test engineered by Pumba tool.

Lawouach commented 6 years ago

Interestingly, folks we speak to talk about reliability, security, performance, resiliency or (sometimes) observability of their system. Those echoes their concern often before the layer at which we can interact with the system to prove a problem. But that depends who we speak to, they come from various persepctive :)

chaoseng / wg-chaoseng

Define chaos engineering landscape #2