Granular rollback functionality discussion

alexander-gorelik commented 4 years ago

Hi,

I am opening this issue following my discussion with @Lawouach on Slack.

Intro

We have started implementing chaos experiments in our organization. The benefits of those are known to all.

But there are additional top management requirements for the testing framework we are going to build.

The problem

1. At every moment during experiment execution, we must have a live monitor on what is going on and what the impact on users is, and never leave garbage in production caused by our steady-state or actions. Why is this important? First of all, I think this is pretty reasonable for any company. The worst scenario for our management is a user impact. We have 150M first-level users, so it is not easy to convince management that our experiments are something they must have while telling them that we are going to impact users during an experiment. So we must assure them that we are controlling everything and can, to a certain degree, recover everything.
2. Ability to stop the experiment at any moment and perform a rollback of the things we did so far. Why do we need this? See number 1 above :point_up: Let's say I have an experiment that runs for half an hour. After 5 minutes I get a Slack message from the Production team that something bad happened in production (not as a result of my experiment) that is causing a really big impact on users, and they are trying to solve it. The demand from me is to stop the execution of my experiment; they do not want to add additional chaos to a prod that is already suffering, which might cause a total collapse. Right now there is no option to stop the experiment and execute the rollback. Let's say we will have this option in the near future: I executed my steady-state and one of my actions, so what should I roll back? The current rollback assumes that the experiment has fully run; maybe I don't need to execute the whole rollback code, just the part of it that is relevant to the probes/actions I performed.
3. Handling any case of sudden experiment termination and the ability to recover. Why do we need this? See number 1 above :point_up: Let's say I have an experiment that runs for half an hour. I execute it in a dedicated VM and suddenly Amazon decides to terminate it, or you just run some server that suddenly crashes during the experiment. :sob: Now, CTK should not have to handle this kind of problem, but it can help with that.

What can we do with the above?

If we are able to link probes and actions to dedicated rollbacks, so that they execute only if those actions/probes were executed, this will give really great control over the rollback functionality. :rocket:

{
    "steady-state-hypothesis": {
        "probes": [
            {
                "type": "probe",
                "name": "name-a"
            }
        ]
    },
    "method": [
        {
            "type": "action",
            "name": "name-b"
        }
    ],
    "rollbacks": [
        {
            "type": "action",
            "name": "rollback-first",
            "link": ["name-a"]
        },
        {
            "type": "action",
            "name": "rollback-second",
            "link": ["name-a","name-b"]
        },
        {
            "type": "action",
            "name": "rollback-third"
        }
    ]
}

If only name-a ran, then rollback-first and rollback-third execute.
If name-a and name-b both ran, then all rollbacks execute.
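
The two cases above can be sketched in a few lines of Python. This is only an illustration of the proposed semantics, not Chaos Toolkit code: the `link` field is the one proposed in this issue, `select_rollbacks` is a hypothetical helper, and the sketch assumes a rollback runs only when every activity it links to was executed (which is what the two cases imply).

```python
from typing import Dict, List, Set


def select_rollbacks(rollbacks: List[Dict], executed: Set[str]) -> List[str]:
    """Return the names of the rollbacks whose linked activities all executed."""
    selected = []
    for rollback in rollbacks:
        # "link" is the field proposed in this issue; it does not exist in CTK today.
        links = rollback.get("link", [])
        # A rollback without links is unconditional (all() of an empty list is True).
        if all(name in executed for name in links):
            selected.append(rollback["name"])
    return selected


ROLLBACKS = [
    {"type": "action", "name": "rollback-first", "link": ["name-a"]},
    {"type": "action", "name": "rollback-second", "link": ["name-a", "name-b"]},
    {"type": "action", "name": "rollback-third"},
]

if __name__ == "__main__":
    print(select_rollbacks(ROLLBACKS, {"name-a"}))
    print(select_rollbacks(ROLLBACKS, {"name-a", "name-b"}))
```

Declared order is preserved, so the selected rollbacks would still run in the sequence they appear in the experiment.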

I am not specifying what I am going to do in my code because this post is already too long.

Thanks!

Lawouach commented 4 years ago

Hey @alexander-gorelik, so I did some exploration for this feat. request and I'll try to recap my thoughts here.

First, thank you for the extensive description, always helpful to understand the context to respond accordingly.

Let's put a bit of context. A Chaos Experiment was not designed to be a test but to sort of mimic real-life incidents. In your system, you may have bulkheads that prevent an incident from triggering a larger problem. A chaos experiment is also here to document whether or not you have the right ones (or any at all) and surface evidence (or lack thereof) about their effectiveness. Automated rollbacks make sense in a test approach but in a chaos experiment, they are a different promise. They can be considered as an "undo" of the action carried out during the experiment (say I removed a permission on a file to see how something reacts to this, my rollback would put the permission back in place). The rollback should not attempt to "save the system from itself". We are to learn about our system's complexity in all its glory.

So the promise made by Chaos Toolkit is to run rollbacks (undos) when an experiment goes all the way (whether deviating or not). But a design decision was made to not apply them when the operator of the experiment purposefully stops the experiment. This can happen on two occasions:

the chaos toolkit receives a SIGINT/SIGTERM signal
a control extension raises an InterruptExecution

In both cases, the idea is that if the experiment was explicitly interrupted, the operator likely wants to look at the system and rollbacks could interfere in an unhelpful way.

Let's now refresh our minds that an action in an experiment has four potential paths:

it runs to completion (whatever that completion led to in the system)
it errors for an unknown reason (think a 500 in HTTP). In this case the chaos toolkit reports it aborted.
it fails in the sense that it didn't match some internal expectations from the extension developer's viewpoint (usually raising ActivityFailed). In that case, chaos toolkit carries on.
it doesn't run (because the experiment exited beforehand)

With all this context, let's come back to rollbacks. They are tricky:

they can be made to do things beyond the spirit they have been designed for. The chaos toolkit has no way to understand what you intend on doing with them.
they have no relationship to a particular action in the experiment

That second point is the one raised here I believe. Right now, rollbacks are applied sequentially much like the method. But nothing in the specification tells us that a rollback (or set of rollbacks) is related to a specific action. So for now, without being explicit about it (as proposed), the chaos toolkit would not know which to play.

Lawouach commented 4 years ago

So let's now talk about the potential solution here.

The following aspects must be discussed and implemented.

1. Run strategies

Classic strategy (the current)

Run all rollbacks, in the declared order, unless SIGINT/SIGTERM/InterruptExecution

Always strategy

Run all rollbacks, in the declared order
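
As a rough illustration of the difference between the two strategies, here is a minimal Python sketch. The names `RollbackStrategy` and `should_run_rollbacks` are hypothetical and only serve the example; the actual flag name and values were still open at this point in the discussion.

```python
from enum import Enum


class RollbackStrategy(Enum):
    # Hypothetical names for the two strategies described above.
    CLASSIC = "classic"
    ALWAYS = "always"


def should_run_rollbacks(strategy: RollbackStrategy, interrupted: bool) -> bool:
    """Decide whether the rollbacks section is played at the end of a run.

    `interrupted` is True when the run was stopped by SIGINT/SIGTERM or by a
    control raising InterruptExecution.
    """
    if strategy is RollbackStrategy.ALWAYS:
        return True
    # Classic behaviour: skip rollbacks when the operator interrupted the run.
    return not interrupted
```

With the classic strategy an interrupted run leaves the system untouched for inspection, while the always strategy plays the rollbacks regardless of how the run ended.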

2. Choice over which rollbacks to run

As proposed, we should be explicit about the relationship between a rollback and actions/probes elsewhere. So, let's see how this could look when the strategy is to always run:


    "method": [
        {
            "type": "action",
            "name": "action1"
        },
        {
            "type": "probe",
            "name": "probe1"
        },
        {
            "type": "action",
            "name": "action2"
        },
        {
            "type": "probe",
            "name": "probe2"
        }
    ],
    "rollbacks": [
        {
            "type": "action",
            "name": "rb1"
        },
        {
            "type": "action",
            "name": "rb2",
            "activities": ["action1"]
        },
        {
            "type": "action",
            "name": "rb3",
            "activities": ["action2"]
        },
        {
            "type": "action",
            "name": "rb4",
            "activities": ["action1", "probe2"]
        },
        {
            "type": "action",
            "name": "rb5"
        },
        {
            "type": "action",
            "name": "rb6",
            "activities": ["probe2"]
        }
    ]

Let's say the experiment went through all the way. All rollbacks will be played.

Let's say the experiment was interrupted during action2; then rb6 would not be applied because probe2 was not executed anyway. However, rb1 all the way to rb5 will be applied because they are explicit about action1 or are simply not tied to any particular activity and therefore apply across the board. We still play rb3 even though its referenced activity was interrupted because we don't know otherwise.

In other words:

rollbacks not referencing activities are always played
rollbacks referencing a single activity are always played when said activity was executed, even if it errored, deviated or was interrupted
rollbacks referencing activities are never played when none of their referenced activities were executed
rollbacks referencing several activities are still played even if only some of their referenced activities were executed
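
These rules can be condensed into a small selection function. The following is a sketch under assumptions, not the chaostoolkit implementation: `rollbacks_to_play` is a hypothetical helper, `activities` is the field proposed above, and `started` is assumed to hold the names of every activity that began executing (including ones that errored or were interrupted).

```python
from typing import Dict, List, Set


def rollbacks_to_play(rollbacks: List[Dict], started: Set[str]) -> List[str]:
    """Select rollbacks under the "always" strategy, keeping declared order."""
    played = []
    for rollback in rollbacks:
        referenced = rollback.get("activities")
        # No reference: the rollback applies across the board.
        # Otherwise: one started activity among the referenced ones is enough.
        if not referenced or any(name in started for name in referenced):
            played.append(rollback["name"])
    return played


ROLLBACKS = [
    {"type": "action", "name": "rb1"},
    {"type": "action", "name": "rb2", "activities": ["action1"]},
    {"type": "action", "name": "rb3", "activities": ["action2"]},
    {"type": "action", "name": "rb4", "activities": ["action1", "probe2"]},
    {"type": "action", "name": "rb5"},
    {"type": "action", "name": "rb6", "activities": ["probe2"]},
]

if __name__ == "__main__":
    # Interrupted during action2: probe2 never started, so rb6 is skipped.
    print(rollbacks_to_play(ROLLBACKS, {"action1", "probe1", "action2"}))
```

Note that this differs from an "all referenced activities must have run" rule: rb4 is played here even though probe2 never started, because action1 did.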

This requires a change in the specification (see https://github.com/chaostoolkit/chaostoolkit-documentation/issues/94) and in chaostoolkit to add a new flag for the two different strategies.

Lawouach commented 4 years ago

I also wonder if this flag could not actually become a settings flag because it seems more a strategy you want to apply across all your runs rather than on a per run basis?

alexander-gorelik commented 4 years ago

Ok, let's start from the simpler scenario, number 2 in the problems described above: the "experiment stop".

I created some diagrams that we can discuss.

So I am thinking of using a control interface before and after each probe or action to ask my DB if the "stop experiment" flag is turned on.
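
A minimal sketch of such a control, under several assumptions: the hook names follow the Chaos Toolkit control interface (`before_activity_control` / `after_activity_control`), the `STOP_FLAGS` dictionary is a hypothetical in-memory stand-in for the database, `experiment_id` is a hypothetical argument assumed to reach the control as a keyword argument, and the exception is redefined locally so the example runs without chaoslib installed (a real control would import `InterruptExecution` from `chaoslib.exceptions`).

```python
class InterruptExecution(Exception):
    """Local stand-in for chaoslib.exceptions.InterruptExecution."""


# Hypothetical in-memory replacement for the DB table holding the stop flag.
STOP_FLAGS = {}


def stop_requested(experiment_id: str) -> bool:
    """Return True when someone asked to stop this experiment."""
    return STOP_FLAGS.get(experiment_id, False)


def _check_stop(activity: dict, experiment_id: str) -> None:
    if stop_requested(experiment_id):
        raise InterruptExecution(
            f"stop requested around activity '{activity['name']}'"
        )


def before_activity_control(context: dict, **kwargs) -> None:
    # `context` is the activity (probe or action) about to run.
    _check_stop(context, kwargs.get("experiment_id", "default"))


def after_activity_control(context: dict, state: dict, **kwargs) -> None:
    # `state` is the result of the activity that just ran.
    _check_stop(context, kwargs.get("experiment_id", "default"))
```

Checking both before and after each activity keeps the delay between the stop request and the interruption to at most one activity's duration.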

Chaos-hub Experiment stop

Chaos-hub Experiment stop (1)

The flow we discussed is described here. The open question is what "Activity Executed" means: how do you count an activity as executed? This question is kind of a preparation for the harder problem, number 3 described above.

Chaos-hub Experiment stop (2)

To deal with problem number 3 I need to extend my solution to use some kind of state machine. The state will be stored in the DB before and after each activity or probe, so I will actually have the names of all executed actions and probes in the DB. (Do I have the action/probe names available in the Control Interface?)
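
A sketch of what that state recording could look like, with the same caveats: `JOURNAL` is a hypothetical in-memory stand-in for the DB, `started_activities` and `unfinished_activities` are hypothetical helpers, and the sketch assumes the activity passed to the control hooks is the declaration from the experiment, so its `name` and `type` are available.

```python
from typing import Dict, List

# Hypothetical in-memory stand-in for the DB table recording the run state.
JOURNAL: List[Dict] = []


def before_activity_control(context: dict, **kwargs) -> None:
    # Record that the activity is about to start.
    JOURNAL.append(
        {"name": context["name"], "type": context["type"], "phase": "started"}
    )


def after_activity_control(context: dict, state: dict, **kwargs) -> None:
    # Record that the activity returned, along with its reported status.
    JOURNAL.append(
        {
            "name": context["name"],
            "type": context["type"],
            "phase": "finished",
            "status": state.get("status"),
        }
    )


def started_activities(journal: List[Dict]) -> List[str]:
    """Names of the activities that began executing, in order."""
    return [entry["name"] for entry in journal if entry["phase"] == "started"]


def unfinished_activities(journal: List[Dict]) -> List[str]:
    """Names of activities that started but never reported back (e.g. VM killed)."""
    finished = {e["name"] for e in journal if e["phase"] == "finished"}
    return [name for name in started_activities(journal) if name not in finished]
```

After a crash, the list of started activities is what a recovery run would need in order to decide which rollbacks are relevant.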

The first solution I thought about (there might be better ones :-)) is this: I should have the ability to say to CTK, "Dear CTK, don't run the experiment, please run only the rollbacks, and these are the actions/probes I executed". This way it will behave according to the solution you described. WDYT?

I will add some diagrams, probably later. We need to discuss our solution with architects too.

Lawouach commented 4 years ago

Hey @alexander-gorelik, thanks so much for the helpful diagrams and scenarios.

What this highlights clearly is that it isn't a straightforward issue or solution indeed.

I can almost see also a scenario (or a strategy now that we support them) to say "run each rollback right after its action, rather than wait until the end".

github-actions[bot] commented 3 years ago

This Issue has not been active in 365 days. To re-activate this Issue, remove the Stale label or comment on it. If not re-activated, this Issue will be closed in 7 days.

github-actions[bot] commented 3 years ago

This Issue was closed because it was not reactivated after 7 days of being marked Stale.
