Quality of Life Improvements

jbowen93 commented 7 years ago

Ideally Khaos Monkey wouldn't run outside of working hours.

In my mind that means:

Monday through Friday 9am to 5pm. Let's go with PST for now, but maybe with the option of an env var?
Not on weekends.
Not on holidays.

There are obviously a lot of variables here, but I think a starting point could be just grabbing a list of Google's US holidays and getting the M-F 9-5 working.

Future improvements could include:

Setting Time Zone via Env Var
Custom Holidays passed through a calendar
Similar to the above, but a fully configurable schedule, hopefully settable via Google Calendar. Obviously this brings in complexity, as we would now need to auth to that calendar, but it's just an idea. In the interim we could try to come up with some kind of sane format that could be expressed in a ConfigMap.

noahdietz commented 7 years ago

Yeah, you're right, consideration for time of day and working hours is necessary.

How much of this do you think we need to implement as configuration in the application and how much belongs in the manifest for a ScheduledJob? I think the job manifest gives you a frequency, which day, time of day, but obviously not holidays, according to the Cron convention.

jbowen93 commented 7 years ago

Ya unfortunately it's not clear to me if it's better to run this as a job or as a long lived pod...I'll read more into scheduled jobs and get back to this.

noahdietz commented 7 years ago

I think your idea of running it as a long lived pod gives us the ability to have it run at random times throughout the day...whereas a ScheduledJob is just that, scheduled, so we know when it'll happen, and it's not fun that way

noahdietz commented 7 years ago

Example CronJob for khaos-monkey that runs every Wednesday at 1:30pm (per Cron convention). It runs for 10m with an event every 30s, and only kills random pods (as configured by the khaos-monkey env vars)

apiVersion: batch/v2alpha1
kind: CronJob
metadata:
  name: khaos-monkey
spec:
  schedule: "30 13 * * 3" # every Wednesday at 1:30 pm (day-of-week 3; 0 is Sunday)
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: khaos-monkey
            image: noahdietz/khaos-monkey:dev
            env:
            - name: KHAOS_INTERVAL
              value: "30s"
            - name: KHAOS_DURATION
              value: "10m"
            - name: KHAOTIC_EVENTS
              value: "kill-pods"
            - name: KHAOS_NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
            - name: KHAOS_MONKEY_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
          restartPolicy: OnFailure

noahdietz commented 7 years ago

A CronJob doesn't seem to be the answer here...too predictable, not configurable enough...if we want it running randomly within a window of time (weekdays, working hours, as you mentioned), then I think we might need to make one of the following:

our own 3rd party resource, like a RandomCronJob or something
a long living khaos-manager pod to randomly kick off khaos-monkey jobs within the configured window of time
a single long living khaos-monkey pod that fires events randomly within the configured window (as you suggested)

jbowen93 commented 7 years ago

I'm of the opinion that a long lived pod that fires events within a configured window is the best implementation. The overhead is likely negligible. Additionally this allows the configuration to be modified easily via secrets, configMaps or an external database.

swade1987 commented 7 years ago

Some food for thought ...

Why are the date and time of execution important? Based on previous experience most "fires" start during weekends or outside of business hours. Testing during these extremes allows us to test our MTTR both inside and outside of core working hours, this is something which a lot of people miss currently.
I feel the app should run as a Deployment and execute given types of "chaos" randomly over a given period. We could back the application with a Postgres instance that stores the last time it ran each "type" of chaos, and then use that information to decide what type of chaos to run next.

jbowen93 commented 7 years ago

@swade1987 Appreciate you chiming in here.

To the first point, the goal is to create a tool that can increase resiliency of applications without significantly inconveniencing engineers. If we can trigger "fires" during working hours then the root causes can be fixed during working hours. Obviously this doesn't help test MTTR during actual emergencies, but ideally it can help reduce the number of emergencies that occur. That being said, if we properly implement date and time configuration then the tool could also be used to test MTTR.
I agree that the app should be run as a deployment backed by a persistent DB. Postgres may be a good first pass. Could possibly look at supporting cloud SQL DBs as well, e.g. Google Cloud SQL, AWS RDS, Azure SQL.

swade1987 commented 7 years ago

Thanks for the feedback @jbowen93

It would also be useful to be able to force the application to run as part of the CI/CD process.

For example "just kill all pods relating to application-x"

jbowen93 commented 7 years ago

For the CI/CD process are you saying you would want to be able to always deploy an automatically configured khaos-monkey controller alongside your application?

noahdietz commented 7 years ago

Accommodating CI/CD feels like it would be best implemented as running a few pods of khaos in jobs as part of your pipeline, rather than as a long living deployment. I think supporting both is possible with a flexible enough config. Let me know if I'm off-base

swade1987 commented 7 years ago

@noahdietz that was my thinking rather than what @jbowen93 was suggesting.

noahdietz commented 7 years ago

@jbowen93 in chatting with @swade1987 briefly, we agreed a long living deployment with a configurable window of time to randomly run events in is ideal.

However, in addition, it would be interesting to run a simple server in khaos-monkey that exposes a trigger endpoint for a CI/CD pipeline to call out to when it wants to initiate & target khaotic events that coincide with an application deployment. How do you feel about that @jbowen93? We can open a separate issue to discuss a trigger, if necessary.

jbowen93 commented 7 years ago

I like the idea of having a server that could be called through webhooks. I think we would have to leave it up to the user to properly configure auth to the server since that feels outside the scope of khaos-monkey as a deployable container.

I'm not sure I'm completely grokking the CI/CD pipeline issue. My understanding of this flow looks like this:

push to git -> build new image -> update deployment -> trigger khaos monkey

My confusion is on the trigger khaos monkey stage. Are we configuring khaos-monkey to know about a new deployment or are we trying to start some short running jobs that kill the newly deployed pods?

Please correct me if I'm way off base here.

swade1987 commented 7 years ago

@jbowen93 I am talking more about CI/CD for another service in the cluster. As part of its CI/CD pipeline you may want to kick off a certain type of "chaos" to test something.

30x / khaos-monkey

Quality of Life Improvements #9