30x / khaos-monkey

Apache License 2.0

Quality of Life Improvements #9

Open jbowen93 opened 7 years ago

jbowen93 commented 7 years ago

Ideally Khaos Monkey wouldn't run outside of working hours.

In my mind that means:

There are obviously a lot of variables here, but I think a starting point could be just grabbing a list of Google's US holidays and getting the M-F 9-5 window working.
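A minimal sketch of that starting point, in Python purely for illustration (nothing in this thread fixes the language the check would be written in). The name `is_working_time`, the 9:00 to 17:00 bounds, and the hard-coded holiday set are all assumptions; a real holiday list would come from whatever source gets agreed on.

```python
from datetime import date, datetime, time

# Hypothetical holiday list; two placeholder dates for illustration only.
HOLIDAYS = {date(2017, 7, 4), date(2017, 12, 25)}

# Assumed working-hours window (M-F 9-5).
WORK_START = time(9, 0)
WORK_END = time(17, 0)


def is_working_time(now: datetime, holidays=HOLIDAYS) -> bool:
    """Return True if `now` is a non-holiday weekday between 9:00 and 17:00."""
    if now.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        return False
    if now.date() in holidays:
        return False
    return WORK_START <= now.time() < WORK_END
```

Khaos Monkey would call something like this before firing an event and skip the event when it returns False. Time zones are ignored here; `now` is assumed to already be in the cluster operators' local time.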

Future improvements could include:

noahdietz commented 7 years ago

Yeah, you're right, consideration for time of day and working hours is necessary.

How much of this do you think we need to implement as configuration in the application, and how much belongs in the manifest for a ScheduledJob? I think the job manifest gives you a frequency (which day, time of day), but obviously not holidays, according to the Cron convention.

jbowen93 commented 7 years ago

Yeah, unfortunately it's not clear to me whether it's better to run this as a job or as a long-lived pod... I'll read more into scheduled jobs and get back to this.

noahdietz commented 7 years ago

I think your idea of running it as a long-lived pod gives us the ability to have it run at random times throughout the day, whereas a ScheduledJob is just that, scheduled, so we know when it'll happen, and it's not fun that way.

noahdietz commented 7 years ago

Example CronJob for khaos-monkey that runs every Wednesday at 1:30pm (per Cron convention). Each run lasts 10m with an event every 30s, and only kills random pods (as configured by the khaos-monkey env vars):

apiVersion: batch/v2alpha1
kind: CronJob
metadata:
  name: khaos-monkey
spec:
  schedule: "30 13 * * 3" # every Wednesday at 1:30 pm (cron day-of-week: 0 = Sunday, 3 = Wednesday)
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: khaos-monkey
            image: noahdietz/khaos-monkey:dev
            env:
            - name: KHAOS_INTERVAL
              value: "30s"
            - name: KHAOS_DURATION
              value: "10m"
            - name: KHAOTIC_EVENTS
              value: "kill-pods"
            - name: KHAOS_NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
            - name: KHAOS_MONKEY_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
          restartPolicy: OnFailure

noahdietz commented 7 years ago

A CronJob doesn't seem to be the answer here: too predictable, not configurable enough. If we want it running randomly within a window of time (weekdays, working hours, as you mentioned), then I think we might need to make one of the following:

jbowen93 commented 7 years ago

I'm of the opinion that a long-lived pod that fires events within a configured window is the best implementation. The overhead is likely negligible. Additionally, this allows the configuration to be modified easily via Secrets, ConfigMaps, or an external database.
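To make the "fires events within a configured window" idea concrete, here is a rough sketch in Python. The function `pick_event_times` and its parameters are hypothetical, not anything khaos-monkey exposes; it only shows how a long-lived process could draw unpredictable event times inside a window it reads from its configuration.

```python
import random
from datetime import datetime, timedelta


def pick_event_times(window_start: datetime, window_end: datetime, count: int, rng=random):
    """Pick `count` random moments between window_start and window_end, sorted."""
    span = (window_end - window_start).total_seconds()
    if span <= 0:
        raise ValueError("window_end must be after window_start")
    # Draw uniformly distributed offsets (in seconds) from the window start.
    offsets = sorted(rng.uniform(0, span) for _ in range(count))
    return [window_start + timedelta(seconds=s) for s in offsets]
```

The pod would compute the day's times once, sleep until each one, and fire an event. Passing a seeded `random.Random` as `rng` makes the schedule reproducible for testing.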

swade1987 commented 7 years ago

Some food for thought ...

jbowen93 commented 7 years ago

@swade1987 Appreciate you chiming in here.

swade1987 commented 7 years ago

Thanks for the feedback @jbowen93

It would also be useful to be able to force the application to run as part of the CI/CD process.

For example "just kill all pods relating to application-x"

jbowen93 commented 7 years ago

For the CI/CD process are you saying you would want to be able to always deploy an automatically configured khaos-monkey controller alongside your application?

noahdietz commented 7 years ago

Accommodating CI/CD feels like it would be best implemented as running a few pods of khaos in jobs as part of your pipeline, rather than as a long-lived deployment. I think supporting both is possible with a flexible enough config. Let me know if I'm off-base.

swade1987 commented 7 years ago

@noahdietz that was my thinking rather than what @jbowen93 was suggesting.

noahdietz commented 7 years ago

@jbowen93 in chatting with @swade1987 briefly, we agreed that a long-lived deployment with a configurable window of time in which to randomly run events is ideal.

However, in addition, it would be interesting to run a simple server in khaos-monkey that exposes a trigger endpoint for a CI/CD pipeline to call when it wants to initiate and target khaotic events that coincide with an application deployment. How do you feel about that, @jbowen93? We can open a separate issue to discuss a trigger, if necessary.
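A rough sketch of what such a trigger endpoint might look like, using only Python's standard library. Everything here is an assumption made for illustration: the `/trigger` path, the `event` and `target` JSON fields, and the 202 response are invented, and there is no authentication at all.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def parse_trigger(body: bytes):
    """Validate a trigger request body; return (status_code, response_dict)."""
    try:
        payload = json.loads(body or b"{}")
    except ValueError:  # covers malformed JSON and undecodable bytes
        return 400, {"error": "body must be JSON"}
    if not isinstance(payload, dict):
        return 400, {"error": "body must be a JSON object"}
    event = payload.get("event")
    target = payload.get("target")
    if not event or not target:
        return 400, {"error": "both 'event' and 'target' are required"}
    # A real handler would kick off the khaotic event here.
    return 202, {"accepted": {"event": event, "target": target}}


class TriggerHandler(BaseHTTPRequestHandler):
    """Hypothetical handler: POST /trigger with a JSON body."""

    def do_POST(self):
        if self.path != "/trigger":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        status, response = parse_trigger(self.rfile.read(length))
        data = json.dumps(response).encode()
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)
```

Serving it would be a matter of `HTTPServer(("", 8080), TriggerHandler).serve_forever()` (port chosen arbitrarily), and a pipeline step could then POST something like `{"event": "kill-pods", "target": "application-x"}` after its deployment finishes.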

jbowen93 commented 7 years ago

I like the idea of having a server that could be called through webhooks. I think we would have to leave it up to the user to properly configure auth to the server since that feels outside the scope of khaos-monkey as a deployable container.

I'm not sure I'm completely grokking the CI/CD pipeline issue. My understanding of this flow looks like this:

push to git -> build new image -> update deployment -> trigger khaos monkey

My confusion is on the trigger khaos monkey stage. Are we configuring khaos-monkey to know about a new deployment or are we trying to start some short running jobs that kill the newly deployed pods?

Please correct me if I'm way off base here.

swade1987 commented 7 years ago

@jbowen93 I am talking more about CI/CD for another service in the cluster. As part of its CI/CD pipeline, you may want to kick off a certain type of "chaos" to test something.