Open jbowen93 opened 7 years ago
Yeah, you're right, consideration for time of day and working hours is necessary.
How much of this do you think we need to implement as configuration in the application and how much belongs in the manifest for a ScheduledJob? I think the job manifest gives you a frequency, which day, time of day, but obviously not holidays, according to the Cron convention.
Ya unfortunately it's not clear to me if it's better to run this as a job or as a long lived pod...I'll read more into scheduled jobs and get back to this.
I think your idea of running it as a long lived pod gives us the ability to have it run at random times throughout the day...whereas a ScheduledJob is just that, scheduled, so we know when it'll happen and it's not fun that way
Example CronJob for khaos-monkey that runs every Wednesday @ 1:30pm (per Cron convention) and runs for 10m w/ an event every 30sec, that only kills random pods (as configured by the khaos-monkey env vars):
apiVersion: batch/v2alpha1
kind: CronJob
metadata:
  name: khaos-monkey
spec:
  schedule: "30 13 * * 3" # every Wednesday at 1:30 pm (day-of-week 3; 0 is Sunday)
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: khaos-monkey
            image: noahdietz/khaos-monkey:dev
            env:
            - name: KHAOS_INTERVAL
              value: "30s"
            - name: KHAOS_DURATION
              value: "10m"
            - name: KHAOTIC_EVENTS
              value: "kill-pods"
            - name: KHAOS_NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
            - name: KHAOS_MONKEY_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
          restartPolicy: OnFailure
A CronJob doesn't seem to be the answer here...too predictable, not configurable enough...if we want it running randomly within a window of time (weekdays, working hours, as you mentioned), then I think we might need to make one of the following:
- a RandomCronJob or something
- a khaos-manager pod to randomly kick off khaos-monkey jobs within the configured window of time
- a khaos-monkey pod that fires events randomly within the configured window (as you suggested)
I'm of the opinion that a long lived pod that fires events within a configured window is the best implementation. The overhead is likely negligible. Additionally this allows the configuration to be modified easily via secrets, configMaps or an external database.
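To make the long lived pod idea a bit more concrete, here is a minimal sketch of how such a pod might choose its next random firing time inside a configured window. Nothing here comes from the khaos-monkey codebase: `WINDOW_DAYS`, `WINDOW_START`, `WINDOW_END`, `in_window` and `next_fire_time` are hypothetical names, and the Monday to Friday, 9-5 window is only an assumed default.

```python
import random
from datetime import datetime, time, timedelta

# Hypothetical window settings; a real pod might read these from a ConfigMap.
WINDOW_DAYS = {0, 1, 2, 3, 4}   # Monday=0 ... Friday=4
WINDOW_START = time(9, 0)
WINDOW_END = time(17, 0)


def in_window(now: datetime) -> bool:
    """Return True if `now` is on an allowed weekday and inside the daily window."""
    return now.weekday() in WINDOW_DAYS and WINDOW_START <= now.time() < WINDOW_END


def next_fire_time(now: datetime, rng: random.Random,
                   max_gap: timedelta = timedelta(hours=2)) -> datetime:
    """Pick a random time after `now`; if it misses the window, move to the next one."""
    candidate = now + timedelta(seconds=rng.uniform(0, max_gap.total_seconds()))
    if in_window(candidate):
        return candidate
    # Find the next day whose window has not opened yet.
    day = candidate.date()
    if candidate.time() >= WINDOW_START:
        day += timedelta(days=1)
    while day.weekday() not in WINDOW_DAYS:
        day += timedelta(days=1)
    window_open = datetime.combine(day, WINDOW_START)
    window_len = datetime.combine(day, WINDOW_END) - window_open
    # Random offset inside that day's window.
    return window_open + timedelta(seconds=rng.uniform(0, window_len.total_seconds()))
```

The pod's main loop would then sleep until `next_fire_time(...)`, fire one event, and repeat, so nobody can predict when the next event lands.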
Some food for thought ...
Why are the date and time of execution important? Based on previous experience most "fires" start during weekends or outside of business hours. Testing during these extremes allows us to test our MTTR both inside and outside of core working hours, this is something which a lot of people miss currently.
I feel the app should run as a Deployment and execute given types of "chaos" randomly over a given period. We could back the application with a Postgres instance that stores the last time it ran each "type" of chaos, and then use that information to decide what type of chaos to run next.
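A rough sketch of that bookkeeping, assuming a single table keyed by chaos type. To keep it runnable without a database server it uses Python's built-in `sqlite3` as a stand-in for Postgres; the table name `chaos_runs`, the helper names, and the "least recently run goes next" policy are all assumptions, not anything decided in this thread.

```python
import sqlite3
from datetime import datetime, timezone


def init_db(conn: sqlite3.Connection, chaos_types: list) -> None:
    """Create the table and register each chaos type with no recorded run."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS chaos_runs ("
        "chaos_type TEXT PRIMARY KEY, last_run TEXT)"
    )
    conn.executemany(
        "INSERT OR IGNORE INTO chaos_runs (chaos_type, last_run) VALUES (?, NULL)",
        [(t,) for t in chaos_types],
    )
    conn.commit()


def record_run(conn: sqlite3.Connection, chaos_type: str, when: datetime) -> None:
    """Store the time a chaos type last ran, as a UTC ISO-8601 string."""
    stamp = when.astimezone(timezone.utc).isoformat()
    conn.execute(
        "UPDATE chaos_runs SET last_run = ? WHERE chaos_type = ?",
        (stamp, chaos_type),
    )
    conn.commit()


def pick_next(conn: sqlite3.Connection) -> str:
    """Return a never-run type if one exists, else the least recently run type."""
    row = conn.execute(
        "SELECT chaos_type FROM chaos_runs "
        "ORDER BY last_run IS NOT NULL, last_run, chaos_type LIMIT 1"
    ).fetchone()
    return row[0]
```

Any other selection policy (weighted random, cooldown per type) could read from the same table.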
@swade1987 Appreciate you chiming in here.
To the first point, the goal is to create a tool that can increase resiliency of applications without significantly inconveniencing engineers. If we can trigger "fires" during working hours then the root causes can be fixed during working hours. Obviously this doesn't help test MTTR during actual emergencies but ideally it can help reduce the number of emergencies that occur. That being said, if we properly implement date and time configuration then the tool could also be used to test MTTR.
I agree that the app should be run as a deployment backed by a persistent DB. Postgres may be a good first pass. Could possibly look at supporting managed cloud SQL DBs as well, e.g. Google Cloud SQL, AWS RDS, Azure SQL.
Thanks for the feedback @jbowen93
It would also be useful to be able to force the application to run as part of the CI/CD process.
For example "just kill all pods relating to application-x"
For the CI/CD process are you saying you would want to be able to always deploy an automatically configured khaos-monkey controller alongside your application?
Accommodating CI/CD feels like it would be best implemented as running a few pods of khaos in jobs as part of your pipeline, rather than as a long living deployment. I think supporting both is possible with a flexible enough config. Let me know if I'm off-base
@noahdietz that was my thinking rather than what @jbowen93 was suggesting.
@jbowen93 in chatting with @swade1987 briefly, we agreed a long living deployment with a configurable window of time to randomly run events in is ideal.
However, in addition, it would be interesting to run a simple server in khaos-monkey that exposes a trigger endpoint for a CI/CD pipeline to call out to when it wants to initiate & target khaotic events that coincide with an application deployment. How do you feel about that, @jbowen93? We can open a separate issue to discuss a trigger, if necessary.
I like the idea of having a server that could be called through webhooks. I think we would have to leave it up to the user to properly configure auth to the server since that feels outside the scope of khaos-monkey as a deployable container.
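For discussion, a trigger endpoint could be as small as the sketch below, using only Python's standard library. The `/trigger` path, the JSON fields `event` and `target`, and port 8080 are made up for illustration; no auth is included, in line with leaving that to the user.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Only "kill-pods" is mentioned in this thread; other event names would be added here.
ALLOWED_EVENTS = {"kill-pods"}


def parse_trigger(body: bytes) -> dict:
    """Validate a trigger request body (hypothetical schema) and normalize it."""
    data = json.loads(body)  # invalid JSON raises a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("request body must be a JSON object")
    event = data.get("event")
    if event not in ALLOWED_EVENTS:
        raise ValueError(f"unsupported event: {event!r}")
    target = data.get("target")
    if not isinstance(target, str) or not target:
        raise ValueError("target must be a non-empty string")
    return {"event": event, "target": target}


class TriggerHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/trigger":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        try:
            request = parse_trigger(self.rfile.read(length))
        except ValueError as exc:
            self.send_error(400, str(exc))
            return
        # A real handler would queue the event here; this sketch only acknowledges it.
        payload = json.dumps({"accepted": request}).encode()
        self.send_response(202)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(payload)


def serve(port: int = 8080) -> None:
    """Start the trigger server (blocks until interrupted)."""
    HTTPServer(("0.0.0.0", port), TriggerHandler).serve_forever()
```

A pipeline step could then POST something like `{"event": "kill-pods", "target": "application-x"}` once its deployment finishes.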
I'm not sure I'm completely grokking the CI/CD pipeline issue. My understanding of this flow looks like this:
push to git
-> build new image
-> update deployment
-> trigger khaos monkey
My confusion is on the trigger khaos monkey stage. Are we configuring khaos-monkey to know about a new deployment or are we trying to start some short running jobs that kill the newly deployed pods?
Please correct me if I'm way off base here.
@jbowen93 I am talking more about CI/CD for another service in the cluster. As part of its CI/CD pipeline, you may want to kick off a certain type of "chaos" to test something.
Ideally Khaos Monkey wouldn't run outside of working hours.
In my mind that means:
There are obviously a lot of variables here but I think a starting point could be just grabbing a list of Google's US holidays and getting the M-F 9-5 working.
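As a starting point, the M-F 9-5 plus holidays check could look roughly like this. The `HOLIDAYS` set is a hand-written sample of three 2024 US holidays standing in for whatever list is eventually pulled in, and `is_working_time` is a hypothetical helper name; time zones are ignored here.

```python
from datetime import date, datetime, time

# Sample placeholder data; a real list would be loaded from a holiday calendar.
HOLIDAYS = {date(2024, 1, 1), date(2024, 7, 4), date(2024, 12, 25)}


def is_working_time(now: datetime,
                    holidays: set = HOLIDAYS,
                    start: time = time(9, 0),
                    end: time = time(17, 0)) -> bool:
    """Return True only on non-holiday weekdays between `start` and `end`."""
    if now.date() in holidays:
        return False
    if now.weekday() >= 5:  # Saturday=5, Sunday=6
        return False
    return start <= now.time() < end
```

Khaos Monkey would call this before firing any event and simply skip the event when it returns False.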
Future improvements could include: