How would you deploy the worker and handle reloads and interruptions in the middle of a workload?
Workload should spread evenly across workers
Benchmark with cost 10
Questions
How to drain?
In a deploy/drain scenario is there a timeout? do we bound it so only a single deploy shuffle runs at a time?
ie v1 is out
v2 deployed
v1 drain is sent, no new connections allowed
v2 accepting all new client connections
do we wait and then shut down v1? do we let v1 run until all in-flight work finishes? need some sort of timeout
do we allow for a deployment of v3 while v1 and v2 are still shuffling?
this is the only strategy I've used in production for deploys: blue/green. Deploy the new version, begin accepting on new, drain old, shut down old with a timeout.
At some point we need the deploy to "end", and I feel it's up to the client to be defensive and retry within reason. This way the LB and deploy offer reasonable minimal downtime, while the client can weather being timed out and having its connection terminated. How does this work for persistent connections?
Decision Log
Used prometheus because that was most familiar and I already had a local docker-compose infrastructure wired up from previous blog posts
Started with just the negative crypt case to test the idea and deliver a fully integrated thin slice through the POC
Used JSON over HTTP because it's so easy in Go and accessible and widely known/understood, but would probably look at gRPC for an internal service because of the strict contracts and the autogenerated clients
Skipped TLS because we could add it using a service proxy like Envoy; even though data between the app and Envoy is unencrypted, Envoy runs as a sidecar on the local machine, so TLS would still be terminated on the worker machine. Not sure if this flies or not
Load-shedding strategy? The worker has a bounded worker pool and also hard timeouts on each HTTP request
Didn't touch on alerting: what are the service SLOs? how do we alert when we are in danger of breaching them? what intervals are they measured over? how do we model the error budget as an alert?
Glossed over worker unit testing, focused on getting everything wired up and flushing out unknown unknowns
Worker ignores signal handling and the actual process lifecycle in order to deliver the POC
Skipped CI (like Travis)
JS client error handling is not fully featured right now and needs to be revisited to better understand the HTTP client return values and how to best express those to the client.
Put the JavaScript client in the same repo for simplicity