How would you deploy the worker and handle reloads and interruptions in the middle of a workload?
Workload should spread evenly across workers
Benchmark with cost 10
Questions
How to drain?
In a deploy/drain scenario is there a timeout? do we bound it so only a single deploy shuffle runs at a time?
ie v1 is out
v2 deployed
v1 drain is sent, no new connections allowed
v2 accepting all new client connections
do we wait and then shut down v1? do we let v1 run until all in-flight work finishes? need some sort of timeout
do we allow for a deployment of v3 while v1 and v2 are still shuffling?
this is the only strategy I've used in production for deploys: blue/green. Deploy the new version, begin accepting on new, drain old, shut down old with a timeout.
At some point we need the deploy to "end", and I feel it's up to the client to be defensive and retry within reason. This way the LB and deploy offer reasonable minimal downtime, while the client can weather being timed out and having its connection terminated. How does this work for persistent connections?
Decision Log
Used prometheus because that was most familiar and I already had a local docker-compose infrastructure wired up from previous blog posts
Started with just the negative crypt case to test the idea and deliver a fully integrated thin slice through the POC
Used JSON over HTTP because it's so easy in Go and accessible and widely known/understood, but would probably look at gRPC for an internal service because of the strict contracts and the autogenerated clients
Skipped TLS because we could add it using a service proxy like Envoy; even though data between the app and Envoy is unencrypted, Envoy runs as a sidecar on the local machine, so TLS would still be terminated on the worker machine. Not sure if this flies or not
Load-shedding strategy? The worker has a bounded worker pool and also hard timeouts on each HTTP request
Didn't touch on alerting: what are the service SLOs? how do we alert when we are in danger of breaching them? what intervals are they measured over? how do we model the error budget as an alert?
Glossed over worker unit testing, focused on getting everything wired up and flushing out unknown unknowns
Worker ignores signal handling and the actual process lifecycle in order to deliver the POC
Skipped CI (like Travis)
JS client error handling is not fully featured right now and needs to be revisited to better understand the HTTP client return values and how to best express those to the client.
Put the JavaScript client in the same repo for simplicity