apache / openwhisk

Apache OpenWhisk is an open source serverless cloud platform
https://openwhisk.apache.org/
Apache License 2.0

Enable capability to disable a single invoker for maintenance #3681

Open mdeuser opened 6 years ago

mdeuser commented 6 years ago

In an OpenWhisk deployment with multiple invokers, there may be an occasion to perform maintenance on a single invoker machine without wanting to bring down the entire OpenWhisk system.

Steps might look something like:

Related issues: #2678

dubee commented 6 years ago

@markusthoemmes

markusthoemmes commented 6 years ago

I agree we should have this but I'd formulate this more generally to be a "graceful shutdown".

The main work to be done here is: shut down the actor system while waiting for everything in flight to finish. Ideally the Kafka queue should be drained as well, but I'd be okay with skipping that bit for the first impl.
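A minimal sketch of the "stop accepting work, then wait for everything to finish" idea, written in plain Python rather than against OpenWhisk's actor system. `GracefulWorker`, `submit`, and the `completed` list are hypothetical names for illustration; the list stands in for activation records being stored, and draining the Kafka queue is left out, as suggested above.

```python
import time
from concurrent.futures import ThreadPoolExecutor


class GracefulWorker:
    """Hypothetical stand-in for an invoker that runs activations."""

    def __init__(self, max_workers=4):
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        # Stands in for activation records written to the database.
        self.completed = []

    def _run(self, activation_id, duration):
        time.sleep(duration)  # simulated user container execution
        # The record is stored only after the activation has finished.
        self.completed.append(activation_id)

    def submit(self, activation_id, duration=0.05):
        # Raises RuntimeError once shutdown() has been called.
        return self._pool.submit(self._run, activation_id, duration)

    def shutdown(self):
        # Refuse new work, then block until every accepted activation is done.
        self._pool.shutdown(wait=True)


worker = GracefulWorker()
for i in range(3):
    worker.submit(f"activation-{i}")
worker.shutdown()
print(sorted(worker.completed))
```

The point of the sketch is the ordering: new work is rejected first, and the process only exits after the accepted work has recorded its results.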

Maintenance mode can be implemented by giving the invoker an API (HTTP or JMX) to be able to control the ping-sender. Not sending pings will effectively result in maintenance mode.
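To illustrate why silencing the ping-sender amounts to maintenance mode, here is a small simulation. It assumes the controller treats an invoker as schedulable only while its last ping is recent; `PingSender`, `ControllerView`, the 10-second timeout, and the manual clock are all hypothetical, and the HTTP or JMX endpoint that would flip `enabled` is not shown.

```python
class PingSender:
    """Hypothetical ping-sender that an admin API could switch on and off."""

    def __init__(self, invoker_id, send):
        self.invoker_id = invoker_id
        self._send = send
        self.enabled = True

    def tick(self):
        # Called on a fixed schedule; sends a ping only when enabled.
        if self.enabled:
            self._send(self.invoker_id)
        return self.enabled


class ControllerView:
    """Hypothetical controller-side health view driven purely by pings."""

    def __init__(self, timeout_seconds):
        self.timeout = timeout_seconds
        self.last_ping = {}
        self.now = 0.0  # simulated clock, advanced by hand below

    def on_ping(self, invoker_id):
        self.last_ping[invoker_id] = self.now

    def schedulable(self):
        # An invoker is schedulable while its last ping is within the timeout.
        return sorted(i for i, t in self.last_ping.items()
                      if self.now - t <= self.timeout)


controller = ControllerView(timeout_seconds=10)
senders = {name: PingSender(name, controller.on_ping)
           for name in ("invoker0", "invoker1")}

for s in senders.values():
    s.tick()
before = controller.schedulable()   # ['invoker0', 'invoker1']

senders["invoker0"].enabled = False  # enter maintenance mode
controller.now = 15.0
for s in senders.values():
    s.tick()
during = controller.schedulable()   # ['invoker1']

senders["invoker0"].enabled = True   # leave maintenance mode
controller.now = 16.0
for s in senders.values():
    s.tick()
after = controller.schedulable()    # ['invoker0', 'invoker1']
```

Nothing else in the invoker changes in this model, so work it has already accepted can still run to completion while it is out of the schedulable set.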

dgrove-oss commented 6 years ago

Once we have the functionality in the invoker, we could connect it to kube's PreStop hook for the container lifecycle (https://kubernetes.io/docs/concepts/containers/container-lifecycle-hooks/) to enable clean shutdown of pods during kube-initiated maintenance.

mdeuser commented 6 years ago

sounds good. i want to suggest that if the "maintenance mode" invoker is restarted, it should still be in "maintenance mode", receiving no work. only once it is explicitly taken out of maintenance should it be ready for work.
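One way to get that behavior is to persist the flag outside the process and read it on startup. The sketch below assumes a small JSON file is an acceptable store; the `MaintenanceState` class, the file name, and the `maintenance` key are hypothetical, and a real deployment might keep the flag in a database or configuration service instead.

```python
import json
import os
import tempfile


class MaintenanceState:
    """Hypothetical on-disk flag that survives invoker restarts."""

    def __init__(self, path):
        self.path = path

    def is_enabled(self):
        try:
            with open(self.path) as f:
                return bool(json.load(f).get("maintenance", False))
        except FileNotFoundError:
            # No file means maintenance mode was never requested.
            return False

    def set(self, enabled):
        # Write to a temporary file, then rename, so a crash cannot leave
        # a half-written flag behind.
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            json.dump({"maintenance": enabled}, f)
        os.replace(tmp, self.path)


with tempfile.TemporaryDirectory() as state_dir:
    path = os.path.join(state_dir, "invoker0-state.json")

    MaintenanceState(path).set(True)                   # operator enables maintenance
    after_restart = MaintenanceState(path).is_enabled()  # fresh object models a restart

    MaintenanceState(path).set(False)                  # operator explicitly re-enables work
    after_release = MaintenanceState(path).is_enabled()

print(after_restart, after_release)  # True False
```

On startup the invoker would consult `is_enabled()` before starting its ping-sender, so a restart alone never puts it back into rotation.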

dubee commented 6 years ago

@markusthoemmes, after some experimenting with terminating an invoker's actor system via a REST endpoint, I noticed that activation records do not get posted to the database for user containers that are still running after the actor system has been shut down. Perhaps a better implementation would consist of directly telling the controller that an invoker is in maintenance mode, so the controller takes that invoker out of the scheduling system.

Sample code: https://github.com/dubee/openwhisk/commit/8a60ee3ddd3e619874966da438774bcff6447e5e
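The controller-side alternative can be sketched as a pool that skips invokers flagged for maintenance. This is not OpenWhisk's load balancer and is not taken from the linked commit: `InvokerPool`, `set_maintenance`, and the round-robin choice are hypothetical and only illustrate the exclusion.

```python
class InvokerPool:
    """Hypothetical controller-side view of invokers with a maintenance set."""

    def __init__(self, invoker_ids):
        self._all = list(invoker_ids)
        self._maintenance = set()
        self._next = 0

    def set_maintenance(self, invoker_id, enabled):
        if invoker_id not in self._all:
            raise KeyError(invoker_id)
        if enabled:
            self._maintenance.add(invoker_id)
        else:
            self._maintenance.discard(invoker_id)

    def choose(self):
        # Only invokers outside maintenance are candidates for new work.
        candidates = [i for i in self._all if i not in self._maintenance]
        if not candidates:
            raise RuntimeError("no invoker available for scheduling")
        chosen = candidates[self._next % len(candidates)]
        self._next += 1
        return chosen


pool = InvokerPool(["invoker0", "invoker1", "invoker2"])
pool.set_maintenance("invoker1", True)
picks = [pool.choose() for _ in range(4)]
print(picks)  # ['invoker0', 'invoker2', 'invoker0', 'invoker2']
```

Because the invoker process itself keeps running in this model, containers that are already executing can finish and post their activation records.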

dubee commented 6 years ago

Another option might be to disable the Scheduler in the invoker to stop it from sending the ping messages back to the controller.

dubee commented 6 years ago

Was able to stop only the actor responsible for pinging the controller, which allows for activation records to be posted to the database for user containers that are still running when the actor is stopped.

Sample code: https://github.com/dubee/openwhisk/commit/40d655b327a6a2d47343dd08692f0f4d699c7329