how to kill worker after the remote action execution?

vors commented 6 months ago

Thank you for the awesome project!

Let's say that we have the following setup:

- A k8s deployment with a fixed (for simplicity) number of pods
- Each pod runs a nativelink worker

It would be very useful to allow the worker to exit the nativelink binary after a single execution. That way I can kill the pod and it would be re-created, providing a clean environment for the next action execution.

allada commented 6 months ago

Currently this is not supported.

@zbirenbaum, could you make a PR that will shut down the worker after N jobs have been processed and make N configurable in the JSON config? This should allow @vors to just set this value to 1, which solves this issue.
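To illustrate the proposal above, here is a minimal sketch of a worker loop that stops after a configured number of jobs. It is not nativelink code: the key `max_actions_before_exit`, the function `run_worker`, and the convention that a missing or zero value means "no limit" are all assumptions made for this example.

```python
import json
from typing import Callable, Iterable


def run_worker(
    config_json: str,
    actions: Iterable[str],
    execute: Callable[[str], None],
) -> int:
    """Run actions until the configured limit is hit; return how many ran."""
    config = json.loads(config_json)
    # Hypothetical config key (not nativelink's real schema).
    # Assumption: 0 or a missing key means "no limit".
    limit = config.get("max_actions_before_exit", 0)

    processed = 0
    for action in actions:
        execute(action)
        processed += 1
        if limit and processed >= limit:
            # Stop taking work; the caller would exit the process here.
            break
    return processed


executed = []
count = run_worker(
    '{"max_actions_before_exit": 1}',
    ["action-a", "action-b", "action-c"],
    executed.append,
)
```

With the value set to 1, only the first action runs and the loop returns, at which point the process could exit and let the orchestrator replace it.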

In the long run we are likely going to split workers into two parts (worker & executor). The executor would be super lightweight, and its job is just to do bookkeeping. The worker would be a single process running on the same machine (required), and its job is to prepare the environment for the executor and then instruct the executor to do the actual work. By doing this we can then make a worker implementation that can talk to k8s/docker/containerd directly and just launch nativelink inside a pod on the same machine.
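As a rough illustration of that split, the sketch below has one function play the worker role (prepare a clean environment) and a child process stand in for the executor (do the actual work). The temporary directory and `subprocess` call are stand-ins chosen for this example; they are not the k8s/docker/containerd integration described above, and `run_in_fresh_environment` is a hypothetical name.

```python
import subprocess
import sys
import tempfile
from typing import List, Tuple


def run_in_fresh_environment(command: List[str]) -> Tuple[int, str]:
    """Run a command in a new, empty scratch directory and return (exit code, stdout)."""
    # "Worker" role: prepare a clean environment for this one action.
    with tempfile.TemporaryDirectory() as workdir:
        # "Executor" role: a child process that does the actual work.
        result = subprocess.run(
            command,
            cwd=workdir,
            capture_output=True,
            text=True,
            check=False,
        )
        # The scratch directory is deleted when the with-block exits,
        # so the next action never sees leftovers from this one.
        return result.returncode, result.stdout


# The child lists its working directory, which should start out empty.
code, out = run_in_fresh_environment(
    [sys.executable, "-c", "import os; print(len(os.listdir('.')))"]
)
```

Each call gets its own directory, which mirrors the goal of a clean environment per action without restarting the outer process.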

vors commented 6 months ago

I'd really appreciate it if you could implement this proposal. That is one thing that we need for our deployment.

zbirenbaum commented 6 months ago

> Currently this is not supported.
>
> @zbirenbaum, could you make a PR that will shutdown the worker after N number of jobs have been processed and make it configurable in the json? This should allow @vors to just set this value to 1 which solves this issue.

Sure! I'll get started on this

TraceMachina / nativelink

how to kill worker after the remote action execution? #815