Closed: davidas1 closed this issue 2 years ago
At this point in time, max_worker is a placeholder; it does not affect the number of workers running in the system. Just use the min_worker option.
Setting min_worker to 0 removes all of the model's workers from the server. Since there is no auto-scaling of workers, you have to use the min_worker option to scale the workers up and down. In other words, if you want 5 workers, send PUT /models/{model-name} with min_worker=5, and if you want to scale down to 2 workers, send PUT /models/{model-name} with min_worker=2.
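For example, reusing the model name and endpoint from the commands in your report (adjust host/port to your management address):
curl -X PUT "http://127.0.0.1:8080/models/model_1?min_worker=5"
curl -X PUT "http://127.0.0.1:8080/models/model_1?min_worker=2"
The first call scales model_1 up to 5 workers, the second scales it back down to 2.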
Regarding your output of GET /models, there are no workers on the host. It's not clear what you tried to kill with nvidia-smi; my assumption is that it's a non-MMS process. Even if you run a model on the GPU, the backend worker itself runs on the CPU; this backend worker loads the model onto the GPU. Please share the output of nvidia-smi so we can check this further.
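Something like the following should be enough to collect that information (same host/port and model name as in your commands; GET /models/{model-name} should describe a single model and its workers):
curl -X GET "http://127.0.0.1:8080/models/model_1"
nvidia-smi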
Regarding the exception above, it seems like there is a YAML warning:
2020-01-28 14:51:59,582 [WARN ] W-model_1-1-stderr com.amazonaws.ml.mms.wlm.WorkerLifeCycle - /home/model-server/model_handler.py:47: YAMLLoadWarning: calling yaml.load() without
Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
Maybe this has something to do with why the backend worker is getting killed. Is this coming from your model code?
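If it does come from your handler, the usual fix is to pass an explicit loader to PyYAML. A minimal sketch, assuming the call at model_handler.py:47 is parsing a config file (the file name and variable are hypothetical):
import yaml

# Replace a bare yaml.load(f) with safe_load, or pass Loader= explicitly
with open("config.yaml") as f:   # hypothetical config file
    config = yaml.safe_load(f)   # same as yaml.load(f, Loader=yaml.SafeLoader)
safe_load only constructs plain YAML types, which is usually all a handler config needs, and it silences the deprecation warning.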
Thanks for the detailed response. For now I've solved my issue by registering/unregistering models instead of scaling. I still think the issue with min_worker=0 should be looked at, because it leaves resources that are not released for some reason.
Fixed as part of #915
I'm trying to use multi-model-server to serve multiple GPU models on a single machine.
The idea is to load models until GPU memory runs out, and then scale workers down and up based on the requests. The problem is that when I send a command such as:
curl -X PUT "http://127.0.0.1:8080/models/model_1?min_worker=0&max_worker=1"
It looks like the worker is deleted from MMS, but the worker process is still alive, as evident from nvidia-smi and the GPU memory consumption. Even when I try to force-kill the PID that I see in nvidia-smi, the worker is restarted but not registered in MMS, so invoking the model fails. If I instead do something like:
curl -X PUT "http://127.0.0.1:8080/models/model_1?min_worker=0&max_worker=0"
and then:
curl -X PUT "http://127.0.0.1:8080/models/model_1?min_worker=1&max_worker=1"
The process is killed, but it looks like the worker cannot scale up again.
What is the recommended method to achieve my desired behavior? I guess I can unregister and register the models instead of using the scaling feature, if all else fails...
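For reference, the unregister/register fallback would look roughly like this (the .mar URL and the initial_workers parameter are assumptions based on the MMS management API; same host/port as above):
curl -X DELETE "http://127.0.0.1:8080/models/model_1"
curl -X POST "http://127.0.0.1:8080/models?url=model_1.mar&model_name=model_1&initial_workers=1"
Unregistering should free the worker process and its GPU memory, and registering again starts a fresh worker.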