awslabs / multi-model-server

Multi Model Server is a tool for serving neural net models for inference
Apache License 2.0

Scaling with min_worker=0 removes worker but process is still up (and even restarting itself) #895

Closed: davidas1 closed this issue 2 years ago

davidas1 commented 4 years ago

I'm trying to use multi-model-server to serve multiple GPU models on a single machine.

The idea is to load models until GPU memory runs out, and then scale workers down and up based on incoming requests. The problem is that when I send a command such as:

$ curl -X PUT "http://127.0.0.1:8080/models/model_1?min_worker=0&max_worker=1"

it looks like the worker is deleted from MMS:

$ curl -X GET http://127.0.0.1:8080/models/model_1
{
  "modelName": "model_1",
  "modelUrl": "/opt/ml/model/model_1",
  "runtime": "python",
  "minWorkers": 0,
  "maxWorkers": 1,
  "batchSize": 1,
  "maxBatchDelay": 100,
  "loadedAtStartup": false,
  "workers": []
}

But the worker process is still alive, as is evident from nvidia-smi and the GPU memory consumption. Even when I force-kill the PID I see in nvidia-smi, the worker is restarted, but it is not registered in MMS, so when I invoke the model I get:

$ curl -X POST http://127.0.0.1:8080/models/model_1/invoke -T ~/kitten.jpg
{
  "code": 503,
  "type": "ServiceUnavailableException",
  "message": "No worker is available to serve request: model_1"
}

If I instead do something like:

$ curl -X PUT "http://127.0.0.1:8080/models/model_1?min_worker=0&max_worker=0"

and then:

$ curl -X PUT "http://127.0.0.1:8080/models/model_1?min_worker=1&max_worker=1"

the process is killed, but it looks like the worker cannot scale up again:

2020-01-28 14:51:59,579 [INFO ] epollEventLoopGroup-4-1 com.amazonaws.ml.mms.wlm.WorkerThread - 9000-ffb00f1f Worker disconnected. WORKER_SCALED_DOWN
2020-01-28 14:51:59,580 [INFO ] W-model_1-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Frontend disconnected.
2020-01-28 14:51:59,581 [ERROR] W-9000-model_1-stderr com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Couldn't create scanner - W-9000-model_1-stderr
2020-01-28 14:51:59,581 [ERROR] W-9000-model_1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Couldn't create scanner - W-9000-model_1-stdout
2020-01-28 14:51:59,582 [INFO ] W-model_1-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Backend worker process died.
2020-01-28 14:51:59,582 [INFO ] epollEventLoopGroup-3-2 ACCESS_LOG - /172.17.0.1:40952 "PUT /models/model_1?min_worker=0&max_worker=0 HTTP/1.1" 202 3
2020-01-28 14:51:59,582 [INFO ] W-model_1-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Traceback (most recent call last):
2020-01-28 14:51:59,582 [WARN ] W-model_1-1-stderr com.amazonaws.ml.mms.wlm.WorkerLifeCycle - /home/model-server/model_handler.py:47: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
2020-01-28 14:51:59,582 [INFO ] W-model_1-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -   File "/opt/conda/lib/python3.6/site-packages/mms/model_service_worker.py", line 174, in start_worker
2020-01-28 14:51:59,583 [WARN ] W-model_1-1-stderr com.amazonaws.ml.mms.wlm.WorkerLifeCycle -   conf = yaml.load(f)
2020-01-28 14:51:59,583 [INFO ] W-model_1-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -     self.handle_connection(cl_socket)
2020-01-28 14:51:59,583 [INFO ] W-model_1-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -   File "/opt/conda/lib/python3.6/site-packages/mms/model_service_worker.py", line 138, in handle_connection
2020-01-28 14:51:59,583 [INFO ] W-model_1-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -     cmd, msg = retrieve_msg(cl_socket)
2020-01-28 14:51:59,583 [INFO ] W-model_1-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -   File "/opt/conda/lib/python3.6/site-packages/mms/protocol/otf_message_handler.py", line 36, in retrieve_msg
2020-01-28 14:51:59,583 [INFO ] W-model_1-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -     cmd = _retrieve_buffer(conn, 1)
2020-01-28 14:51:59,583 [INFO ] W-model_1-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -   File "/opt/conda/lib/python3.6/site-packages/mms/protocol/otf_message_handler.py", line 163, in _retrieve_buffer
2020-01-28 14:51:59,583 [INFO ] W-model_1-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -     raise ValueError("Frontend disconnected")
2020-01-28 14:51:59,583 [INFO ] W-model_1-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - ValueError: Frontend disconnected
2020-01-28 14:52:19,800 [INFO ] W-9000-model_1 com.amazonaws.ml.mms.wlm.WorkerThread - Connecting to: /home/model-server/tmp/.mms.sock.9000
2020-01-28 14:52:19,800 [INFO ] epollEventLoopGroup-3-3 ACCESS_LOG - /172.17.0.1:40956 "PUT /models/model_1?min_worker=1&max_worker=1 HTTP/1.1" 202 0
2020-01-28 14:52:19,802 [ERROR] W-9000-model_1 com.amazonaws.ml.mms.wlm.WorkerThread - Backend worker error
com.amazonaws.ml.mms.wlm.WorkerInitializationException: Failed to connect to worker.
        at com.amazonaws.ml.mms.wlm.WorkerThread.connect(WorkerThread.java:355)
        at com.amazonaws.ml.mms.wlm.WorkerThread.run(WorkerThread.java:207)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: syscall:connect(..) failed: Connection refused: /home/model-server/tmp/.mms.sock.9000
        at io.netty.channel.unix.Socket.connect(..)(Unknown Source)
Caused by: io.netty.channel.unix.Errors$NativeConnectException: syscall:connect(..) failed: Connection refused
        ... 1 more

What is the recommended method to achieve my desired behavior? I guess I can unregister and register the models instead of using the scaling feature, if all else fails...

vdantu commented 4 years ago

At this point in time, max_worker is a placeholder; it doesn't affect the number of workers running in the system. Just use the min_worker option.

Setting min_worker to 0 removes all of the model's workers from the server. Since there is no auto-scaling of workers, you have to use the min_worker option to scale workers up and down. In other words, if you want 5 workers, send PUT /models/{model-name} with min_worker=5, and if you want to scale down to 2 workers, send PUT /models/{model-name} with min_worker=2.
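
Concretely, a sketch using the same endpoint and port as in the original post (adjust host, port, and model name to your setup):

# scale model_1 up to 5 workers
$ curl -X PUT "http://127.0.0.1:8080/models/model_1?min_worker=5"

# later, scale back down to 2 workers
$ curl -X PUT "http://127.0.0.1:8080/models/model_1?min_worker=2"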

Regarding your output of GET /models, there are no workers on the host. It's not clear what you tried to kill with nvidia-smi; my assumption is that it's a non-MMS process. Even if you run a model on the GPU, the backend worker itself runs on the CPU; this backend worker loads the model onto a GPU. Please share the output of nvidia-smi so we can check this further.

Regarding the exception above, it seems like there is a YAML warning:

2020-01-28 14:51:59,582 [WARN ] W-model_1-1-stderr com.amazonaws.ml.mms.wlm.WorkerLifeCycle - /home/model-server/model_handler.py:47: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.

Maybe this has something to do with why the backend worker is getting killed. Is this coming from your model code?
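
If that warning does come from a yaml.load() call in your handler, the usual fix is to pass an explicit Loader or use safe_load. A minimal sketch (the file name and surrounding code here are hypothetical, not taken from your handler):

import yaml

# Passing an explicit Loader (or using safe_load) avoids the
# YAMLLoadWarning raised by a bare yaml.load(f).
with open("config.yaml") as f:  # hypothetical config file
    conf = yaml.load(f, Loader=yaml.SafeLoader)
    # equivalent: conf = yaml.safe_load(f)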

davidas1 commented 4 years ago

Thanks for the detailed response. For now I've solved my issue by registering/unregistering models instead of scaling. I still think the issue with min_worker=0 should be looked at, because it leaves resources that are not released for some reason.
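
For reference, the workaround looks roughly like this (a sketch against the MMS management API on the same port as the original post; the model URL is taken from the GET output above, and initial_workers=1 is an assumption about the desired worker count):

# unregister model_1 to release its worker and GPU memory
$ curl -X DELETE "http://127.0.0.1:8080/models/model_1"

# re-register it later from the same model location
$ curl -X POST "http://127.0.0.1:8080/models?url=/opt/ml/model/model_1&initial_workers=1"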

maaquib commented 2 years ago

Fixed as part of #915