awslabs / multi-model-server

Multi Model Server is a tool for serving neural net models for inference
Apache License 2.0

Make healthcheck dependent on models loaded #1027

Open wojciechrauk-plutoflume opened 2 months ago

wojciechrauk-plutoflume commented 2 months ago

Good morning,

I'm using the MMS server to host a single model on AWS SageMaker. The model is loaded on MMS startup via sagemaker_inference.model_server.start_model_server with a custom handler_service.

The problem is that loading the model can take a significant amount of time (up to 20 minutes). While the model is being loaded, the MMS /ping endpoint returns 200 OK responses, causing the SageMaker endpoint to assume the model is already loaded.

Furthermore, I tested MMS locally to see how the server behaves during the model-loading stage, and it looks like the model details endpoint reports the worker status as READY before the worker has finished loading the model.

In the logs below you can see that the workers start, and immediately there's:

2024-04-23T11:01:33,313 [DEBUG] W-9000-test com.amazonaws.ml.mms.wlm.WorkerThread - W-9000-test State change null -> WORKER_MODEL_LOADED

after that there's

2024-04-23T11:01:33,317 [DEBUG] W-9000-test com.amazonaws.ml.mms.wlm.WorkerThread - W-9000-test State change null -> WORKER_STARTED
...
2024-04-23T11:01:33,322 [DEBUG] W-9000-test com.amazonaws.ml.mms.wlm.WorkerThread - W-9000-test State change null -> WORKER_STARTED

and only finally:

2024-04-23T11:03:33,657 [DEBUG] W-9000-test com.amazonaws.ml.mms.wlm.WorkerThread - W-9000-test State change WORKER_STARTED -> WORKER_MODEL_LOADED
...
2024-04-23T11:03:33,657 [DEBUG] W-9000-test com.amazonaws.ml.mms.wlm.WorkerThread - W-9000-test State change WORKER_STARTED -> WORKER_MODEL_LOADED

In the meantime /ping returns 200 OK and localhost:8081/models/test returns:

{
  "modelName": "test",
  "modelUrl": "test.mar",
  "runtime": "python3",
  "minWorkers": 2,
  "maxWorkers": 2,
  "batchSize": 1,
  "maxBatchDelay": 100,
  "loadedAtStartup": true,
  "workers": [
    {
      "id": "9000-57e7cf49",
      "startTime": "2024-04-23T11:01:33.316Z",
      "status": "READY",
      "gpu": false,
      "memoryUsage": 13099008
    },
    {
      "id": "9000-be19b2af",
      "startTime": "2024-04-23T11:01:33.317Z",
      "status": "READY",
      "gpu": false,
      "memoryUsage": 13434880
    }
  ]
}
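To make the mismatch easy to see, both endpoints can be polled side by side while the server starts up. A minimal sketch (assuming the default inference/management ports 8080/8081 and the model name `test` from above; `all_workers_ready` and `poll` are hypothetical helpers written for this illustration, not part of MMS):

```python
# Poll MMS's /ping (inference port 8080) and the model-details endpoint
# (management port 8081) while the model is still loading, to observe
# what each reports. Ports and model name follow the issue above.
import json
import time
import urllib.request


def all_workers_ready(model_details: dict) -> bool:
    """True when every worker in a model-details response reports READY."""
    workers = model_details.get("workers", [])
    return bool(workers) and all(w.get("status") == "READY" for w in workers)


# Response captured in the issue while the model was still loading --
# note that both workers already report READY at that point.
SAMPLE = {
    "modelName": "test",
    "workers": [
        {"id": "9000-57e7cf49", "status": "READY"},
        {"id": "9000-be19b2af", "status": "READY"},
    ],
}


def poll(model_name: str = "test", interval_s: float = 5.0) -> None:
    """Log what /ping and the management API report during startup."""
    while True:
        with urllib.request.urlopen("http://127.0.0.1:8080/ping") as r:
            ping_status = r.status
        url = f"http://127.0.0.1:8081/models/{model_name}"
        with urllib.request.urlopen(url) as r:
            details = json.load(r)
        print(ping_status, all_workers_ready(details))
        time.sleep(interval_s)
```

Running `poll()` during the two-minute load window reproduces the report: /ping answers 200 and the workers show READY well before the backend response arrives.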
All logs:

```log
Removing orphan pid file.
WARNING: sun.reflect.Reflection.getCallerClass is not supported. This will impact performance.
2024-04-23T11:01:33,126 [INFO ] main com.amazonaws.ml.mms.ModelServer -
MMS Home: /home/ec2-user/SageMaker/notebooks/environment/user/user-env-base-3-10/user-env-base-3-10/pex_venv/lib/python3.10/site-packages
Current directory: /tmp/models/test
Temp directory: /tmp
Number of GPUs: 0
Number of CPUs: 2
Max heap size: 978 M
Python executable: /home/ec2-user/SageMaker/notebooks/environment/user/user-env-base-3-10/user-env-base-3-10/pex_venv/bin/python3.10
Config file: config.properties
Inference address: http://127.0.0.1:8080
Management address: http://127.0.0.1:8081
Model Store: /tmp/models/test/export
Initial Models: test.mar
Log dir: null
Metrics dir: null
Netty threads: 0
Netty client threads: 0
Default workers per model: 2
Blacklist Regex: N/A
Maximum Response Size: 6553500
Maximum Request Size: 6553500
Preload model: false
Prefer direct buffer: false
2024-04-23T11:01:33,133 [INFO ] main com.amazonaws.ml.mms.ModelServer - Loading initial models: test.mar preload_model: false
2024-04-23T11:01:33,189 [INFO ] main com.amazonaws.ml.mms.archive.ModelArchive - model folder already exists: f05f8ac8472b74dad6310273915eb6a5de447740
2024-04-23T11:01:33,217 [WARN ] W-9000-test com.amazonaws.ml.mms.wlm.WorkerLifeCycle - attachIOStreams() threadName=W-9000-test
2024-04-23T11:01:33,309 [INFO ] W-9000-test-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - model_service_worker started with args: --sock-type unix --sock-name /tmp/.mms.sock.9000 --handler /tmp/models/test/model_handler.py:handle --model-path /tmp/models/f05f8ac8472b74dad6310273915eb6a5de447740 --model-name test --preload-model false --tmp-dir /tmp
2024-04-23T11:01:33,311 [INFO ] W-9000-test-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Listening on port: /tmp/.mms.sock.9000
2024-04-23T11:01:33,311 [INFO ] W-9000-test-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - [PID] 30549
2024-04-23T11:01:33,312 [INFO ] W-9000-test-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - MMS worker started.
2024-04-23T11:01:33,313 [INFO ] W-9000-test-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Python runtime: 3.10.12
2024-04-23T11:01:33,313 [DEBUG] W-9000-test com.amazonaws.ml.mms.wlm.WorkerThread - W-9000-test State change null -> WORKER_MODEL_LOADED
2024-04-23T11:01:33,314 [INFO ] W-9000-test MMS_METRICS - W-9000-test.ms:106|#Level:Host|#hostname:ip-10-10-24-132.eu-west-1.compute.internal,timestamp:1713870093
2024-04-23T11:01:33,314 [INFO ] main com.amazonaws.ml.mms.wlm.ModelManager - Model test loaded.
2024-04-23T11:01:33,315 [DEBUG] main com.amazonaws.ml.mms.wlm.ModelManager - updateModel: test, count: 2
2024-04-23T11:01:33,317 [DEBUG] W-9000-test com.amazonaws.ml.mms.wlm.WorkerThread - W-9000-test State change null -> WORKER_STARTED
2024-04-23T11:01:33,320 [INFO ] main com.amazonaws.ml.mms.ModelServer - Initialize Inference server with: EpollServerSocketChannel.
2024-04-23T11:01:33,322 [DEBUG] W-9000-test com.amazonaws.ml.mms.wlm.WorkerThread - W-9000-test State change null -> WORKER_STARTED
2024-04-23T11:01:33,334 [INFO ] W-9000-test com.amazonaws.ml.mms.wlm.WorkerThread - Connecting to: /tmp/.mms.sock.9000
2024-04-23T11:01:33,333 [INFO ] W-9000-test com.amazonaws.ml.mms.wlm.WorkerThread - Connecting to: /tmp/.mms.sock.9000
2024-04-23T11:01:33,415 [INFO ] main com.amazonaws.ml.mms.ModelServer - Inference API bind to: http://127.0.0.1:8080
2024-04-23T11:01:33,416 [INFO ] main com.amazonaws.ml.mms.ModelServer - Initialize Management server with: EpollServerSocketChannel.
2024-04-23T11:01:33,426 [INFO ] W-9000-test-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /tmp/.mms.sock.9000.
2024-04-23T11:01:33,429 [INFO ] W-9000-test-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /tmp/.mms.sock.9000.
2024-04-23T11:01:33,441 [INFO ] main com.amazonaws.ml.mms.ModelServer - Management API bind to: http://127.0.0.1:8081
Model server started.
2024-04-23T11:01:33,480 [WARN ] pool-3-thread-1 com.amazonaws.ml.mms.metrics.MetricCollector - worker pid is not available yet.
2024-04-23T11:01:33,553 [INFO ] W-9000-test-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - model initialising, waiting for 2 minutes
2024-04-23T11:01:33,554 [INFO ] W-9000-test-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - model initialising, waiting for 2 minutes
2024-04-23T11:01:33,592 [INFO ] pool-3-thread-1 MMS_METRICS - CPUUtilization.Percent:0.0|#Level:Host|#hostname:ip-10-10-24-132.eu-west-1.compute.internal,timestamp:1713870093
2024-04-23T11:01:33,598 [INFO ] pool-3-thread-1 MMS_METRICS - DiskAvailable.Gigabytes:48.51470947265625|#Level:Host|#hostname:ip-10-10-24-132.eu-west-1.compute.internal,timestamp:1713870093
2024-04-23T11:01:33,599 [INFO ] pool-3-thread-1 MMS_METRICS - DiskUsage.Gigabytes:86.47355270385742|#Level:Host|#hostname:ip-10-10-24-132.eu-west-1.compute.internal,timestamp:1713870093
2024-04-23T11:01:33,600 [INFO ] pool-3-thread-1 MMS_METRICS - DiskUtilization.Percent:64.1|#Level:Host|#hostname:ip-10-10-24-132.eu-west-1.compute.internal,timestamp:1713870093
2024-04-23T11:01:33,600 [INFO ] pool-3-thread-1 MMS_METRICS - MemoryAvailable.Megabytes:2041.51171875|#Level:Host|#hostname:ip-10-10-24-132.eu-west-1.compute.internal,timestamp:1713870093
2024-04-23T11:01:33,600 [INFO ] pool-3-thread-1 MMS_METRICS - MemoryUsed.Megabytes:1586.08984375|#Level:Host|#hostname:ip-10-10-24-132.eu-west-1.compute.internal,timestamp:1713870093
2024-04-23T11:01:33,602 [INFO ] pool-3-thread-1 MMS_METRICS - MemoryUtilization.Percent:47.8|#Level:Host|#hostname:ip-10-10-24-132.eu-west-1.compute.internal,timestamp:1713870093
2024-04-23T11:02:33,459 [WARN ] pool-3-thread-1 com.amazonaws.ml.mms.metrics.MetricCollector - worker pid is not available yet.
2024-04-23T11:02:33,528 [INFO ] pool-3-thread-1 MMS_METRICS - CPUUtilization.Percent:0.0|#Level:Host|#hostname:ip-10-10-24-132.eu-west-1.compute.internal,timestamp:1713870153
2024-04-23T11:02:33,529 [INFO ] pool-3-thread-1 MMS_METRICS - DiskAvailable.Gigabytes:48.51460266113281|#Level:Host|#hostname:ip-10-10-24-132.eu-west-1.compute.internal,timestamp:1713870153
2024-04-23T11:02:33,529 [INFO ] pool-3-thread-1 MMS_METRICS - DiskUsage.Gigabytes:86.47365951538086|#Level:Host|#hostname:ip-10-10-24-132.eu-west-1.compute.internal,timestamp:1713870153
2024-04-23T11:02:33,530 [INFO ] pool-3-thread-1 MMS_METRICS - DiskUtilization.Percent:64.1|#Level:Host|#hostname:ip-10-10-24-132.eu-west-1.compute.internal,timestamp:1713870153
2024-04-23T11:02:33,530 [INFO ] pool-3-thread-1 MMS_METRICS - MemoryAvailable.Megabytes:2014.546875|#Level:Host|#hostname:ip-10-10-24-132.eu-west-1.compute.internal,timestamp:1713870153
2024-04-23T11:02:33,530 [INFO ] pool-3-thread-1 MMS_METRICS - MemoryUsed.Megabytes:1613.0546875|#Level:Host|#hostname:ip-10-10-24-132.eu-west-1.compute.internal,timestamp:1713870153
2024-04-23T11:02:33,531 [INFO ] pool-3-thread-1 MMS_METRICS - MemoryUtilization.Percent:48.5|#Level:Host|#hostname:ip-10-10-24-132.eu-west-1.compute.internal,timestamp:1713870153
2024-04-23T11:03:33,445 [WARN ] pool-3-thread-2 com.amazonaws.ml.mms.metrics.MetricCollector - worker pid is not available yet.
2024-04-23T11:03:33,515 [INFO ] pool-3-thread-2 MMS_METRICS - CPUUtilization.Percent:0.0|#Level:Host|#hostname:ip-10-10-24-132.eu-west-1.compute.internal,timestamp:1713870213
2024-04-23T11:03:33,518 [INFO ] pool-3-thread-2 MMS_METRICS - DiskAvailable.Gigabytes:48.51456069946289|#Level:Host|#hostname:ip-10-10-24-132.eu-west-1.compute.internal,timestamp:1713870213
2024-04-23T11:03:33,519 [INFO ] pool-3-thread-2 MMS_METRICS - DiskUsage.Gigabytes:86.47370147705078|#Level:Host|#hostname:ip-10-10-24-132.eu-west-1.compute.internal,timestamp:1713870213
2024-04-23T11:03:33,520 [INFO ] pool-3-thread-2 MMS_METRICS - DiskUtilization.Percent:64.1|#Level:Host|#hostname:ip-10-10-24-132.eu-west-1.compute.internal,timestamp:1713870213
2024-04-23T11:03:33,521 [INFO ] pool-3-thread-2 MMS_METRICS - MemoryAvailable.Megabytes:2015.05078125|#Level:Host|#hostname:ip-10-10-24-132.eu-west-1.compute.internal,timestamp:1713870213
2024-04-23T11:03:33,523 [INFO ] pool-3-thread-2 MMS_METRICS - MemoryUsed.Megabytes:1612.55078125|#Level:Host|#hostname:ip-10-10-24-132.eu-west-1.compute.internal,timestamp:1713870213
2024-04-23T11:03:33,523 [INFO ] pool-3-thread-2 MMS_METRICS - MemoryUtilization.Percent:48.4|#Level:Host|#hostname:ip-10-10-24-132.eu-west-1.compute.internal,timestamp:1713870213
2024-04-23T11:03:33,653 [INFO ] W-9000-test-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - model finished waiting
2024-04-23T11:03:33,655 [INFO ] W-9000-test-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Model test loaded io_fd=0242b6fffe39090c-00007712-00000001-85c1e7725a6e3c25-be19b2af
2024-04-23T11:03:33,656 [INFO ] W-9000-test-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - model finished waiting
2024-04-23T11:03:33,655 [INFO ] W-9000-test com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 120099
2024-04-23T11:03:33,656 [INFO ] W-9000-test-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Model test loaded io_fd=0242b6fffe39090c-00007712-00000002-b0c9e7725a6e3c25-57e7cf49
2024-04-23T11:03:33,655 [INFO ] W-9000-test com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 120104
2024-04-23T11:03:33,657 [DEBUG] W-9000-test com.amazonaws.ml.mms.wlm.WorkerThread - W-9000-test State change WORKER_STARTED -> WORKER_MODEL_LOADED
2024-04-23T11:03:33,658 [INFO ] W-9000-test MMS_METRICS - W-9000-test.ms:120341|#Level:Host|#hostname:ip-10-10-24-132.eu-west-1.compute.internal,timestamp:1713870213
2024-04-23T11:03:33,658 [WARN ] W-9000-test com.amazonaws.ml.mms.wlm.WorkerLifeCycle - attachIOStreams() threadName=W-test-2
2024-04-23T11:03:33,657 [DEBUG] W-9000-test com.amazonaws.ml.mms.wlm.WorkerThread - W-9000-test State change WORKER_STARTED -> WORKER_MODEL_LOADED
2024-04-23T11:03:33,659 [INFO ] W-9000-test MMS_METRICS - W-9000-test.ms:120343|#Level:Host|#hostname:ip-10-10-24-132.eu-west-1.compute.internal,timestamp:1713870213
```

Questions:

  1. Is that the intended behaviour?
  2. Is it possible for the /ping endpoint to return something other than 200 OK while workers are still loading the model?

thank you! :)

btruhand commented 3 weeks ago

I was just looking into this to understand how ping and SageMaker UpdateEndpoint would work together with MMS. As far as I can tell, the ping handler can already return non-200 (a 503, to be exact). The problem is how the running check is determined here. If you trace the code, you'll see that the check is essentially equivalent to checking the first WorkerThread.getState() == WorkerState.WORKER_MODEL_LOADED.

The problem is that BOTH the server thread (the main thread/process running model_service_worker.py) and the worker threads are implemented as WorkerThread. For the server thread's WorkerThread, its state is set to WORKER_MODEL_LOADED as soon as the WorkerThread starts, and it remains that way. This corresponds to the first log line you showed.

Since the model is considered loaded when the first WorkerThread is running, rather than all of them, the ping handler quickly responds with 200 even if your model hasn't actually been loaded.
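A simplified sketch of that logic (written in Python rather than the actual Java, with hypothetical names; it only mirrors the shape of the check described above, not the real MMS implementation):

```python
# Sketch of why /ping turns healthy early: readiness looks at the first
# thread's state, and the server-side thread is marked WORKER_MODEL_LOADED
# as soon as it starts, before any real worker has loaded the model.
from enum import Enum


class WorkerState(Enum):
    WORKER_STARTED = 1
    WORKER_MODEL_LOADED = 2


def model_running(thread_states: list) -> bool:
    # Mirrors "first WorkerThread.getState() == WORKER_MODEL_LOADED"
    return bool(thread_states) and thread_states[0] == WorkerState.WORKER_MODEL_LOADED


# Thread states as suggested by the logs right after startup: the server
# thread (already WORKER_MODEL_LOADED) comes first; the two real workers
# are still only WORKER_STARTED -- yet the check already passes.
states = [
    WorkerState.WORKER_MODEL_LOADED,
    WorkerState.WORKER_STARTED,
    WorkerState.WORKER_STARTED,
]
print(model_running(states))  # True, even though no model is loaded yet
```

Checking all threads' states (or only the non-server threads') instead of the first would make /ping track actual model readiness.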

I think the quickest fix on your end, without a code change in MMS, is to preload the model; in that case the model is loaded when the Python backend, model_service_worker.py, is initially started.
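If you can supply your own config.properties, the startup banner in the logs above (`Preload model: false`, and the `--preload-model false` argument passed to model_service_worker) points at the relevant option; whether sagemaker_inference's generated config lets you override it may vary, so treat this as a sketch:

```properties
# config.properties -- ask MMS to load the model in the Python backend
# process at startup, before the worker threads attach
preload_model=true
```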

Note: if you're using this in SageMaker, the SageMaker docs say that /ping should return a 200 within 8 minutes of container startup.