awslabs / multi-model-server

Multi Model Server is a tool for serving neural net models for inference
Apache License 2.0

Make healthcheck dependent on models loaded #1027

Open wojciechrauk-plutoflume opened 2 months ago

wojciechrauk-plutoflume commented 2 months ago

Good morning,

I'm using the MMS server to host a single model on AWS SageMaker. The model is loaded on MMS startup via sagemaker_inference.model_server.start_model_server with a custom handler_service.

The problem is that loading the model can take a significant amount of time (up to 20 minutes). While the model is being loaded, the MMS /ping endpoint returns 200 OK responses, causing the SageMaker endpoint to assume the model is already loaded.

Furthermore, I tested MMS locally to see how the server behaves during the model-loading stage, and it looks like the model details endpoint reports the worker status as READY before the worker has finished loading the model.

In the logs below you can see that the workers start, and immediately there's:

2024-04-23T11:01:33,313 [DEBUG] W-9000-test com.amazonaws.ml.mms.wlm.WorkerThread - W-9000-test State change null -> WORKER_MODEL_LOADED

after that there's

2024-04-23T11:01:33,317 [DEBUG] W-9000-test com.amazonaws.ml.mms.wlm.WorkerThread - W-9000-test State change null -> WORKER_STARTED
...
2024-04-23T11:01:33,322 [DEBUG] W-9000-test com.amazonaws.ml.mms.wlm.WorkerThread - W-9000-test State change null -> WORKER_STARTED

and only finally:

2024-04-23T11:03:33,657 [DEBUG] W-9000-test com.amazonaws.ml.mms.wlm.WorkerThread - W-9000-test State change WORKER_STARTED -> WORKER_MODEL_LOADED
...
2024-04-23T11:03:33,657 [DEBUG] W-9000-test com.amazonaws.ml.mms.wlm.WorkerThread - W-9000-test State change WORKER_STARTED -> WORKER_MODEL_LOADED

In the meantime /ping returns 200 OK and localhost:8081/models/test returns:

{
  "modelName": "test",
  "modelUrl": "test.mar",
  "runtime": "python3",
  "minWorkers": 2,
  "maxWorkers": 2,
  "batchSize": 1,
  "maxBatchDelay": 100,
  "loadedAtStartup": true,
  "workers": [
    {
      "id": "9000-57e7cf49",
      "startTime": "2024-04-23T11:01:33.316Z",
      "status": "READY",
      "gpu": false,
      "memoryUsage": 13099008
    },
    {
      "id": "9000-be19b2af",
      "startTime": "2024-04-23T11:01:33.317Z",
      "status": "READY",
      "gpu": false,
      "memoryUsage": 13434880
    }
  ]
}
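To make the mismatch easy to see, both endpoints can be polled side by side while the server starts up. A minimal sketch (assuming the default inference/management ports 8080/8081 and the model name `test` from above; `all_workers_ready` and `poll` are hypothetical helpers written for this illustration, not part of MMS):

```python
# Poll MMS's /ping (inference port 8080) and the model-details endpoint
# (management port 8081) while the model is still loading, to observe
# what each reports. Ports and model name follow the issue above.
import json
import time
import urllib.request


def all_workers_ready(model_details: dict) -> bool:
    """True when every worker in a model-details response reports READY."""
    workers = model_details.get("workers", [])
    return bool(workers) and all(w.get("status") == "READY" for w in workers)


# Response captured in the issue while the model was still loading --
# note that both workers already report READY at that point.
SAMPLE = {
    "modelName": "test",
    "workers": [
        {"id": "9000-57e7cf49", "status": "READY"},
        {"id": "9000-be19b2af", "status": "READY"},
    ],
}


def poll(model_name: str = "test", interval_s: float = 5.0) -> None:
    """Log what /ping and the management API report during startup."""
    while True:
        with urllib.request.urlopen("http://127.0.0.1:8080/ping") as r:
            ping_status = r.status
        url = f"http://127.0.0.1:8081/models/{model_name}"
        with urllib.request.urlopen(url) as r:
            details = json.load(r)
        print(ping_status, all_workers_ready(details))
        time.sleep(interval_s)
```

Running `poll()` during the two-minute load window reproduces the report: /ping answers 200 and the workers show READY well before the backend response arrives.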
All logs:

```log
Removing orphan pid file.
WARNING: sun.reflect.Reflection.getCallerClass is not supported. This will impact performance.
2024-04-23T11:01:33,126 [INFO ] main com.amazonaws.ml.mms.ModelServer -
MMS Home: /home/ec2-user/SageMaker/notebooks/environment/user/user-env-base-3-10/user-env-base-3-10/pex_venv/lib/python3.10/site-packages
Current directory: /tmp/models/test
Temp directory: /tmp
Number of GPUs: 0
Number of CPUs: 2
Max heap size: 978 M
Python executable: /home/ec2-user/SageMaker/notebooks/environment/user/user-env-base-3-10/user-env-base-3-10/pex_venv/bin/python3.10
Config file: config.properties
Inference address: http://127.0.0.1:8080
Management address: http://127.0.0.1:8081
Model Store: /tmp/models/test/export
Initial Models: test.mar
Log dir: null
Metrics dir: null
Netty threads: 0
Netty client threads: 0
Default workers per model: 2
Blacklist Regex: N/A
Maximum Response Size: 6553500
Maximum Request Size: 6553500
Preload model: false
Prefer direct buffer: false
2024-04-23T11:01:33,133 [INFO ] main com.amazonaws.ml.mms.ModelServer - Loading initial models: test.mar preload_model: false
2024-04-23T11:01:33,189 [INFO ] main com.amazonaws.ml.mms.archive.ModelArchive - model folder already exists: f05f8ac8472b74dad6310273915eb6a5de447740
2024-04-23T11:01:33,217 [WARN ] W-9000-test com.amazonaws.ml.mms.wlm.WorkerLifeCycle - attachIOStreams() threadName=W-9000-test
2024-04-23T11:01:33,309 [INFO ] W-9000-test-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - model_service_worker started with args: --sock-type unix --sock-name /tmp/.mms.sock.9000 --handler /tmp/models/test/model_handler.py:handle --model-path /tmp/models/f05f8ac8472b74dad6310273915eb6a5de447740 --model-name test --preload-model false --tmp-dir /tmp
2024-04-23T11:01:33,311 [INFO ] W-9000-test-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Listening on port: /tmp/.mms.sock.9000
2024-04-23T11:01:33,311 [INFO ] W-9000-test-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - [PID] 30549
2024-04-23T11:01:33,312 [INFO ] W-9000-test-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - MMS worker started.
2024-04-23T11:01:33,313 [INFO ] W-9000-test-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Python runtime: 3.10.12
2024-04-23T11:01:33,313 [DEBUG] W-9000-test com.amazonaws.ml.mms.wlm.WorkerThread - W-9000-test State change null -> WORKER_MODEL_LOADED
2024-04-23T11:01:33,314 [INFO ] W-9000-test MMS_METRICS - W-9000-test.ms:106|#Level:Host|#hostname:ip-10-10-24-132.eu-west-1.compute.internal,timestamp:1713870093
2024-04-23T11:01:33,314 [INFO ] main com.amazonaws.ml.mms.wlm.ModelManager - Model test loaded.
2024-04-23T11:01:33,315 [DEBUG] main com.amazonaws.ml.mms.wlm.ModelManager - updateModel: test, count: 2
2024-04-23T11:01:33,317 [DEBUG] W-9000-test com.amazonaws.ml.mms.wlm.WorkerThread - W-9000-test State change null -> WORKER_STARTED
2024-04-23T11:01:33,320 [INFO ] main com.amazonaws.ml.mms.ModelServer - Initialize Inference server with: EpollServerSocketChannel.
2024-04-23T11:01:33,322 [DEBUG] W-9000-test com.amazonaws.ml.mms.wlm.WorkerThread - W-9000-test State change null -> WORKER_STARTED
2024-04-23T11:01:33,334 [INFO ] W-9000-test com.amazonaws.ml.mms.wlm.WorkerThread - Connecting to: /tmp/.mms.sock.9000
2024-04-23T11:01:33,333 [INFO ] W-9000-test com.amazonaws.ml.mms.wlm.WorkerThread - Connecting to: /tmp/.mms.sock.9000
2024-04-23T11:01:33,415 [INFO ] main com.amazonaws.ml.mms.ModelServer - Inference API bind to: http://127.0.0.1:8080
2024-04-23T11:01:33,416 [INFO ] main com.amazonaws.ml.mms.ModelServer - Initialize Management server with: EpollServerSocketChannel.
2024-04-23T11:01:33,426 [INFO ] W-9000-test-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /tmp/.mms.sock.9000.
2024-04-23T11:01:33,429 [INFO ] W-9000-test-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /tmp/.mms.sock.9000.
2024-04-23T11:01:33,441 [INFO ] main com.amazonaws.ml.mms.ModelServer - Management API bind to: http://127.0.0.1:8081
Model server started.
2024-04-23T11:01:33,480 [WARN ] pool-3-thread-1 com.amazonaws.ml.mms.metrics.MetricCollector - worker pid is not available yet.
2024-04-23T11:01:33,553 [INFO ] W-9000-test-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - model initialising, waiting for 2 minutes
2024-04-23T11:01:33,554 [INFO ] W-9000-test-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - model initialising, waiting for 2 minutes
2024-04-23T11:01:33,592 [INFO ] pool-3-thread-1 MMS_METRICS - CPUUtilization.Percent:0.0|#Level:Host|#hostname:ip-10-10-24-132.eu-west-1.compute.internal,timestamp:1713870093
2024-04-23T11:01:33,598 [INFO ] pool-3-thread-1 MMS_METRICS - DiskAvailable.Gigabytes:48.51470947265625|#Level:Host|#hostname:ip-10-10-24-132.eu-west-1.compute.internal,timestamp:1713870093
2024-04-23T11:01:33,599 [INFO ] pool-3-thread-1 MMS_METRICS - DiskUsage.Gigabytes:86.47355270385742|#Level:Host|#hostname:ip-10-10-24-132.eu-west-1.compute.internal,timestamp:1713870093
2024-04-23T11:01:33,600 [INFO ] pool-3-thread-1 MMS_METRICS - DiskUtilization.Percent:64.1|#Level:Host|#hostname:ip-10-10-24-132.eu-west-1.compute.internal,timestamp:1713870093
2024-04-23T11:01:33,600 [INFO ] pool-3-thread-1 MMS_METRICS - MemoryAvailable.Megabytes:2041.51171875|#Level:Host|#hostname:ip-10-10-24-132.eu-west-1.compute.internal,timestamp:1713870093
2024-04-23T11:01:33,600 [INFO ] pool-3-thread-1 MMS_METRICS - MemoryUsed.Megabytes:1586.08984375|#Level:Host|#hostname:ip-10-10-24-132.eu-west-1.compute.internal,timestamp:1713870093
2024-04-23T11:01:33,602 [INFO ] pool-3-thread-1 MMS_METRICS - MemoryUtilization.Percent:47.8|#Level:Host|#hostname:ip-10-10-24-132.eu-west-1.compute.internal,timestamp:1713870093
2024-04-23T11:02:33,459 [WARN ] pool-3-thread-1 com.amazonaws.ml.mms.metrics.MetricCollector - worker pid is not available yet.
2024-04-23T11:02:33,528 [INFO ] pool-3-thread-1 MMS_METRICS - CPUUtilization.Percent:0.0|#Level:Host|#hostname:ip-10-10-24-132.eu-west-1.compute.internal,timestamp:1713870153
2024-04-23T11:02:33,529 [INFO ] pool-3-thread-1 MMS_METRICS - DiskAvailable.Gigabytes:48.51460266113281|#Level:Host|#hostname:ip-10-10-24-132.eu-west-1.compute.internal,timestamp:1713870153
2024-04-23T11:02:33,529 [INFO ] pool-3-thread-1 MMS_METRICS - DiskUsage.Gigabytes:86.47365951538086|#Level:Host|#hostname:ip-10-10-24-132.eu-west-1.compute.internal,timestamp:1713870153
2024-04-23T11:02:33,530 [INFO ] pool-3-thread-1 MMS_METRICS - DiskUtilization.Percent:64.1|#Level:Host|#hostname:ip-10-10-24-132.eu-west-1.compute.internal,timestamp:1713870153
2024-04-23T11:02:33,530 [INFO ] pool-3-thread-1 MMS_METRICS - MemoryAvailable.Megabytes:2014.546875|#Level:Host|#hostname:ip-10-10-24-132.eu-west-1.compute.internal,timestamp:1713870153
2024-04-23T11:02:33,530 [INFO ] pool-3-thread-1 MMS_METRICS - MemoryUsed.Megabytes:1613.0546875|#Level:Host|#hostname:ip-10-10-24-132.eu-west-1.compute.internal,timestamp:1713870153
2024-04-23T11:02:33,531 [INFO ] pool-3-thread-1 MMS_METRICS - MemoryUtilization.Percent:48.5|#Level:Host|#hostname:ip-10-10-24-132.eu-west-1.compute.internal,timestamp:1713870153
2024-04-23T11:03:33,445 [WARN ] pool-3-thread-2 com.amazonaws.ml.mms.metrics.MetricCollector - worker pid is not available yet.
2024-04-23T11:03:33,515 [INFO ] pool-3-thread-2 MMS_METRICS - CPUUtilization.Percent:0.0|#Level:Host|#hostname:ip-10-10-24-132.eu-west-1.compute.internal,timestamp:1713870213
2024-04-23T11:03:33,518 [INFO ] pool-3-thread-2 MMS_METRICS - DiskAvailable.Gigabytes:48.51456069946289|#Level:Host|#hostname:ip-10-10-24-132.eu-west-1.compute.internal,timestamp:1713870213
2024-04-23T11:03:33,519 [INFO ] pool-3-thread-2 MMS_METRICS - DiskUsage.Gigabytes:86.47370147705078|#Level:Host|#hostname:ip-10-10-24-132.eu-west-1.compute.internal,timestamp:1713870213
2024-04-23T11:03:33,520 [INFO ] pool-3-thread-2 MMS_METRICS - DiskUtilization.Percent:64.1|#Level:Host|#hostname:ip-10-10-24-132.eu-west-1.compute.internal,timestamp:1713870213
2024-04-23T11:03:33,521 [INFO ] pool-3-thread-2 MMS_METRICS - MemoryAvailable.Megabytes:2015.05078125|#Level:Host|#hostname:ip-10-10-24-132.eu-west-1.compute.internal,timestamp:1713870213
2024-04-23T11:03:33,523 [INFO ] pool-3-thread-2 MMS_METRICS - MemoryUsed.Megabytes:1612.55078125|#Level:Host|#hostname:ip-10-10-24-132.eu-west-1.compute.internal,timestamp:1713870213
2024-04-23T11:03:33,523 [INFO ] pool-3-thread-2 MMS_METRICS - MemoryUtilization.Percent:48.4|#Level:Host|#hostname:ip-10-10-24-132.eu-west-1.compute.internal,timestamp:1713870213
2024-04-23T11:03:33,653 [INFO ] W-9000-test-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - model finished waiting
2024-04-23T11:03:33,655 [INFO ] W-9000-test-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Model test loaded io_fd=0242b6fffe39090c-00007712-00000001-85c1e7725a6e3c25-be19b2af
2024-04-23T11:03:33,656 [INFO ] W-9000-test-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - model finished waiting
2024-04-23T11:03:33,655 [INFO ] W-9000-test com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 120099
2024-04-23T11:03:33,656 [INFO ] W-9000-test-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Model test loaded io_fd=0242b6fffe39090c-00007712-00000002-b0c9e7725a6e3c25-57e7cf49
2024-04-23T11:03:33,655 [INFO ] W-9000-test com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 120104
2024-04-23T11:03:33,657 [DEBUG] W-9000-test com.amazonaws.ml.mms.wlm.WorkerThread - W-9000-test State change WORKER_STARTED -> WORKER_MODEL_LOADED
2024-04-23T11:03:33,658 [INFO ] W-9000-test MMS_METRICS - W-9000-test.ms:120341|#Level:Host|#hostname:ip-10-10-24-132.eu-west-1.compute.internal,timestamp:1713870213
2024-04-23T11:03:33,658 [WARN ] W-9000-test com.amazonaws.ml.mms.wlm.WorkerLifeCycle - attachIOStreams() threadName=W-test-2
2024-04-23T11:03:33,657 [DEBUG] W-9000-test com.amazonaws.ml.mms.wlm.WorkerThread - W-9000-test State change WORKER_STARTED -> WORKER_MODEL_LOADED
2024-04-23T11:03:33,659 [INFO ] W-9000-test MMS_METRICS - W-9000-test.ms:120343|#Level:Host|#hostname:ip-10-10-24-132.eu-west-1.compute.internal,timestamp:1713870213
```

Questions:

  1. Is that the intended behaviour?
  2. Is it possible for the /ping endpoint to return something other than 200 OK while workers are still loading the model?

thank you! :)

btruhand commented 3 weeks ago

I was just looking into this to understand how ping and SageMaker UpdateEndpoint would work together with MMS. As far as I can tell, the ping handler can already return non-200 (a 503, to be exact). The problem is how the running check is determined here. If you trace the code, you'll see that the check is essentially equivalent to checking the first WorkerThread.getState() == WorkerState.WORKER_MODEL_LOADED.

The problem is that BOTH the server thread (the main thread/process running model_service_worker.py) and the worker threads are implemented as WorkerThread. For the server thread's WorkerThread, its state is set to WORKER_MODEL_LOADED as soon as the WorkerThread starts, and it remains that way. This corresponds to the first log line you showed.

Since the model is considered loaded when the first WorkerThread is running, rather than all of them, the ping handler quickly responds with 200 even if your model hasn't actually been loaded.
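A simplified sketch of that logic (written in Python rather than the actual Java, with hypothetical names; it only mirrors the shape of the check described above, not the real MMS implementation):

```python
# Sketch of why /ping turns healthy early: readiness looks at the first
# thread's state, and the server-side thread is marked WORKER_MODEL_LOADED
# as soon as it starts, before any real worker has loaded the model.
from enum import Enum


class WorkerState(Enum):
    WORKER_STARTED = 1
    WORKER_MODEL_LOADED = 2


def model_running(thread_states: list) -> bool:
    # Mirrors "first WorkerThread.getState() == WORKER_MODEL_LOADED"
    return bool(thread_states) and thread_states[0] == WorkerState.WORKER_MODEL_LOADED


# Thread states as suggested by the logs right after startup: the server
# thread (already WORKER_MODEL_LOADED) comes first; the two real workers
# are still only WORKER_STARTED -- yet the check already passes.
states = [
    WorkerState.WORKER_MODEL_LOADED,
    WorkerState.WORKER_STARTED,
    WorkerState.WORKER_STARTED,
]
print(model_running(states))  # True, even though no model is loaded yet
```

Checking all threads' states (or only the non-server threads') instead of the first would make /ping track actual model readiness.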

I think the quickest fix on your end, without a code change in MMS, is to preload the model; in that case the model is loaded when the Python backend, model_service_worker.py, is initially started.
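If you can supply your own config.properties, the startup banner in the logs above (`Preload model: false`, and the `--preload-model false` argument passed to model_service_worker) points at the relevant option; whether sagemaker_inference's generated config lets you override it may vary, so treat this as a sketch:

```properties
# config.properties -- ask MMS to load the model in the Python backend
# process at startup, before the worker threads attach
preload_model=true
```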

Note: if you're using this in SageMaker, the SageMaker docs say that /ping should return a 200 within 8 minutes of container startup.