aws / sagemaker-pytorch-inference-toolkit

Toolkit for inference and serving with PyTorch on SageMaker. The Dockerfiles used to build the SageMaker PyTorch containers are at https://github.com/aws/deep-learning-containers.
Apache License 2.0

Fix: Don't load default model in MME mode #130

Closed nikhil-sk closed 1 year ago

nikhil-sk commented 1 year ago

Issue #, if available:

Description of changes:

  1. In MME (multi-model endpoint) mode, no default model should be loaded. Currently, the torchserve command attempts to load a default model named 'model' from the path /opt/ml/models.
  2. This change removes the command-line argument that loads the default model when the container is running in MME mode.

    Failure log:
    2022-09-07T21:03:56.608+02:00   2022-09-07T19:03:55,758 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - model_name: model, batchSize: 1
    2022-09-07T21:03:56.608+02:00   2022-09-07T19:03:55,808 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - Backend worker process died.
    2022-09-07T21:03:56.608+02:00   2022-09-07T19:03:55,808 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - Traceback (most recent call last):
    2022-09-07T21:03:56.608+02:00   2022-09-07T19:03:55,809 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - File "/opt/conda/lib/python3.8/site-packages/ts/model_service_worker.py", line 210, in <module>
    2022-09-07T21:03:56.608+02:00   2022-09-07T19:03:55,809 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - worker.run_server()
    2022-09-07T21:03:56.608+02:00   2022-09-07T19:03:55,809 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - File "/opt/conda/lib/python3.8/site-packages/ts/model_service_worker.py", line 181, in run_server
    2022-09-07T21:03:56.608+02:00   2022-09-07T19:03:55,809 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - self.handle_connection(cl_socket)
    2022-09-07T21:03:56.608+02:00   2022-09-07T19:03:55,809 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - File "/opt/conda/lib/python3.8/site-packages/ts/model_service_worker.py", line 139, in handle_connection
    2022-09-07T21:03:56.608+02:00   2022-09-07T19:03:55,809 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - service, result, code = self.load_model(msg)
    2022-09-07T21:03:56.608+02:00   2022-09-07T19:03:55,809 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - File "/opt/conda/lib/python3.8/site-packages/ts/model_service_worker.py", line 104, in load_model
    2022-09-07T21:03:56.608+02:00   2022-09-07T19:03:55,809 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - service = model_loader.load(
    2022-09-07T21:03:56.608+02:00   2022-09-07T19:03:55,809 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - File "/opt/conda/lib/python3.8/site-packages/ts/model_loader.py", line 151, in load
    2022-09-07T21:03:56.608+02:00   2022-09-07T19:03:55,809 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - initialize_fn(service.context)
    2022-09-07T21:03:56.608+02:00   2022-09-07T19:03:55,810 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - File "/opt/conda/lib/python3.8/site-packages/sagemaker_pytorch_serving_container/handler_service.py", line 51, in initialize
    2022-09-07T21:03:56.608+02:00   2022-09-07T19:03:55,810 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - super().initialize(context)
    2022-09-07T21:03:56.608+02:00   2022-09-07T19:03:55,810 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - File "/opt/conda/lib/python3.8/site-packages/sagemaker_inference/default_handler_service.py", line 66, in initialize
    2022-09-07T21:03:56.608+02:00   2022-09-07T19:03:55,810 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - self._service.validate_and_initialize(model_dir=model_dir, context=context)
    2022-09-07T21:03:56.608+02:00   2022-09-07T19:03:55,810 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - File "/opt/conda/lib/python3.8/site-packages/sagemaker_inference/transformer.py", line 178, in validate_and_initialize
    2022-09-07T21:03:56.608+02:00   2022-09-07T19:03:55,810 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - self._model = self._run_handler_function(self._model_fn, *(model_dir,))
    2022-09-07T21:03:56.608+02:00   2022-09-07T19:03:55,810 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - File "/opt/conda/lib/python3.8/site-packages/sagemaker_inference/transformer.py", line 266, in _run_handler_function
    2022-09-07T21:03:56.608+02:00   2022-09-07T19:03:55,810 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - result = func(*argv)
    2022-09-07T21:03:56.608+02:00   2022-09-07T19:03:55,811 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - File "/opt/conda/lib/python3.8/site-packages/sagemaker_pytorch_serving_container/default_pytorch_inference_handler.py", line 73, in default_model_fn
    2022-09-07T21:03:56.608+02:00   2022-09-07T19:03:55,811 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - raise ValueError(
    2022-09-07T21:03:56.608+02:00   2022-09-07T19:03:55,811 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - ValueError: Exactly one .pth or .pt file is required for PyTorch models: []
    2022-09-07T21:03:56.608+02:00   2022-09-07T19:03:55,817 [INFO ] epollEventLoopGroup-5-1 org.pytorch.serve.wlm.WorkerThread - 9000 Worker disconnected. WORKER_STARTED
    2022-09-07T21:03:56.608+02:00   2022-09-07T19:03:55,818 [WARN ] W-9000-model_1.0 org.pytorch.serve.wlm.BatchAggregator - Load model failed: model, error: Worker died.
    ...
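
The traceback above ends in `default_model_fn` raising a `ValueError` because the model directory contains no `.pt` or `.pth` file, which is expected in MME mode before any model has been loaded. The check can be sketched roughly as follows. This is an illustration only, not the toolkit's actual code: the helper name `find_single_model_file` is hypothetical, and only the error message is taken from the log.

```python
import os

# Extensions named in the error message from the failure log.
MODEL_EXTENSIONS = (".pt", ".pth")


def find_single_model_file(model_dir):
    """Hypothetical helper: return the path of the only .pt/.pth file in model_dir.

    Raises ValueError when zero or several candidate files are present,
    mirroring the message seen in the failure log.
    """
    candidates = sorted(
        name for name in os.listdir(model_dir) if name.endswith(MODEL_EXTENSIONS)
    )
    if len(candidates) != 1:
        raise ValueError(
            "Exactly one .pth or .pt file is required for PyTorch models: {}".format(
                candidates
            )
        )
    return os.path.join(model_dir, candidates[0])
```

With an empty directory the candidate list is `[]`, which matches the trailing `[]` in the logged error. The worker dies on that exception, and torchserve then reports `Load model failed: model, error: Worker died.`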

Fixed log (No default model loaded when torchserve starts)

Metrics report format: prometheus
Enable metrics API: true
Workflow Store: /
Model config: N/A
2022-10-31T07:12:55,633 [INFO ] main org.pytorch.serve.servingsdk.impl.PluginsManager -  Loading snapshot serializer plugin...
2022-10-31T07:12:55,651 [INFO ] main org.pytorch.serve.ModelServer - Initialize Inference server with: EpollServerSocketChannel.
2022-10-31T07:12:55,696 [INFO ] main org.pytorch.serve.ModelServer - Inference API bind to: http://0.0.0.0:8080
2022-10-31T07:12:55,697 [INFO ] main org.pytorch.serve.ModelServer - Initialize Metrics server with: EpollServerSocketChannel.
2022-10-31T07:12:55,698 [INFO ] main org.pytorch.serve.ModelServer - Metrics API bind to: http://127.0.0.1:8082
Model server started.
2022-10-31T07:12:55,914 [INFO ] pool-3-thread-1 TS_METRICS - CPUUtilization.Percent:0.0|#Level:Host|#hostname:container-1.local,timestamp:1667200375
2022-10-31T07:12:55,914 [INFO ] pool-3-thread-1 TS_METRICS - DiskAvailable.Gigabytes:40.57777786254883|#Level:Host|#hostname:container-1.local,timestamp:1667200375
2022-10-31T07:12:55,914 [INFO ] pool-3-thread-1 TS_METRICS - DiskUsage.Gigabytes:11.410484313964844|#Level:Host|#hostname:container-1.local,timestamp:1667200375
2022-10-31T07:12:55,914 [INFO ] pool-3-thread-1 TS_METRICS - DiskUtilization.Percent:21.9|#Level:Host|#hostname:container-1.local,timestamp:1667200375
2022-10-31T07:12:55,914 [INFO ] pool-3-thread-1 TS_METRICS - MemoryAvailable.Megabytes:6150.69921875|#Level:Host|#hostname:container-1.local,timestamp:1667200375
2022-10-31T07:12:55,915 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUsed.Megabytes:1175.19921875|#Level:Host|#hostname:container-1.local,timestamp:1667200375
2022-10-31T07:12:55,915 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUtilization.Percent:19.3|#Level:Host|#hostname:container-1.local,timestamp:1667200375
2022-10-31T07:12:57,847 [INFO ] pool-2-thread-1 ACCESS_LOG - /169.254.178.2:35152 "GET /ping HTTP/1.1" 200 13
2022-10-31T07:12:57,847 [INFO ] pool-2-thread-1 TS_METRICS - Requests2XX.Count:1|#Level:Host|#hostname:container-1.local,timestamp:1667200377
2022-10-31T07:12:57,866 [INFO ] epollEventLoopGroup-3-1 ACCESS_LOG - /169.254.178.2:35152 "GET /models HTTP/1.1" 200 2
2022-10-31T07:12:57,866 [INFO ] epollEventLoopGroup-3-1 TS_METRICS - Requests2XX.Count:1|#Level:Host|#hostname:container-1.local,timestamp:1667200377
2022-10-31T07:13:02,752 [INFO ] pool-2-thread-1 ACCESS_LOG - /169.254.178.2:35152 "GET /ping HTTP/1.1" 200 1
2022-10-31T07:13:02,752 [INFO ] pool-2-thread-1 TS_METRICS - Requests2XX.Count:1|#Level:Host|#hostname:container-1.local,timestamp:1667200377
2022-10-31T07:13:07,751 [INFO ] pool-2-thread-1 ACCESS_LOG - /169.254.178.2:35152 "GET /ping HTTP/1.1" 200 0
2022-10-31T07:13:07,751 [INFO ] pool-2-thread-1 TS_METRICS - Requests2XX.Count:1|#Level:Host|#hostname:container-1.local,timestamp:1667200377
2022-10-31T07:13:12,751 [INFO ] pool-2-thread-1 ACCESS_LOG - /169.254.178.2:35152 "GET /ping HTTP/1.1" 200 0
2022-10-31T07:13:12,751 [INFO ] pool-2-thread-1 TS_METRICS - Requests2XX.Count:1|#Level:Host|#hostname:container-1.local,timestamp:1667200377
2022-10-31T07:13:17,751 [INFO ] pool-2-thread-1 ACCESS_LOG - /169.254.178.2:35152 "GET /ping HTTP/1.1" 200 0
2022-10-31T07:13:17,751 [INFO ] pool-2-thread-1 TS_METRICS - Requests2XX.Count:1|#Level:Host|#hostname:container-1.local,timestamp:1667200377
2022-10-31T07:13:22,751 [INFO ] pool-2-thread-1 ACCESS_LOG - /169.254.178.2:35152 "GET /ping HTTP/1.1" 200 0
2022-10-31T07:13:22,751 [INFO ] pool-2-thread-1 TS_METRICS - Requests2XX.Count:1|#Level:Host|#hostname:container-1.local,timestamp:1667200377
2022-10-31T07:13:27,751 [INFO ] pool-2-thread-1 ACCESS_LOG - /169.254.178.2:35152 "GET /ping HTTP/1.1" 200 0
2022-10-31T07:13:27,751 [INFO ] pool-2-thread-1 TS_METRICS - Requests2XX.Count:1|#Level:Host|#hostname:container-1.local,timestamp:1667200377
2022-10-31T07:13:32,751 [INFO ] pool-2-thread-1 ACCESS_LOG - /169.254.178.2:35152 "GET /ping HTTP/1.1" 200 0
2022-10-31T07:13:32,751 [INFO ] pool-2-thread-1 TS_METRICS - Requests2XX.Count:1|#Level:Host|#hostname:container-1.local,timestamp:1667200377
2022-10-31T07:13:37,751 [INFO ] pool-2-thread-1 ACCESS_LOG - /169.254.178.2:35152 "GET /ping HTTP/1.1" 200 0
2022-10-31T07:13:37,751 [INFO ] pool-2-thread-1 TS_METRICS - Requests2XX.Count:1|#Level:Host|#hostname:container-1.local,timestamp:1667200377
2022-10-31T07:13:42,751 [INFO ] pool-2-thread-1 ACCESS_LOG - /169.254.178.2:35152 "GET /ping HTTP/1.1" 200 0
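
The clean startup above is what one would expect when the launch command simply omits the default model in MME mode. A minimal sketch of that conditional follows, under stated assumptions: the function names are hypothetical, the `SAGEMAKER_MULTI_MODEL` environment variable is assumed to be how the container detects MME mode, and the command is reduced to the `--start`, `--model-store` and `--models` options of the `torchserve` CLI. It is not the code from this pull request.

```python
import os

# Name of the default model, as it appears in the failure log ("model_name: model").
DEFAULT_MODEL_NAME = "model"


def is_multi_model(env=None):
    """Assumption: MME mode is signalled by SAGEMAKER_MULTI_MODEL=true."""
    env = os.environ if env is None else env
    return env.get("SAGEMAKER_MULTI_MODEL", "false").lower() == "true"


def build_torchserve_cmd(model_dir, multi_model):
    """Hypothetical helper: build the torchserve launch command as an argv list."""
    cmd = ["torchserve", "--start", "--model-store", model_dir]
    if not multi_model:
        # Single-model mode: register the default model at startup.
        cmd += ["--models", "{}={}".format(DEFAULT_MODEL_NAME, model_dir)]
    # MME mode: no --models argument, so the server starts without any model
    # and models are loaded later on request.
    return cmd
```

For example, `build_torchserve_cmd("/opt/ml/models", multi_model=True)` yields a command without `--models`, while the single-model call appends `--models model=<model_dir>`.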

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

sagemaker-bot commented 1 year ago

AWS CodeBuild CI Report

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository
