aws / sagemaker-tensorflow-serving-container

A TensorFlow Serving solution for use in SageMaker. This repo is now deprecated.

[bug]: Model not loading when using an existing container image to set up an MME on SageMaker #170

Open abhi1793 opened 3 years ago

abhi1793 commented 3 years ago


Concise Description: Getting the error below when invoking a multi-model endpoint (MME) on SageMaker that was set up using the 763104351884.dkr.ecr.us-east-1.amazonaws.com/tensorflow-inference:2.3.0-cpu-py37-ubuntu18.04 container image.

```
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='localhost', port=14448): Max retries exceeded with url: /v1/models/d2295a7526f9df36354b8a2c4adc4f63 (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f70966dba50>: Failed to establish a new connection: [Errno 111] Connection refused'))

Traceback (most recent call last):
  File "/sagemaker/python_service.py", line 157, in _handle_load_model_post
    self._wait_for_model(model_name)
  File "/sagemaker/python_service.py", line 247, in _wait_for_model
    response = session.get(url)
  File "/usr/local/lib/python3.7/site-packages/requests/sessions.py", line 546, in get
    return self.request('GET', url, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/requests/sessions.py", line 533, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python3.7/site-packages/requests/sessions.py", line 646, in send
    r = adapter.send(request, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/requests/adapters.py", line 516, in send
    raise ConnectionError(e, request=request)
```

DLC image/dockerfile: 763104351884.dkr.ecr.us-east-1.amazonaws.com/tensorflow-inference:2.3.0-cpu-py37-ubuntu18.04

Current behavior: The load-model request fails with the ConnectionError shown above.

Expected behavior: The model should load and return predictions.

Additional context: I have set up an MME using the above-mentioned container and am invoking the endpoint from a Lambda function (roughly as in the sketch below). The model files are placed in S3 in the correct directory structure, with a version number.
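For context, invoking a target model on an MME from Lambda typically looks like this minimal sketch. The endpoint name, artifact name, and payload shape here are assumptions for illustration, not taken from the issue; `TargetModel` is the artifact path relative to the endpoint's S3 model prefix:

```python
import json

import boto3

runtime = boto3.client("sagemaker-runtime")

def handler(event, context):
    """Invoke one model on a SageMaker multi-model endpoint."""
    response = runtime.invoke_endpoint(
        EndpointName="my-mme-endpoint",  # hypothetical endpoint name
        # Artifact name under the endpoint's S3 model prefix (hypothetical).
        TargetModel="d2295a7526f9df36354b8a2c4adc4f63.tar.gz",
        ContentType="application/json",
        Body=json.dumps({"instances": event["instances"]}),
    )
    return json.loads(response["Body"].read())
```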

ajaykarpur commented 3 years ago

How large is your model? After the load-model request is sent, the container waits for a period of time to ensure the model is available to the model server. If the model is very large, the container might not wait long enough for the model to finish loading, which results in the ConnectionError you're seeing.
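For illustration, the wait behaves roughly like the sketch below: the container polls TensorFlow Serving's model-status endpoint with a bounded retry budget, so a model that is still loading when the retries run out surfaces as a MaxRetryError/ConnectionError like the one in the traceback. The function name, port, and retry settings are assumptions; the container's actual logic lives in /sagemaker/python_service.py (`_wait_for_model`):

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def wait_for_model(model_name: str, port: int = 14448) -> None:
    """Poll TF Serving's model-status endpoint until it answers.

    If TF Serving hasn't finished loading the model when the retry
    budget is exhausted, the final GET raises ConnectionError
    (wrapping urllib3's MaxRetryError) -- the failure shown above.
    """
    session = requests.Session()
    # Retry connection failures with exponential backoff (illustrative values).
    session.mount("http://", HTTPAdapter(max_retries=Retry(total=12, backoff_factor=0.5)))
    url = f"http://localhost:{port}/v1/models/{model_name}"
    response = session.get(url)
    response.raise_for_status()
```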