Open abhi1793 opened 3 years ago
How large is your model? After the load-model request is sent, the container waits a fixed period of time to ensure the model is available to the model server. If the model is very large, the container might not wait long enough for the model to load, causing a ConnectionError.
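One client-side workaround is to poll the model-status endpoint yourself and only send traffic once the model server responds, retrying through the connection-refused phase. A minimal sketch (the base URL, port, and timeout values are hypothetical placeholders, not the container's actual configuration; uses only the standard library):

```python
import time
import urllib.error
import urllib.request


def wait_for_model(base_url, model_name, timeout=300.0, interval=5.0,
                   opener=urllib.request.urlopen):
    """Poll TF Serving's /v1/models/<name> status endpoint until it answers.

    Connection errors are swallowed and retried, since the model server
    may not have bound its port yet while a large model is still loading.
    Raises TimeoutError if the model never becomes reachable.
    """
    url = f"{base_url}/v1/models/{model_name}"
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with opener(url) as resp:
                if resp.status == 200:
                    return True
        except urllib.error.URLError:
            pass  # server not listening yet; keep polling
        time.sleep(interval)
    raise TimeoutError(f"model {model_name} not available after {timeout}s")
```

The `opener` parameter is just there to make the helper easy to stub out in tests; in real use the default `urllib.request.urlopen` is fine.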
Checklist
Concise Description: Getting this error when invoking an MME on a SageMaker setup using the
763104351884.dkr.ecr.us-east-1.amazonaws.com/tensorflow-inference:2.3.0-cpu-py37-ubuntu18.04
container image:

urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='localhost', port=14448): Max retries exceeded with url: /v1/models/d2295a7526f9df36354b8a2c4adc4f63 (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f70966dba50>: Failed to establish a new connection: [Errno 111] Connection refused'))

Traceback (most recent call last):
  File "/sagemaker/python_service.py", line 157, in _handle_load_model_post
    self._wait_for_model(model_name)
  File "/sagemaker/python_service.py", line 247, in _wait_for_model
    response = session.get(url)
  File "/usr/local/lib/python3.7/site-packages/requests/sessions.py", line 546, in get
    return self.request('GET', url, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/requests/sessions.py", line 533, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python3.7/site-packages/requests/sessions.py", line 646, in send
    r = adapter.send(request, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/requests/adapters.py", line 516, in send
    raise ConnectionError(e, request=request)
DLC image/dockerfile: 763104351884.dkr.ecr.us-east-1.amazonaws.com/tensorflow-inference:2.3.0-cpu-py37-ubuntu18.04
Current behavior: The load-model request fails with the ConnectionError shown above.
Expected behavior: The model should load and return a prediction.
Additional context: I have set up an MME using the above-mentioned container and invoke the endpoint from a Lambda function. The model files are placed in S3 in the correct directory structure with a version number.
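For reference, TensorFlow Serving loads a model only if it contains at least one numeric version subdirectory holding a saved_model.pb. A quick sanity check of that layout on a locally staged copy of the model, before uploading it to S3, might look like this (a sketch; `has_valid_layout` is a hypothetical helper, not part of any SageMaker or TF Serving API):

```python
from pathlib import Path


def has_valid_layout(model_dir):
    """Return True if model_dir contains at least one numeric version
    subdirectory with a saved_model.pb inside it, i.e. the directory
    layout TensorFlow Serving expects (e.g. my_model/1/saved_model.pb)."""
    root = Path(model_dir)
    if not root.is_dir():
        return False
    for child in root.iterdir():
        if (child.is_dir() and child.name.isdigit()
                and (child / "saved_model.pb").is_file()):
            return True
    return False
```

This only verifies the version-directory convention; it does not validate the SavedModel contents themselves.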