aws / sagemaker-tensorflow-serving-container

A TensorFlow Serving solution for use in SageMaker. This repo is now deprecated.
Apache License 2.0

Batch Transform function starts sending image inference requests before model is actually loaded #189

Closed · tbiker closed this issue 3 years ago

tbiker commented 3 years ago

**Describe the bug**
When I use the batch transform method to invoke a custom TensorFlow model (model.tar.gz), it appears that the image inference requests start before the actual model file (a 180 MB file) has been loaded.

**Expected behavior**
I expect the model to load first and then the batch processing of images to proceed.

**System information**

I have a TensorFlow 2.4.1 model in S3, packaged in the expected model.tar.gz format. The model.tar.gz includes an inference.py that prepares the image payload for the model via input_handler. A requirements.txt file specifies that the json5 package should be installed. All of this appears to start up correctly.
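For reference, the archive layout matches what the container logs below show: the SavedModel sits under export/Servo/1 and the handler code under code/. A minimal packaging sketch (the local directory names export/ and code/ here are placeholders for illustration):

import tarfile

# export/Servo/1/ holds saved_model.pb and variables/
# code/ holds inference.py and requirements.txt
with tarfile.open('model.tar.gz', 'w:gz') as tar:
    tar.add('export', arcname='export')
    tar.add('code', arcname='code')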

In my Python code, I use:

import numpy as np
import os
import boto3
import sagemaker
from sagemaker import get_execution_role
from sagemaker.tensorflow import TensorFlowModel

sagemaker_session = sagemaker.Session()
try:
    role = get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='Sagemaker')['Role']['Arn']

region = sagemaker_session.boto_region_name
bucket = sagemaker_session.default_bucket()
prefix = 'cvmodel'
print('Region: {}'.format(region))
print('S3 URI: s3://{}/{}'.format(bucket, prefix))
print('Role:   {}'.format(role))

model = TensorFlowModel(model_data='s3://cvmodel/model.tar.gz',
                        role=role,
                        entry_point='inference.py',
                        framework_version="2.4.1")
transformer = model.transformer(instance_count=1,
                                instance_type='ml.m4.xlarge',
                                max_concurrent_transforms=1,
                                max_payload=1,
                                output_path='s3://cvmodel/results')
transformer.transform('s3://cvmodel/images', content_type='application/x-image')
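The transform call returns as soon as the job is submitted; to block in the notebook until the job finishes and stream its logs, transformer.wait() can be added afterwards (a usage note, not part of my original code):

transformer.wait()  # blocks until the batch transform job completes and streams its job logs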

**The inference.py input_handler looks like this:**

import base64
import io
import json
import requests

def input_handler(data, context):
    """ Pre-process request input before it is sent to TensorFlow Serving REST API
    Args:
        data (obj): the request data stream
        context (Context): an object containing request and configuration details
    Returns:
        (dict): a JSON-serializable dict that contains request body and headers
    """

    print('input handler called')
    if context.request_content_type == 'application/x-image':
        payload = data.read()
        encoded_image = base64.b64encode(payload).decode('utf-8')
        instance = [{"b64": encoded_image}]
        print('json image produced')
        return json.dumps({"instances": instance})
    else:
        _return_error(415, 'Unsupported content type "{}"'.format(context.request_content_type or 'Unknown'))
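(_return_error and the matching output_handler are not shown above; they follow the standard sample handlers from the SageMaker TensorFlow Serving documentation, roughly like this sketch:)

def _return_error(code, message):
    # Raising here makes the serving stack return an error response to the client.
    raise ValueError('Error: {}, {}'.format(str(code), message))

def output_handler(data, context):
    """Post-process TensorFlow Serving output before it is returned to the client."""
    if data.status_code != 200:
        raise ValueError(data.content.decode('utf-8'))
    response_content_type = context.accept_header
    prediction = data.content
    return prediction, response_content_type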

This is the log I receive after calling the transform function and waiting about 4 minutes. What is the problem here?

.........................INFO:__main__:starting services
INFO:tfs_utils:using default model name: Servo
INFO:tfs_utils:tensorflow serving model config: 
model_config_list: {
  config: {
    name: "Servo",
    base_path: "/opt/ml/model/export/Servo",
    model_platform: "tensorflow"
  }
}

INFO:__main__:using default model name: Servo
INFO:__main__:tensorflow serving model config: 
model_config_list: {
  config: {
    name: "Servo",
    base_path: "/opt/ml/model/export/Servo",
    model_platform: "tensorflow"
  }
}

INFO:__main__:tensorflow version info:
2021-03-14 23:56:24.251936: W external/org_tensorflow/tensorflow/core/profiler/internal/smprofiler_timeline.cc:460] Initializing the SageMaker Profiler.
2021-03-14 23:56:24.252074: W external/org_tensorflow/tensorflow/core/profiler/internal/smprofiler_timeline.cc:105] SageMaker Profiler is not enabled. The timeline writer thread will not be started, future recorded events will be dropped.
TensorFlow ModelServer: 2.4.0-rc4+dev.sha.no_git
TensorFlow Library: 2.4.1
INFO:__main__:tensorflow serving command: tensorflow_model_server --port=10000 --rest_api_port=10001 --model_config_file=/sagemaker/model-config.cfg --max_num_load_retries=0 
INFO:__main__:started tensorflow serving (pid: 12)
INFO:__main__:nginx config: 
load_module modules/ngx_http_js_module.so;

worker_processes auto;
daemon off;
pid /tmp/nginx.pid;
error_log  /dev/stderr error;

worker_rlimit_nofile 4096;

events {
  worker_connections 2048;
}

http {
  include /etc/nginx/mime.types;
  default_type application/json;
  access_log /dev/stdout combined;
  js_include tensorflow-serving.js;

  upstream tfs_upstream {
    server localhost:10001;
  }

  upstream gunicorn_upstream {
    server unix:/tmp/gunicorn.sock fail_timeout=1;
  }

  server {
    listen 8080 deferred;
    client_max_body_size 0;
    client_body_buffer_size 100m;
    subrequest_output_buffer_size 100m;

    set $tfs_version 2.4;
    set $default_tfs_model Servo;

    location /tfs {
        rewrite ^/tfs/(.*) /$1  break;
        proxy_redirect off;
        proxy_pass_request_headers off;
        proxy_set_header Content-Type 'application/json';
        proxy_set_header Accept 'application/json';
        proxy_pass http://tfs_upstream;
    }

    location /ping {
        proxy_pass http://gunicorn_upstream/ping;
    }

    location /invocations {
        proxy_pass http://gunicorn_upstream/invocations;
    }

    location /models {
        proxy_pass http://gunicorn_upstream/models;
    }

    location / {
        return 404 '{"error": "Not Found"}';
    }

    keepalive_timeout 3;
  }
}

INFO:__main__:installing packages from requirements.txt...
2021-03-14 23:56:24.637383: W external/org_tensorflow/tensorflow/core/profiler/internal/smprofiler_timeline.cc:460] Initializing the SageMaker Profiler.
2021-03-14 23:56:24.637508: W external/org_tensorflow/tensorflow/core/profiler/internal/smprofiler_timeline.cc:105] SageMaker Profiler is not enabled. The timeline writer thread will not be started, future recorded events will be dropped.
2021-03-14 23:56:24.640239: I tensorflow_serving/model_servers/server_core.cc:464] Adding/updating models.
2021-03-14 23:56:24.640284: I tensorflow_serving/model_servers/server_core.cc:587]  (Re-)adding model: Servo
2021-03-14 23:56:24.740563: I tensorflow_serving/util/retrier.cc:46] Retrying of Reserving resources for servable: {name: Servo version: 1} exhausted max_num_retries: 0
2021-03-14 23:56:24.740603: I tensorflow_serving/core/basic_manager.cc:740] Successfully reserved resources to load servable {name: Servo version: 1}
2021-03-14 23:56:24.740626: I tensorflow_serving/core/loader_harness.cc:66] Approving load for servable version {name: Servo version: 1}
2021-03-14 23:56:24.740645: I tensorflow_serving/core/loader_harness.cc:74] Loading servable version {name: Servo version: 1}
2021-03-14 23:56:24.740723: I external/org_tensorflow/tensorflow/cc/saved_model/reader.cc:32] Reading SavedModel from: /opt/ml/model/export/Servo/1
2021-03-14 23:56:24.888950: I external/org_tensorflow/tensorflow/cc/saved_model/reader.cc:55] Reading meta graph with tags { serve }
2021-03-14 23:56:24.889008: I external/org_tensorflow/tensorflow/cc/saved_model/reader.cc:93] Reading SavedModel debug info (if present) from: /opt/ml/model/export/Servo/1
2021-03-14 23:56:24.889602: I external/org_tensorflow/tensorflow/core/common_runtime/process_util.cc:146] Creating new thread pool with default inter op setting: 2. Tune using inter_op_parallelism_threads for best performance.
Collecting json5
  Downloading json5-0.9.5-py2.py3-none-any.whl (17 kB)
2021-03-14 23:56:25.373645: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:206] Restoring SavedModel bundle.
Installing collected packages: json5
Successfully installed json5-0.9.5
2021-03-14 23:56:25.435171: I external/org_tensorflow/tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 2300050000 Hz
INFO:__main__:gunicorn command: gunicorn -b unix:/tmp/gunicorn.sock -k gevent --chdir /sagemaker --pythonpath /opt/ml/model/code -e TFS_GRPC_PORT=10000 -e SAGEMAKER_MULTI_MODEL=False -e SAGEMAKER_SAFE_PORT_RANGE=10000-10999 python_service:app
INFO:__main__:gunicorn version info:
gunicorn (version 20.0.4)
INFO:__main__:started gunicorn (pid: 47)
[2021-03-14 23:56:26 +0000] [47] [INFO] Starting gunicorn 20.0.4
[2021-03-14 23:56:26 +0000] [47] [INFO] Listening at: unix:/tmp/gunicorn.sock (47)
INFO:__main__:gunicorn server is ready!
[2021-03-14 23:56:26 +0000] [47] [INFO] Using worker: gevent
[2021-03-14 23:56:26 +0000] [51] [INFO] Booting worker with pid: 51
INFO:__main__:nginx version info:
nginx version: nginx/1.18.0
built by gcc 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04) 
built with OpenSSL 1.1.1  11 Sep 2018
TLS SNI support enabled
configure arguments: --prefix=/etc/nginx --sbin-path=/usr/sbin/nginx --modules-path=/usr/lib/nginx/modules --conf-path=/etc/nginx/nginx.conf --error-log-path=/var/log/nginx/error.log --http-log-path=/var/log/nginx/access.log --pid-path=/var/run/nginx.pid --lock-path=/var/run/nginx.lock --http-client-body-temp-path=/var/cache/nginx/client_temp --http-proxy-temp-path=/var/cache/nginx/proxy_temp --http-fastcgi-temp-path=/var/cache/nginx/fastcgi_temp --http-uwsgi-temp-path=/var/cache/nginx/uwsgi_temp --http-scgi-temp-path=/var/cache/nginx/scgi_temp --user=nginx --group=nginx --with-compat --with-file-aio --with-threads --with-http_addition_module --with-http_auth_request_module --with-http_dav_module --with-http_flv_module --with-http_gunzip_module --with-http_gzip_static_module --with-http_mp4_module --with-http_random_index_module --with-http_realip_module --with-http_secure_link_module --with-http_slice_module --with-http_ssl_module --with-http_stub_status_module --with-http_sub_module --with-http_v2_module --with-mail --with-mail_ssl_module --with-stream --with-stream_realip_module --with-stream_ssl_module --with-stream_ssl_preread_module --with-cc-opt='-g -O2 -fdebug-prefix-map=/data/builder/debuild/nginx-1.18.0/debian/debuild-base/nginx-1.18.0=. -fstack-protector-strong -Wformat -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -fPIC' --with-ld-opt='-Wl,-Bsymbolic-functions -Wl,-z,relro -Wl,-z,now -Wl,--as-needed -pie'
INFO:__main__:started nginx (pid: 52)
169.254.255.130 - - [14/Mar/2021:23:56:26 +0000] "GET /ping HTTP/1.1" 200 0 "-" "Go-http-client/1.1"
169.254.255.130 - - [14/Mar/2021:23:56:26 +0000] "GET /execution-parameters HTTP/1.1" 404 22 "-" "Go-http-client/1.1"
INFO:python_service:http://gunicorn_upstream/invocations
INFO:tfs_utils:sagemaker tfs attributes: 
{}
input handler called
json image produced
ERROR:python_service:exception handling request: HTTPConnectionPool(host='localhost', port=10001): Max retries exceeded with url: /v1/models/Servo:predict (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7efe17b5a810>: Failed to establish a new connection: [Errno 111] Connection refused'))
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/urllib3/connection.py", line 170, in _new_conn
    (self._dns_host, self.port), self.timeout, **extra_kw
  File "/usr/local/lib/python3.7/site-packages/urllib3/util/connection.py", line 96, in create_connection
    raise err
  File "/usr/local/lib/python3.7/site-packages/urllib3/util/connection.py", line 86, in create_connection
    sock.connect(sa)
  File "/usr/local/lib/python3.7/site-packages/gevent/_socketcommon.py", line 607, in connect
    raise _SocketError(err, strerror(err))
ConnectionRefusedError: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 706, in urlopen
    chunked=chunked,
  File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 394, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/usr/local/lib/python3.7/site-packages/urllib3/connection.py", line 234, in request
    super(HTTPConnection, self).request(method, url, body=body, headers=headers)
  File "/usr/local/lib/python3.7/http/client.py", line 1277, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/local/lib/python3.7/http/client.py", line 1323, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/usr/local/lib/python3.7/http/client.py", line 1272, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/usr/local/lib/python3.7/http/client.py", line 1032, in _send_output
    self.send(msg)
  File "/usr/local/lib/python3.7/http/client.py", line 972, in send
    self.connect()
  File "/usr/local/lib/python3.7/site-packages/urllib3/connection.py", line 200, in connect
    conn = self._new_conn()
  File "/usr/local/lib/python3.7/site-packages/urllib3/connection.py", line 182, in _new_conn
    self, "Failed to establish a new connection: %s" % e
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7efe17b5a810>: Failed to establish a new connection: [Errno 111] Connection refused

169.254.255.130 - - [14/Mar/2021:23:56:26 +0000] "POST /invocations HTTP/1.1" 500 283 "-" "Go-http-client/1.1"
INFO:python_service:http://gunicorn_upstream/invocations
input handler called # This is my print statement in my input_handler
INFO:tfs_utils:sagemaker tfs attributes: 
{}
json image produced # This is my print statement in my input_handler
ERROR:python_service:exception handling request: HTTPConnectionPool(host='localhost', port=10001): Max retries exceeded with url: /v1/models/Servo:predict (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7efe16e3b790>: Failed to establish a new connection: [Errno 111] Connection refused'))
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/urllib3/connection.py", line 170, in _new_conn
    (self._dns_host, self.port), self.timeout, **extra_kw
  File "/usr/local/lib/python3.7/site-packages/urllib3/util/connection.py", line 96, in create_connection
    raise err
  File "/usr/local/lib/python3.7/site-packages/urllib3/util/connection.py", line 86, in create_connection
    sock.connect(sa)
  File "/usr/local/lib/python3.7/site-packages/gevent/_socketcommon.py", line 607, in connect
    raise _SocketError(err, strerror(err))
ConnectionRefusedError: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 706, in urlopen
    chunked=chunked,
  File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 394, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/usr/local/lib/python3.7/site-packages/urllib3/connection.py", line 234, in request
    super(HTTPConnection, self).request(method, url, body=body, headers=headers)
  File "/usr/local/lib/python3.7/http/client.py", line 1277, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/local/lib/python3.7/http/client.py", line 1323, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/usr/local/lib/python3.7/http/client.py", line 1272, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/usr/local/lib/python3.7/http/client.py", line 1032, in _send_output
    self.send(msg)
  File "/usr/local/lib/python3.7/http/client.py", line 972, in send
    self.connect()
  File "/usr/local/lib/python3.7/site-packages/urllib3/connection.py", line 200, in connect
    conn = self._new_conn()
  File "/usr/local/lib/python3.7/site-packages/urllib3/connection.py", line 182, in _new_conn
    self, "Failed to establish a new connection: %s" % e
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7efe16e3b790>: Failed to establish a new connection: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/requests/adapters.py", line 449, in send
    timeout=timeout
  File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 756, in urlopen
    method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
  File "/usr/local/lib/python3.7/site-packages/urllib3/util/retry.py", line 573, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='localhost', port=10001): Max retries exceeded with url: /v1/models/Servo:predict (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7efe16e3b790>: Failed to establish a new connection: [Errno 111] Connection refused'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/sagemaker/python_service.py", line 292, in _handle_invocation_post
    res.body, res.content_type = self._handlers(data, context)
  File "/sagemaker/python_service.py", line 325, in handler
    response = requests.post(context.rest_uri, data=processed_input)
  File "/usr/local/lib/python3.7/site-packages/requests/api.py", line 119, in post
    return request('post', url, data=data, json=json, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/requests/api.py", line 61, in request
    return session.request(method=method, url=url, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/requests/sessions.py", line 542, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python3.7/site-packages/requests/sessions.py", line 655, in send
    r = adapter.send(request, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/requests/adapters.py", line 516, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='localhost', port=10001): Max retries exceeded with url: /v1/models/Servo:predict (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7efe16e3b790>: Failed to establish a new connection: [Errno 111] Connection refused'))
169.254.255.130 - - [14/Mar/2021:23:56:26 +0000] "POST /invocations HTTP/1.1" 500 283 "-" "Go-http-client/1.1"
INFO:python_service:http://gunicorn_upstream/invocations
input handler called
INFO:tfs_utils:sagemaker tfs attributes: 
{}
json image produced
ERROR:python_service:exception handling request: HTTPConnectionPool(host='localhost', port=10001): Max retries exceeded with url: /v1/models/Servo:predict (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7efe16e3b4d0>: Failed to establish a new connection: [Errno 111] Connection refused'))
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/urllib3/connection.py", line 170, in _new_conn
    (self._dns_host, self.port), self.timeout, **extra_kw
  File "/usr/local/lib/python3.7/site-packages/urllib3/util/connection.py", line 96, in create_connection
    raise err
  File "/usr/local/lib/python3.7/site-packages/urllib3/util/connection.py", line 86, in create_connection
    sock.connect(sa)
  File "/usr/local/lib/python3.7/site-packages/gevent/_socketcommon.py", line 607, in connect
    raise _SocketError(err, strerror(err))
ConnectionRefusedError: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 706, in urlopen
    chunked=chunked,
  File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 394, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/usr/local/lib/python3.7/site-packages/urllib3/connection.py", line 234, in request
    super(HTTPConnection, self).request(method, url, body=body, headers=headers)
  File "/usr/local/lib/python3.7/http/client.py", line 1277, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/local/lib/python3.7/http/client.py", line 1323, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/usr/local/lib/python3.7/http/client.py", line 1272, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/usr/local/lib/python3.7/http/client.py", line 1032, in _send_output
    self.send(msg)
  File "/usr/local/lib/python3.7/http/client.py", line 972, in send
    self.connect()
  File "/usr/local/lib/python3.7/site-packages/urllib3/connection.py", line 200, in connect
    conn = self._new_conn()
  File "/usr/local/lib/python3.7/site-packages/urllib3/connection.py", line 182, in _new_conn
    self, "Failed to establish a new connection: %s" % e
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7efe16e3b4d0>: Failed to establish a new connection: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/requests/adapters.py", line 449, in send
    timeout=timeout
  File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 756, in urlopen
    method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
  File "/usr/local/lib/python3.7/site-packages/urllib3/util/retry.py", line 573, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='localhost', port=10001): Max retries exceeded with url: /v1/models/Servo:predict (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7efe16e3b4d0>: Failed to establish a new connection: [Errno 111] Connection refused'))

...

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/requests/adapters.py", line 449, in send
    timeout=timeout
  File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 756, in urlopen
    method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
  File "/usr/local/lib/python3.7/site-packages/urllib3/util/retry.py", line 573, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='localhost', port=10001): Max retries exceeded with url: /v1/models/Servo:predict (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7efe16e1ba10>: Failed to establish a new connection: [Errno 111] Connection refused'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/sagemaker/python_service.py", line 292, in _handle_invocation_post
    res.body, res.content_type = self._handlers(data, context)
  File "/sagemaker/python_service.py", line 325, in handler
    response = requests.post(context.rest_uri, data=processed_input)
  File "/usr/local/lib/python3.7/site-packages/requests/api.py", line 119, in post
    return request('post', url, data=data, json=json, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/requests/api.py", line 61, in request
    return session.request(method=method, url=url, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/requests/sessions.py", line 542, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python3.7/site-packages/requests/sessions.py", line 655, in send
    r = adapter.send(request, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/requests/adapters.py", line 516, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='localhost', port=10001): Max retries exceeded with url: /v1/models/Servo:predict (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7efe16e1ba10>: Failed to establish a new connection: [Errno 111] Connection refused'))
2021-03-14 23:56:27.554814: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:190] Running initialization op on SavedModel bundle at path: /opt/ml/model/export/Servo/1
2021-03-14 23:56:27.901896: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:277] SavedModel load for tags { serve }; Status: success: OK. Took 3161163 microseconds.
2021-03-14 23:56:27.987438: I tensorflow_serving/servables/tensorflow/saved_model_warmup_util.cc:59] No warmup data file found at /opt/ml/model/export/Servo/1/assets.extra/tf_serving_warmup_requests
2021-03-14 23:56:27.988975: I tensorflow_serving/util/retrier.cc:46] Retrying of Loading servable: {name: Servo version: 1} exhausted max_num_retries: 0
2021-03-14 23:56:27.989006: I tensorflow_serving/core/loader_harness.cc:87] Successfully loaded servable version {name: Servo version: 1}
2021-03-14 23:56:27.991672: I tensorflow_serving/model_servers/server.cc:371] Running gRPC ModelServer at 0.0.0.0:10000 ...
[warn] getaddrinfo: address family for nodename not supported
2021-03-14 23:56:27.992764: I tensorflow_serving/model_servers/server.cc:391] Exporting HTTP/REST API at:localhost:10001 ...
[evhttp_server.cc : 238] NET_LOG: Entering the event loop ...

2021-03-14T23:56:26.474:[sagemaker logs]: MaxConcurrentTransforms=1, MaxPayloadInMB=1, BatchStrategy=MULTI_RECORD
2021-03-14T23:56:26.572:[sagemaker logs]: cvmodel/images/123_1024x768.jpg: Bad HTTP status received from algorithm: 500
2021-03-14T23:56:26.573:[sagemaker logs]: cvmodel/images/123_1024x768.jpg: 
2021-03-14T23:56:26.573:[sagemaker logs]: cvmodel/images/123_1024x768.jpg: Message:
2021-03-14T23:56:26.573:[sagemaker logs]: cvmodel/images/123_1024x768.jpg: {"error": "HTTPConnectionPool(host='localhost', port=10001): Max retries exceeded with url: /v1/models/Servo:predict (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7efe16e2cf10>: Failed to establish a new connection: [Errno 111] Connection refused'))"}
2021-03-14T23:56:26.645:[sagemaker logs]: cvmodel/images/124_1024x768.jpg: Bad HTTP status received from algorithm: 500
2021-03-14T23:56:26.645:[sagemaker logs]: cvmodel/images/124_1024x768.jpg: 
2021-03-14T23:56:26.645:[sagemaker logs]: cvmodel/images/124_1024x768.jpg: Message:
2021-03-14T23:56:26.645:[sagemaker logs]: cvmodel/images/124_1024x768.jpg: {"error": "HTTPConnectionPool(host='localhost', port=10001): Max retries exceeded with url: /v1/models/Servo:predict (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7efe16e1ba10>: Failed to establish a new connection: [Errno 111] Connection refused'))"}
liangma8712 commented 3 years ago

Fixed in https://github.com/aws/sagemaker-tensorflow-serving-container/pull/192