aws / sagemaker-tensorflow-training-toolkit

Toolkit for running TensorFlow training scripts on SageMaker. Dockerfiles used for building SageMaker TensorFlow Containers are at https://github.com/aws/deep-learning-containers.
Apache License 2.0

input_fn not being applied for incoming requests in non-local mode #87

Closed: fernandoamat closed this issue 4 years ago

fernandoamat commented 5 years ago

Hi, I have set up a small regression example to test the SageMaker TensorFlow pipeline from beginning to end, following the examples here.

Everything worked OK except when I want to override the input_fn method in my module so I can parse/transform incoming data at prediction time. If I use instance_type='local', the code works and the customized input_fn method is called during the estimator.predict call. However, when I deploy the model to an endpoint and try to query it with the following call

aws sagemaker-runtime invoke-endpoint --region 'us-east-2' --endpoint-name house-price-estimator-test-synthetic-keras --body "{\"int_living_sqft\": 0.5, \"int_beds\": 0.2}" --content-type "application/json" --accept "application/json" outputJson.json

the custom input_fn method does not seem to be called. If I call the endpoint like this

aws sagemaker-runtime invoke-endpoint --region 'us-east-2' --endpoint-name house-price-estimator-test-synthetic-keras --body "{\"inputs\": [ [0.5, 0.2]]}" --content-type "application/json" outputJson.json

the request succeeds, which is a clear indication that it is using the default JSON parser from here.
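For anyone reproducing this from Python instead of the CLI, the same request can be issued with boto3 (a minimal sketch, assuming default AWS credentials are configured):

import boto3

client = boto3.client('sagemaker-runtime', region_name='us-east-2')
response = client.invoke_endpoint(
    EndpointName='house-price-estimator-test-synthetic-keras',
    Body='{"inputs": [[0.5, 0.2]]}',
    ContentType='application/json',
)
print(response['Body'].read())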

I include a Zip file to reproduce everything: it has the notebook used to train and deploy a simple toy regression model in a SageMaker notebook. It also has the Python file with the model_fn, input_fn, and the other methods required by the notebook.

Any help would be appreciated, as I am out of ideas about what could be going wrong and why input_fn is not being called by the endpoint when it receives a prediction request. Again, instance_type='local' seems to work fine.

Thanks, Fernando

sagemaker_input_fn_issue.zip

This is the traceback from the CloudWatch Console:

Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/container_support/serving.py", line 182, in _invoke
self.transformer.transform(content, input_content_type, requested_output_content_type)
File "/usr/local/lib/python2.7/dist-packages/tf_container/serve.py", line 281, in transform
return self.transform_fn(data, content_type, accepts), accepts
File "/usr/local/lib/python2.7/dist-packages/tf_container/serve.py", line 208, in f
prediction = self.predict_fn(input)
File "/usr/local/lib/python2.7/dist-packages/tf_container/serve.py", line 223, in predict_fn
return self.proxy_client.request(data)
File "/usr/local/lib/python2.7/dist-packages/tf_container/proxy_client.py", line 71, in request
return request_fn(data)
File "/usr/local/lib/python2.7/dist-packages/tf_container/proxy_client.py", line 99, in predict
result = self.prediction_service_stub.Predict(request, self.request_timeout)
File "/usr/local/lib/python2.7/dist-packages/grpc/beta/_client_adaptations.py", line 309, in __call__
self._request_serializer, self._response_deserializer)
File "/usr/local/lib/python2.7/dist-packages/grpc/beta/_client_adaptations.py", line 195, in _blocking_unary_unary
raise _abortion_error(rpc_error_call)

AbortionError: AbortionError(code=StatusCode.INVALID_ARGUMENT, details="input size does not match signature")
icywang86rui commented 5 years ago

Hi,

Local mode is implemented outside of the containers. The code running inside the containers should work exactly the same in local mode and in cloud mode. Could you show me the entire log from the endpoint, including from when the endpoint was starting? Thanks!

fernandoamat commented 5 years ago

Hi @icywang86rui, here I copy the entire log from CloudWatch for the endpoint: it basically covers from the moment I call estimator.deploy to create the endpoint to the moment I try to invoke the endpoint and it returns an error. Let me know if you need anything else. Thanks, @fernandoamat

2018-10-16 12:34:55,034 INFO - root - running container entrypoint
2018-10-16 12:34:55,034 INFO - root - starting serve task
2018-10-16 12:34:55,034 INFO - container_support.serving - reading config
2018-10-16 12:34:55,548 INFO - container_support.serving - importing user module
2018-10-16 12:34:55,548 INFO - container_support.serving - loading framework-specific dependencies
2018-10-16 12:34:57,024 INFO - container_support.serving - starting nginx
2018-10-16 12:34:57,043 INFO - container_support.serving - starting gunicorn
2018-10-16 12:34:57,051 INFO - container_support.serving - inference server started. waiting on processes: set([21, 22])
2018-10-16 12:34:57.117647: I tensorflow_serving/model_servers/main.cc:154] Building single TensorFlow model file config: model_name: generic_model model_base_path: /opt/ml/model/export/Servo
2018-10-16 12:34:57.118730: I tensorflow_serving/model_servers/server_core.cc:444] Adding/updating models.
2018-10-16 12:34:57.118753: I tensorflow_serving/model_servers/server_core.cc:499] (Re-)adding model: generic_model
2018-10-16 12:34:57.123848: I tensorflow_serving/core/basic_manager.cc:716] Successfully reserved resources to load servable {name: generic_model version: 1538917129}
2018-10-16 12:34:57.123868: I tensorflow_serving/core/loader_harness.cc:66] Approving load for servable version {name: generic_model version: 1538917129}
2018-10-16 12:34:57.123983: I tensorflow_serving/core/loader_harness.cc:74] Loading servable version {name: generic_model version: 1538917129}
2018-10-16 12:34:57.124102: I external/org_tensorflow/tensorflow/contrib/session_bundle/bundle_shim.cc:360] Attempting to load native SavedModelBundle in bundle-shim from: /opt/ml/model/export/Servo/1538917129
2018-10-16 12:34:57.124180: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:242] Loading SavedModel with tags: { serve }; from: /opt/ml/model/export/Servo/1538917129
2018-10-16 12:34:57.125830: I external/org_tensorflow/tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2018-10-16 12:34:57.145138: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:161] Restoring SavedModel bundle.
2018-10-16 12:34:57.156822: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:196] Running LegacyInitOp on SavedModel bundle.
2018-10-16 12:34:57.164458: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:291] SavedModel load for tags { serve }; Status: success. Took 40328 microseconds.
2018-10-16 12:34:57.164739: I tensorflow_serving/core/loader_harness.cc:86] Successfully loaded servable version {name: generic_model version: 1538917129}
2018-10-16 12:34:57.168421: I tensorflow_serving/model_servers/main.cc:316] Running ModelServer at 0.0.0.0:9000 ...
[2018-10-16 12:34:57 +0000] [22] [INFO] Starting gunicorn 19.9.0
[2018-10-16 12:34:57 +0000] [22] [INFO] Listening at: unix:/tmp/gunicorn.sock (22)
[2018-10-16 12:34:57 +0000] [22] [INFO] Using worker: gevent
[2018-10-16 12:34:57 +0000] [47] [INFO] Booting worker with pid: 47
[2018-10-16 12:34:57 +0000] [48] [INFO] Booting worker with pid: 48
2018-10-16 12:34:57,487 INFO - container_support.serving - creating Server instance
2018-10-16 12:34:57,509 INFO - container_support.serving - creating Server instance
2018-10-16 12:34:58,805 INFO - tf_container - ---------------------------Model Spec---------------------------
2018-10-16 12:34:58,806 INFO - tf_container - {
  "modelSpec": {
    "version": "1538917129",
    "name": "generic_model"
  },
  "metadata": {
    "signature_def": {
      "@type": "type.googleapis.com/tensorflow.serving.SignatureDefMap",
      "signatureDef": {
        "serving_default": {
          "inputs": {
            "inputs": {
              "dtype": "DT_FLOAT",
              "name": "Placeholder_1:0",
              "tensorShape": {
                "dim": [
                  { "size": "-1" },
                  { "size": "2" }
                ]
              }
            }
          },
          "methodName": "tensorflow/serving/predict",
          "outputs": {
            "price": {
              "dtype": "DT_FLOAT",
              "name": "Reshape:0",
              "tensorShape": {
                "dim": [
                  { "size": "-1" }
                ]
              }
            }
          }
        }
      }
    }
  }
}
2018-10-16 12:34:58,807 INFO - tf_container - ----------------------------------------------------------------
2018-10-16 12:34:58,807 INFO - tf_container - TF Serving model successfully loaded
2018-10-16 12:34:58,810 INFO - container_support.serving - returning initialized server
2018-10-16 12:34:58,824 INFO - tf_container - ---------------------------Model Spec---------------------------
2018-10-16 12:34:58,824 INFO - tf_container - {
  "modelSpec": {
    "version": "1538917129",
    "name": "generic_model"
  },
  "metadata": {
    "signature_def": {
      "@type": "type.googleapis.com/tensorflow.serving.SignatureDefMap",
      "signatureDef": {
        "serving_default": {
          "inputs": {
            "inputs": {
              "dtype": "DT_FLOAT",
              "name": "Placeholder_1:0",
              "tensorShape": {
                "dim": [
                  { "size": "-1" },
                  { "size": "2" }
                ]
              }
            }
          },
          "methodName": "tensorflow/serving/predict",
          "outputs": {
            "price": {
              "dtype": "DT_FLOAT",
              "name": "Reshape:0",
              "tensorShape": {
                "dim": [
                  { "size": "-1" }
                ]
              }
            }
          }
        }
      }
    }
  }
}
2018-10-16 12:34:58,824 INFO - tf_container - ----------------------------------------------------------------
2018-10-16 12:34:58,825 INFO - tf_container - TF Serving model successfully loaded
2018-10-16 12:34:58,826 INFO - container_support.serving - returning initialized server
[2018-10-16 12:36:46,649] ERROR in serving: invalid literal for float(): 0.5,0.2
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/container_support/serving.py", line 182, in _invoke
self.transformer.transform(content, input_content_type, requested_output_content_type)
File "/usr/local/lib/python2.7/dist-packages/tf_container/serve.py", line 281, in transform
return self.transform_fn(data, content_type, accepts), accepts
File "/usr/local/lib/python2.7/dist-packages/tf_container/serve.py", line 207, in f
input = input_fn(serialized_data, content_type)
File "/usr/local/lib/python2.7/dist-packages/tf_container/serve.py", line 238, in _default_input_fn
return self._parse_csv_request(serialized_data)
File "/usr/local/lib/python2.7/dist-packages/tf_container/serve.py", line 195, in _parse_csv_request
full_array = [float(i) for i in row]
ValueError: invalid literal for float(): 0.5,0.2
[2018-10-16 12:36:46,650] ERROR in serving: invalid literal for float(): 0.5,0.2
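The last error above comes from the container's default CSV handler. Its failing line boils down to something like this (a minimal reproduction, assuming the container splits the body with Python's csv module as the _parse_csv_request frame suggests; the message matches the container's Python 2.7):

import csv

row = next(csv.reader(['"0.5,0.2"']))  # the quoted body is parsed as one field: '0.5,0.2'
full_array = [float(i) for i in row]   # ValueError: invalid literal for float(): 0.5,0.2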
icywang86rui commented 5 years ago

Could you show me the environment variables set in your SageMaker model's primary container? You can find this in the AWS console on the model's page. Let's confirm that the model is set up correctly.

fernandoamat commented 5 years ago

Hi @icywang86rui, here are the environment variables listed on the model's page.

SAGEMAKER_CONTAINER_LOG_LEVEL   20
SAGEMAKER_ENABLE_CLOUDWATCH_METRICS false
SAGEMAKER_PROGRAM   keras_linear_regression_synthetic_house_price.py
SAGEMAKER_REGION    us-east-2
SAGEMAKER_SUBMIT_DIRECTORY  s3://test-estimator/customcode/tensorflow_synthetic_test_house_price/house-price-estimator-tensorflow-test-ts1539692900/source/sourcedir.tar.gz

Thanks, Fernando

mvsusp commented 5 years ago

Hi @fernandoamat ,

The input_fn is working in SageMaker as well. Let me explain what I think is happening:

  1. Your input_fn receives "{\"inputs\": [ [0.5, 0.2]]}" and returns {"inputs": [ [0.5, 0.2]]}
  2. The server sends the data to the proxy client https://github.com/aws/sagemaker-tensorflow-container/blob/master/src/tf_container/serve.py#L223
  3. The proxy client needs to transform your dictionary to a valid GRPC request https://github.com/aws/sagemaker-tensorflow-container/blob/master/src/tf_container/proxy_client.py#L98
  4. Your request is translated to a valid GRPC https://github.com/aws/sagemaker-tensorflow-container/blob/master/src/tf_container/proxy_client.py#L145

To achieve your goal, which is to receive "{\"int_living_sqft\": 0.5, \"int_beds\": 0.2}" as a valid request, I would do:

import json

def input_fn(data, content_type):
    if content_type == "application/json":
        # Expect a JSON string like {"int_living_sqft": 0.5, "int_beds": 0.2}
        obj = json.loads(data)
        return {"inputs": [[obj['int_living_sqft'], obj['int_beds']]]}

Please, let me know if it works.

Thanks for using SageMaker.

Márcio

fernandoamat commented 5 years ago

Hi @mvsusp, thanks for taking a look at this and for the clear explanation. I included the modification you suggested and re-ran the notebook to retrain and redeploy the endpoint, and unfortunately it does not work. When I call the endpoint like this:

aws sagemaker-runtime invoke-endpoint --region 'us-east-2' --endpoint-name house-price-estimator-test-synthetic-keras --body "{\"int_living_sqft\": 0.5, \"int_beds\": 0.2}" --content-type "application/json" --accept "application/json" outputJson.json

I get the following stack trace in CloudWatch:

[2018-10-30 12:48:02,868] ERROR in serving: AbortionError(code=StatusCode.INVALID_ARGUMENT, details="input size does not match signature")
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/container_support/serving.py", line 182, in _invoke
self.transformer.transform(content, input_content_type, requested_output_content_type)
File "/usr/local/lib/python2.7/dist-packages/tf_container/serve.py", line 281, in transform
return self.transform_fn(data, content_type, accepts), accepts
File "/usr/local/lib/python2.7/dist-packages/tf_container/serve.py", line 208, in f
prediction = self.predict_fn(input)
File "/usr/local/lib/python2.7/dist-packages/tf_container/serve.py", line 223, in predict_fn
return self.proxy_client.request(data)
File "/usr/local/lib/python2.7/dist-packages/tf_container/proxy_client.py", line 71, in request
return request_fn(data)
File "/usr/local/lib/python2.7/dist-packages/tf_container/proxy_client.py", line 99, in predict
result = self.prediction_service_stub.Predict(request, self.request_timeout)
File "/usr/local/lib/python2.7/dist-packages/grpc/beta/_client_adaptations.py", line 309, in __call__
self._request_serializer, self._response_deserializer)
File "/usr/local/lib/python2.7/dist-packages/grpc/beta/_client_adaptations.py", line 195, in _blocking_unary_unary
raise _abortion_error(rpc_error_call)
AbortionError: AbortionError(code=StatusCode.INVALID_ARGUMENT, details="input size does not match signature")

However, when I call the same endpoint like this

aws sagemaker-runtime invoke-endpoint --region 'us-east-2' --endpoint-name house-price-estimator-test-synthetic-keras --body "{\"inputs\": [ [0.5, 0.2]]}" --content-type "application/json" outputJson.json

the request succeeds and returns the expected result. This is why I am puzzled: if input_fn were being called, the second request should not succeed.

The stack trace and error message are pretty cryptic, so I am not sure where to look.

My input_fn looks like this:

def clean_serialized_input(serialized_input):
    """
    The SageMaker request adds all sorts of odd characters to the serialized string.
    This method tries to clean it up so the parsing is robust to that.
    :param serialized_input: raw request body as received by input_fn
    :return: the body with escape characters and surrounding quotes stripped
    """
    clean_input = serialized_input.replace("\\", "")  # request adds \" to the string
    if clean_input[0] == "\"":
        offset_start = 1
    else:
        offset_start = 0
    if clean_input[-1] == "\"":
        return clean_input[offset_start:-1]
    else:
        return clean_input[offset_start:]
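For context, this cleanup behaves like so (hypothetical inputs):

print(clean_serialized_input('"0.5,0.2"'))              # -> 0.5,0.2
print(clean_serialized_input('{\\"int_beds\\": 0.2}'))  # -> {"int_beds": 0.2}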

def input_fn(data, content_type):
    if content_type == "application/json":
        # Expected a Json string like {"int_living_sqft": 0.5, "int_beds": 0.2}
        clean_input = clean_serialized_input(data)
        # Request adds " in the string
        obj = json.loads(clean_input)
        # change suggested by @mvsusp 
        #return [[obj['int_living_sqft'], obj['int_beds']]]
        return {INPUT_TENSOR_NAME: [[obj['int_living_sqft'], obj['int_beds']]]}
    elif content_type == "text/csv":
        clean_input = clean_serialized_input(data)
        # Request adds " in the string
        return [[float(i) for i in clean_input.split(',')]]
    else:
        raise ValueError(
'Endpoint is not prepared for content type {}. It only accepts application/json or text/csv'.format(
                content_type))

Any help is appreciated. Thanks, @fernandoamat

nadiaya commented 5 years ago

Hi, as mentioned above by my teammates, the input_fn/default_input_fn logic is part of the container and does not depend on local vs. non-local mode.

Would it be possible to see how you deployed the endpoint in the local and non-local mode cases? Did you use the Python SDK estimator both times?

What happens if you query the endpoint using the Python SDK?
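For reference, such a query with the Python SDK of that era would look roughly like this (a sketch assuming the v1 RealTimePredictor API and the endpoint name from this thread):

from sagemaker.predictor import RealTimePredictor, json_serializer, json_deserializer

predictor = RealTimePredictor(
    endpoint='house-price-estimator-test-synthetic-keras',
    serializer=json_serializer,
    deserializer=json_deserializer,
)
print(predictor.predict({'int_living_sqft': 0.5, 'int_beds': 0.2}))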

Also, would it be possible to see the tar file from s3://test-estimator/customcode/tensorflow_synthetic_test_house_price/house-price-estimator-tensorflow-test-ts1539692900/source/sourcedir.tar.gz with the source code the container was running?

laurenyu commented 4 years ago

Closing due to inactivity. Feel free to reopen if necessary.