bentoml / aws-sagemaker-deploy

Fast model deployment on AWS Sagemaker
Apache License 2.0

SageMaker cannot hit the endpoint #18

Closed NaxAlpha closed 3 years ago

NaxAlpha commented 3 years ago

Describe the bug

After deploying the dev endpoint (ref: bentoml/aws-sagemaker-deploy#13), I cannot get a response with this request:

curl -i \
    -X POST \
    -F image=@data/mobile-sample.png \
    https://123.execute-api.region.amazonaws.com/prod/predict

Looking at CloudWatch, the logs show the error in the traceback under Screenshots/Logs below.

To Reproduce

  1. Deploy an API on SageMaker that takes an image as input.
  2. Call the API by sending an image.

Expected behavior

Running the API locally with bentoml serve Model:latest works, but the same request fails against the API deployed on SageMaker.

Screenshots/Logs

Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/flask/app.py", line 2070, in wsgi_app
    response = self.full_dispatch_request()
  File "/opt/conda/lib/python3.8/site-packages/flask/app.py", line 1515, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/opt/conda/lib/python3.8/site-packages/flask/app.py", line 1513, in full_dispatch_request
    rv = self.dispatch_request()
  File "/opt/conda/lib/python3.8/site-packages/flask/app.py", line 1499, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**req.view_args)
  File "/bento/wsgi.py", line 25, in view_function
    response = api.handle_request(req)
  File "/opt/conda/lib/python3.8/site-packages/bentoml/service/inference_api.py", line 294, in handle_request
    inf_task = self.input_adapter.from_http_request(request)
  File "/opt/conda/lib/python3.8/site-packages/bentoml/adapters/utils.py", line 129, in _method
    return method(self, req)
  File "/opt/conda/lib/python3.8/site-packages/bentoml/adapters/file_input.py", line 148, in from_http_request
    _, _, files = HTTPRequest.parse_form_data(req)
  File "/opt/conda/lib/python3.8/site-packages/bentoml/types.py", line 234, in parse_form_data
    stream, form, files = parse_form_data(environ, silent=False)
  File "/opt/conda/lib/python3.8/site-packages/werkzeug/formparser.py", line 126, in parse_form_data
    return FormDataParser(
  File "/opt/conda/lib/python3.8/site-packages/werkzeug/formparser.py", line 230, in parse_from_environ
    return self.parse(get_input_stream(environ), mimetype, content_length, options)
  File "/opt/conda/lib/python3.8/site-packages/werkzeug/formparser.py", line 265, in parse
    return parse_func(self, stream, mimetype, content_length, options)
  File "/opt/conda/lib/python3.8/site-packages/werkzeug/formparser.py", line 142, in wrapper
    return f(self, stream, *args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/werkzeug/formparser.py", line 290, in _parse_multipart
    raise ValueError("Missing boundary")

Environment:

Additional context

jjmachan commented 3 years ago

Firstly, really sorry for the late reply. The issue seems to be a configuration error in the API Gateway. This only happens when you pass data as multipart/form-data; if you pass it in a binary format it will work.

I'm working on adding support for multipart/form-data too, but in the meantime you can refer to https://stackoverflow.com/a/56132015 or https://docs.aws.amazon.com/apigateway/latest/developerguide/api-gateway-payload-encodings-configure-with-console.html to get an idea of how to solve it on your end.
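As a rough illustration of what those links describe (not the exact steps, and the API id below is a placeholder), registering */* as a binary media type on the REST API with boto3 looks something like this:

import boto3

# Sketch: tell API Gateway to pass request bodies through as binary instead
# of re-encoding them as text. Replace the id with your own REST API's id.
apigw = boto3.client("apigateway")
apigw.update_rest_api(
    restApiId="your-rest-api-id",
    patchOperations=[
        # a "/" inside the media type is escaped as "~1" in the patch path
        {"op": "add", "path": "/binaryMediaTypes/*~1*"},
    ],
)

Note that a change like this only takes effect for a stage once the API is redeployed.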

Again really sorry for the late reply but I hope this helps : )

NaxAlpha commented 3 years ago

Thanks a lot for the update. Any timeline for the fix? And is it related to this repo or to bentoml/aws-sagemaker-deploy?

NaxAlpha commented 3 years ago

BTW I also did try the binary method like this:

curl -i \
    -X POST \
    --header "Content-Type:application/octet-stream" \
    --data-binary @data/mobile-sample.png \
    https://123.execute-api.region.amazonaws.com/prod/predict

But I get an error (see the attached screenshot).

jjmachan commented 3 years ago

Which input adapter are you using? Also, can you post your BentoService too?

NaxAlpha commented 3 years ago

Here is the sample service I am using:

import bentoml
import numpy as np
from bentoml.adapters import ImageInput

class Model(bentoml.BentoService):
    @bentoml.api(input=ImageInput(), batch=False)
    def predict(self, image):
        img = np.array(image)
        ...
cliu0507 commented 3 years ago

Here is the sample service I am using:

class Model(bentoml.BentoService):
    @bentoml.api(input=ImageInput(), batch=False)
    def predict(self, image):
        img = np.array(image)
        ...

I have met a similar problem. My understanding is that you need to use FileInput() as the input adapter.

But but but...

I tried to use FileInput() with code like the below, but unfortunately I still can't get it working on AWS SageMaker + API Gateway (running the API locally with bentoml serve Model:latest works without any problems):

def predict(self, file_streams: List[BinaryIO]) -> List[str]:
    print('start to process')
    for fs in file_streams:
        image_pil = Image.open(fs)
        image_numpy = np.array(image_pil)
        ...
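For reference, a complete FileInput-based service wrapping a handler like that would look roughly like the following (the class name and inference body are illustrative; the imports assume the BentoML 0.13 adapter API):

import bentoml
import numpy as np
from typing import BinaryIO, List
from PIL import Image
from bentoml.adapters import FileInput


class ImageClassifier(bentoml.BentoService):
    # batch=True: FileInput hands the handler a list of file streams
    @bentoml.api(input=FileInput(), batch=True)
    def predict(self, file_streams: List[BinaryIO]) -> List[str]:
        results = []
        for fs in file_streams:
            image_numpy = np.array(Image.open(fs))
            # ... run model inference on image_numpy here ...
            results.append("label")
        return results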

I did choose to change the settings on API Gateway to pass through all binary input, but I still get an error like this:

[2021-09-12 20:19:27,814] ERROR - Error caught in API function:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/bentoml/service/inference_api.py", line 176, in wrapped_func
    return self._user_func(*args, **kwargs)
  File "/bento/ImageClassifier/Image_classifier.py", line 83, in predict
    image_pil = Image.open(fs)
  File "/opt/conda/lib/python3.8/site-packages/PIL/Image.py", line 2944, in open
    im = _open_core(fp, filename, prefix, formats)
  File "/opt/conda/lib/python3.8/site-packages/PIL/Image.py", line 2930, in _open_core
    im = factory(fp, filename)
  File "/opt/conda/lib/python3.8/site-packages/PIL/ImageFile.py", line 121, in __init__
    self._open()
  File "/opt/conda/lib/python3.8/site-packages/PIL/ImImagePlugin.py", line 153, in _open
    s = s + self.fp.readline()
AttributeError: 'FileLike' object has no attribute 'readline'

I suspect the AWS API Gateway does something weird to the binary data (maybe reformats it) before passing it to SageMaker, because this line, image_pil = Image.open(fs), fails with fs not being a valid BinaryIO handle.
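One workaround I may try (assuming BentoML's FileLike object supports read(), which the traceback suggests since only readline() is missing) is to buffer the payload into an io.BytesIO before handing it to PIL, with a small helper along these lines (the helper name is mine, purely illustrative):

import io
from PIL import Image

def open_image(fs):
    # fs is BentoML's FileLike; io.BytesIO implements readline(), which some
    # PIL plugins (e.g. ImImagePlugin) call, so buffer the raw bytes first.
    return Image.open(io.BytesIO(fs.read()))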

@jjmachan Plz advise ty!

cliu0507 commented 3 years ago

Firstly, really sorry for the late reply. The issue seems to be a configuration error in the API Gateway. This only happens when you pass data as multipart/form-data; if you pass it in a binary format it will work.

I'm working on adding support for multipart/form-data too, but in the meantime you can refer to https://stackoverflow.com/a/56132015 or https://docs.aws.amazon.com/apigateway/latest/developerguide/api-gateway-payload-encodings-configure-with-console.html to get an idea of how to solve it on your end.

Again really sorry for the late reply but I hope this helps : )

Thanks @jjmachan, after following this I was able to get it working. For folks who still have problems, please make sure to manually click 'Deploy' on the API Gateway to make all changes effective (see screenshot).
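If you prefer to script that step, creating a new deployment for the stage with boto3 is roughly this (the API id and stage name are placeholders):

import boto3

# Redeploy the API so the binary media type / integration changes take effect
apigw = boto3.client("apigateway")
apigw.create_deployment(
    restApiId="your-rest-api-id",
    stageName="prod",
)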

jjmachan commented 3 years ago

Thanks @cliu0507 for adding this step too 🙌🏽

jjmachan commented 3 years ago

Hey @NaxAlpha @cliu0507, we have added another method of dealing with the ImageInput handler and form-data, and it also adds support for multiple endpoints. It would be really awesome if you could take a look at that and see if it solves these issues; your feedback is appreciated too.

NaxAlpha commented 3 years ago

Just tested it for multipart/form-data and it seems to be working overall, except for one thing: after redeploying, the first call gives this error (whether I call the API 1 minute after deployment or 10 minutes later):

{
  "message": "Internal Server Error"
}

But right after this first call, if I call again, then it works perfectly. Also, I still could not test the ImageInput because of the GPU issue.

jjmachan commented 3 years ago

That is strange. Can you get the logs from the API Gateway so that we can get a better idea of what is happening? You can refer to this article, https://docs.aws.amazon.com/apigateway/latest/developerguide/http-api-troubleshooting-lambda.html, since the API Gateway logs are not set up by default.

Also, I guess there are no logs in SageMaker's CloudWatch logs and this is a problem with the API Gateway?
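In case it helps, execution logging for a REST API stage can also be turned on from boto3, roughly like this (the API id and stage name are placeholders, and this assumes an account-level CloudWatch role ARN is already configured for API Gateway):

import boto3

# Enable INFO-level execution logging plus full request/response data
# for the "prod" stage of the API.
apigw = boto3.client("apigateway")
apigw.update_stage(
    restApiId="your-rest-api-id",
    stageName="prod",
    patchOperations=[
        {"op": "replace", "path": "/*/*/logging/loglevel", "value": "INFO"},
        {"op": "replace", "path": "/*/*/logging/dataTrace", "value": "true"},
    ],
)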

NaxAlpha commented 3 years ago

Yeah, when the error message comes, no logs are visible in either the Lambda or the endpoint.

jjmachan commented 3 years ago

Then it is some issue with the API Gateway. It would be really helpful if we could get the logs for the API Gateway, since I'm not able to reproduce the issue locally.

NaxAlpha commented 3 years ago

This is what I got for the first two requests, where I got the internal server error, after enabling gateway logs:

(screenshot of the gateway logs)

Furthermore, it looks like the gateway IS calling SageMaker and there are corresponding logs for every failed request in both Lambda and SageMaker, but the request is failing after 3 seconds in Lambda, maybe due to bootstrapping:

(screenshot of the logs)

jjmachan commented 3 years ago

Thanks a lot for getting the logs. The issue was the 3-second Lambda timeout, and I've now patched it to be the same as the timeout config option we have (so a healthy 60 sec by default). Thanks so much for trying it out and bringing up the issue; I wouldn't have figured it out otherwise 😄
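For anyone hitting this on an existing deployment made before the patch, bumping the function timeout by hand with boto3 looks roughly like this (the function name is a placeholder):

import boto3

# Raise the Lambda timeout from the 3 s default to match the endpoint timeout
lam = boto3.client("lambda")
lam.update_function_configuration(
    FunctionName="your-sagemaker-proxy-function",
    Timeout=60,
)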

NaxAlpha commented 3 years ago

OK, just tried the multipart and ImageInput services after this update. Both are working 👍