aws / sagemaker-inference-toolkit

Serve machine learning models within a 🐳 Docker container using 🧠 Amazon SageMaker.
Apache License 2.0

Error in batch transform with custom image #9

Closed velociraptor111 closed 4 years ago

velociraptor111 commented 4 years ago

Describe the problem

I needed to add the GluonCV library to my code environment, and since the default MXNet container does not include the Python package, I needed to create a custom image with it installed.

I got the default MXNet container from here: https://github.com/aws/sagemaker-mxnet-serving-container and followed all the instructions. To include GluonCV, I then simply added this to the Dockerfile and built the image:

RUN ${PIP} install --no-cache-dir mxnet-mkl==$MX_VERSION \
                                  mxnet-model-server==$MMS_VERSION \
                                  keras-mxnet==2.2.4.1 \
                                  numpy==1.14.5 \
                                  gluoncv \
                                  onnx==1.4.1 \
                                  ...

I built the image, then uploaded it to AWS ECR.

I am able to verify that the docker image has been successfully uploaded and I have a valid URI like so: 552xxxxxxx.dkr.ecr.us-west-2.amazonaws.com/preprod-mxnet-serving:1.4.1-cpu-py3

THEN, when instantiating the MXNet model, I added a reference to this image URI like so

sagemaker_model = MXNetModel(model_data='s3://' + sagemaker_session.default_bucket() + '/model/yolo_object_person_detector.tar.gz',
                             role=role,
                             entry_point='entry_point.py',
                             image='552xxxxxxxx.dkr.ecr.us-west-2.amazonaws.com/preprod-mxnet-serving:1.4.1-cpu-py3',
                             py_version='py3',
                             framework_version='1.4.1',
                             sagemaker_session=sagemaker_session)

BUT I got an error message. Here is the full log:

Traceback (most recent call last):
File "/usr/local/bin/dockerd-entrypoint.py", line 21, in <module>
serving.main()
File "/usr/local/lib/python3.6/site-packages/sagemaker_mxnet_serving_container/serving.py", line 54, in main
_start_model_server()
File "/usr/local/lib/python3.6/site-packages/retrying.py", line 49, in wrapped_f
return Retrying(*dargs, **dkw).call(f, *args, **kw)
File "/usr/local/lib/python3.6/site-packages/retrying.py", line 206, in call
return attempt.get(self._wrap_exception)
File "/usr/local/lib/python3.6/site-packages/retrying.py", line 247, in get
six.reraise(self.value[0], self.value[1], self.value[2])
File "/usr/local/lib/python3.6/site-packages/six.py", line 693, in reraise
raise value
File "/usr/local/lib/python3.6/site-packages/retrying.py", line 200, in call
attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
File "/usr/local/lib/python3.6/site-packages/sagemaker_mxnet_serving_container/serving.py", line 49, in _start_model_server
model_server.start_model_server(handler_service=HANDLER_SERVICE)
File "/usr/local/lib/python3.6/site-packages/sagemaker_inference/model_server.py", line 63, in start_model_server
'/dev/null'])
File "/usr/local/lib/python3.6/subprocess.py", line 287, in call
with Popen(*popenargs, **kwargs) as p:
File "/usr/local/lib/python3.6/subprocess.py", line 729, in __init__
restore_signals, start_new_session)
File "/usr/local/lib/python3.6/subprocess.py", line 1364, in _execute_child
raise child_exception_type(errno_num, err_msg, err_filename)

OSError: [Errno 14] Bad address: 'tail'
ChoiByungWook commented 4 years ago

Hello @velociraptor111,

I think this issue most likely isn't with the Python SDK, but with the code in the MXNet container itself, in particular the package it is using: https://github.com/aws/sagemaker-inference-toolkit.

Specifically, this line: https://github.com/aws/sagemaker-inference-toolkit/blob/master/src/sagemaker_inference/model_server.py#L61. That line is meant to keep the container running; however, given the error you are getting, replacing it with a different solution might be better.

As a workaround, you might need to modify the inference toolkit to add something like a 3-second sleep right before the tail call, or change it to another solution. The simplest alternative I can think of is a while True loop.
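
For illustration only, here is a rough sketch of that workaround as a small hypothetical helper (not the toolkit's actual code; the real tail call lives inside start_model_server in model_server.py):

import subprocess
import time


def keep_container_alive():
    # Option 1: give the model server a moment to come up, then run the existing tail call.
    time.sleep(3)
    subprocess.call(['tail', '-f', '/dev/null'])

    # Option 2: drop tail entirely and block in Python instead.
    # while True:
    #     time.sleep(60)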

I'm going to transfer this issue to the toolkit.

velociraptor111 commented 4 years ago

Oh I see. Why is it failing for mine but not the current default MXNet container, since my Docker image is based on the default MXNet container? All I added was one line, which installs GluonCV.

It seems like the fix for this might take a while. In the meantime, is there any other way to simply make an additional Python package like GluonCV accessible in my entry_point.py?

Thank you in advance

ChoiByungWook commented 4 years ago

@velociraptor111,

Good question, I'm not too sure about that one. Is the error consistent?

Usually requirements.txt is the way to go when you want to add libraries into your container, but for the MXNet container I believe requirements.txt support was forgotten and needs to be added. Without that, the only option is to go about it the way you did, which was to create a new image with your dependencies, which is never ideal.

One thing you could do is modify your entry_point.py to pip install your modules, which is pretty hackish but most likely a lot less work. I do recommend using "local_mode" for your iterations, as it is quicker than waiting for instances to provision on SageMaker.
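
For example, a rough sketch of that hack at the top of entry_point.py (gluoncv is just the example package here):

import subprocess
import sys

# Hackish workaround: install extra dependencies when the container loads entry_point.py.
subprocess.check_call([sys.executable, '-m', 'pip', 'install', 'gluoncv'])

import gluoncv  # noqa: E402  (imported after the runtime install)

Since this runs when the script is first loaded, it should add to worker startup time rather than to per-request inference time.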

Apologies for the experience, I'll see what I can do.

velociraptor111 commented 4 years ago

Yeah, the error is consistent. I have tried deploying two times now and still received the same error.

> Usually requirements.txt is the way to go when you want to add libraries into your container, but for the MXNet container I believe requirements.txt support was forgotten and needs to be added. Without that, the only option is to go about it the way you did, which was to create a new image with your dependencies, which is never ideal.

No wonder it didn't work for me. I found someone with a similar issue of importing third-party libraries, but the solution outlined there still didn't work for me when I tried it. Here is the link to my comment on an OLD issue: https://github.com/aws/sagemaker-python-sdk/issues/664#issuecomment-541973833

> One thing you could do is modify your entry_point.py to pip install your modules, which is pretty hackish but most likely a lot less work. I do recommend using "local_mode" for your iterations, as it is quicker than waiting for instances to provision on SageMaker.

Yeah, I was thinking of doing this, but it seems pretty hackish and I'm not sure whether it might affect inference time.

Okay, please let me know how to fix this.

My team and I are seriously considering using AWS SageMaker for a huge chunk of our application, and the support has been the best so far!

ChoiByungWook commented 4 years ago

@velociraptor111,

Gotcha, thanks for the information.

Looks like there are two problems, one for the tail call and one for requirements.txt.

I'll start with the tail call, since that can potentially cause jobs to fail regardless of whether or not they have extra dependencies.

velociraptor111 commented 4 years ago

Hi @ChoiByungWook ,

I have some updates regarding the bug. I found some inconsistencies even with the default MXNet Serving container.

Here is the command:

sagemaker_model = MXNetModel(model_data='s3://' + sagemaker_session.default_bucket() + '/model/yolo_object_person_detector.tar.gz',
                             role=role,
                             entry_point='entry_point.py',
                             py_version='py3',
                             framework_version='1.4.1',
                             sagemaker_session=sagemaker_session)

transformer = sagemaker_model.transformer(instance_count=1, instance_type='ml.m4.xlarge', output_path=batch_output)

transformer.transform(data=batch_input, content_type='application/x-image')

transformer.wait()

I ran the same script three times and here are the results & error messages:

FIRST RUN

Traceback (most recent call last):
File "/usr/local/bin/dockerd-entrypoint.py", line 8, in <module>
serving.main()
File "/usr/local/lib/python3.6/site-packages/sagemaker_mxnet_serving_container/serving.py", line 42, in main
model_server.start_model_server(handler_service=HANDLER_SERVICE)
File "/usr/local/lib/python3.6/site-packages/sagemaker_inference/model_server.py", line 57, in start_model_server
mms_process = subprocess.Popen(mxnet_model_server_cmd)
File "/usr/local/lib/python3.6/subprocess.py", line 729, in __init__
restore_signals, start_new_session)
File "/usr/local/lib/python3.6/subprocess.py", line 1364, in _execute_child
raise child_exception_type(errno_num, err_msg, err_filename)

OSError: [Errno 14] Bad address: 'mxnet-model-server'

SECOND RUN

Traceback (most recent call last):
File "/usr/local/bin/dockerd-entrypoint.py", line 8, in <module>
serving.main()
File "/usr/local/lib/python3.6/site-packages/sagemaker_mxnet_serving_container/serving.py", line 42, in main
model_server.start_model_server(handler_service=HANDLER_SERVICE)
File "/usr/local/lib/python3.6/site-packages/sagemaker_inference/model_server.py", line 63, in start_model_server
'/dev/null'])
File "/usr/local/lib/python3.6/subprocess.py", line 287, in call
with Popen(*popenargs, **kwargs) as p:
File "/usr/local/lib/python3.6/subprocess.py", line 729, in __init__
restore_signals, start_new_session)
File "/usr/local/lib/python3.6/subprocess.py", line 1364, in _execute_child
raise child_exception_type(errno_num, err_msg, err_filename)

OSError: [Errno 14] Bad address: 'tail'

THIRD RUN SUCCESS

The error is inconsistent. I suspect it has something to do with delays and timing in the server code, as you previously mentioned.

ChoiByungWook commented 4 years ago

Hello @velociraptor111,

Thank you for reporting all of this, I apologize for the frustrating experience.

That first run's error is incredibly concerning. I'm not too sure how to solve it, as we rely on mxnet-model-server being started. The only thing I can think of is to add a sleep or retry starting the model server multiple times; however, I'm not sure whether this is a Docker-level problem or something even lower.
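
Just to illustrate the retry idea (this is not the toolkit's actual code, and the function name is hypothetical), the launch could be wrapped along these lines:

import subprocess
import time

MAX_ATTEMPTS = 3


def launch_model_server_with_retries(mxnet_model_server_cmd):
    # Retry the subprocess launch a few times before giving up.
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            return subprocess.Popen(mxnet_model_server_cmd)
        except OSError:
            if attempt == MAX_ATTEMPTS:
                raise
            time.sleep(3)  # short pause before the next attempt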

As for the second run: I just submitted a PR which won't call tail anymore. https://github.com/aws/sagemaker-inference-toolkit/pull/11

I'm going to do some more testing before merging that PR.

I'll work on the requirements.txt in between.

Requirements.txt: https://github.com/aws/sagemaker-inference-toolkit/pull/12

Might take a few days before these get reviewed, tested properly and released.

ChoiByungWook commented 4 years ago

Both changes have been merged. The next steps are to make sure the containers that are dependent on this change consume them.

Which means updating the following:

velociraptor111 commented 4 years ago

Thanks for the timely fix!

Once those additional updates have been done, they will be automatically reflected in the current SDK, right?

laurenyu commented 4 years ago

Opened PRs to update the respective framework images:

these will become part of the latest pre-built PyTorch and MXNet images, which are pulled by the SDK. As long as you're using the latest framework version, yes, the changes will be automatically reflected.

velociraptor111 commented 4 years ago

Hi, just a follow-up error after trying again. Please check my additional follow-up comments in https://github.com/aws/sagemaker-inference-toolkit/pull/11

laurenyu commented 4 years ago

apologies for the lack of update - the changes have still not been released to the MXNet and PyTorch images

velociraptor111 commented 4 years ago

Ah okay, thanks! Is there an approximate time for when the next release is scheduled?

yuanzhua commented 4 years ago

We are actively working on the MXNet and PyTorch images. We do not have an ETA yet.

velociraptor111 commented 4 years ago

Okay thanks. Closing issue