velociraptor111 closed this issue 4 years ago
Hello @velociraptor111,
I think this issue most likely isn't with the Python SDK, but with the code in the MXNet container itself, specifically the package it uses: https://github.com/aws/sagemaker-inference-toolkit.
In particular, this line: https://github.com/aws/sagemaker-inference-toolkit/blob/master/src/sagemaker_inference/model_server.py#L61. That line is meant to keep the container running; given the error you're getting, replacing it with a different approach might be better.
As a workaround, you might need to modify the inference toolkit to sleep for a few seconds right before the tail call, or replace the tail call with another solution. The simplest alternative I can think of is a `while true` loop.
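A minimal sketch of that "while true" alternative, assuming a simplified stand-in for the real `start_model_server` in `sagemaker_inference/model_server.py` (the actual function builds the `mxnet-model-server` command line itself; `server_cmd` here is a placeholder):

```python
import subprocess
import time

def start_model_server(server_cmd):
    """Sketch: keep the container alive without spawning `tail -f /dev/null`.

    `server_cmd` stands in for the mxnet-model-server command line the
    toolkit builds. Instead of a tail subprocess (which can itself fail
    to spawn), poll the server process in a plain Python loop.
    """
    process = subprocess.Popen(server_cmd)
    # "while true" in Python: wake up once a second and exit only
    # when the server process itself has terminated.
    while process.poll() is None:
        time.sleep(1)
    return process.returncode
```

This avoids the second `subprocess` call entirely, which is the call that was failing in the reported traceback.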
I'm going to transfer this issue to the toolkit.
Oh I see. Why is it failing for my image but not the current default MXNet container, given that my Docker image is based on the default MXNet container? All I added was one line, which was to import GluonCV.
It seems like the fix for this might take a while. In the meantime, is there any other way to simply make an additional Python package like GluonCV accessible in my entry_point.py?
Thank you in advance
@velociraptor111,
Good question, I'm not too sure about that one. Is the error consistent?
Usually requirements.txt is the way to go when you want to add libraries to your container, but I believe requirements.txt support was never added to the MXNet container and still needs to be. Without that, the only option is to do what you did: create a new image with your dependencies, which is never ideal.
One thing you could do is modify your entry_point.py to pip install your modules, which is pretty hackish but most likely a lot less work. I do recommend using local mode for your iterations, as it is quicker than waiting for instances to provision on SageMaker.
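The pip-install workaround could look something like the following sketch. The `install` helper and its placement are assumptions, not SageMaker API; the key point is that it runs once at module import (container start-up), not per request:

```python
import subprocess
import sys

def pip_install_cmd(package):
    """Build the pip invocation for installing `package` at runtime."""
    return [sys.executable, "-m", "pip", "install", package]

def install(package):
    """Hackish workaround: install a dependency from entry_point.py itself.

    Runs once when the module is imported (container start-up), so it
    adds to start-up time rather than per-request inference latency.
    """
    subprocess.check_call(pip_install_cmd(package))

# At the top of entry_point.py, before the import that needs it:
# install("gluoncv")
# import gluoncv
```

Because module-level code executes once, the install cost is paid at container start, which addresses the concern about inference-time overhead.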
Apologies for the experience, I'll see what I can do.
Yeah, the error is consistent. I have tried deploying twice now and still received the same error.
> Usually requirements.txt is the way to go when you want to add libraries into your container, but for the MXNet container I believe requirements.txt support was forgotten and needs to be added. Without that, the only option is to go about it the way you did, which was to create a new image with your dependencies, which is never ideal.
No wonder it didn't work for me. I found a person with a similar issue importing third-party libraries, but the solution outlined there still didn't work when I tried it. Here is the link to my comment on an OLD issue: https://github.com/aws/sagemaker-python-sdk/issues/664#issuecomment-541973833
> One thing you could do, is to modify your entry_point.py to pip install your modules, which is pretty hackish, but most likely a lot less work. I do recommend using "local_mode" for your iterations, as it is quicker than waiting for instances to provision on SageMaker.
Yeah, I was thinking of doing this, but it seems pretty hackish, and I'm not sure whether it might affect inference time.
Okay, please let me know how to fix this.
My team and I are seriously considering using AWS SageMaker for a huge chunk of our application, and the support has been the best so far!
@velociraptor111,
Gotcha, thanks for the information.
Looks like there are two problems: one with the tail call and one with requirements.txt.
I'll start with the tail call, since that can potentially cause jobs to fail regardless of their dependencies.
Hi @ChoiByungWook ,
I have some updates regarding the bug. I found some inconsistencies even with the default MXNet Serving container.
Here is the command:
```python
sagemaker_model = MXNetModel(model_data='s3://' + sagemaker_session.default_bucket() + '/model/yolo_object_person_detector.tar.gz',
                             role=role,
                             entry_point='entry_point.py',
                             py_version='py3',
                             framework_version='1.4.1',
                             sagemaker_session=sagemaker_session)

transformer = sagemaker_model.transformer(instance_count=1, instance_type='ml.m4.xlarge', output_path=batch_output)
transformer.transform(data=batch_input, content_type='application/x-image')
transformer.wait()
```
I ran the same script three times and here are the results & error messages:
FIRST RUN
```
Traceback (most recent call last):
  File "/usr/local/bin/dockerd-entrypoint.py", line 8, in <module>
    serving.main()
  File "/usr/local/lib/python3.6/site-packages/sagemaker_mxnet_serving_container/serving.py", line 42, in main
    model_server.start_model_server(handler_service=HANDLER_SERVICE)
  File "/usr/local/lib/python3.6/site-packages/sagemaker_inference/model_server.py", line 57, in start_model_server
    mms_process = subprocess.Popen(mxnet_model_server_cmd)
  File "/usr/local/lib/python3.6/subprocess.py", line 729, in __init__
    restore_signals, start_new_session)
  File "/usr/local/lib/python3.6/subprocess.py", line 1364, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
OSError: [Errno 14] Bad address: 'mxnet-model-server'
```
SECOND RUN
```
Traceback (most recent call last):
  File "/usr/local/bin/dockerd-entrypoint.py", line 8, in <module>
    serving.main()
  File "/usr/local/lib/python3.6/site-packages/sagemaker_mxnet_serving_container/serving.py", line 42, in main
    model_server.start_model_server(handler_service=HANDLER_SERVICE)
  File "/usr/local/lib/python3.6/site-packages/sagemaker_inference/model_server.py", line 63, in start_model_server
    '/dev/null'])
  File "/usr/local/lib/python3.6/subprocess.py", line 287, in call
    with Popen(*popenargs, **kwargs) as p:
  File "/usr/local/lib/python3.6/subprocess.py", line 729, in __init__
    restore_signals, start_new_session)
  File "/usr/local/lib/python3.6/subprocess.py", line 1364, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
OSError: [Errno 14] Bad address: 'tail'
```
THIRD RUN SUCCESS
The error is inconsistent. I suspect it has something to do with delays and timing in the server code, as you previously mentioned.
Hello @velociraptor111,
Thank you for reporting all of this, I apologize for the frustrating experience.
That first run's error is incredibly concerning. I'm not sure how to solve it, since we rely on mxnet-model-server being run. The only mitigation I can think of is to add a sleep or retry starting the model server multiple times, though I'm not sure whether this is a Docker-level problem or something even lower.
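The "sleep or multiple retries" idea could be sketched as below. This is a hypothetical mitigation for the intermittent `OSError: [Errno 14] Bad address`, not code from the toolkit; the function name and parameters are illustrative:

```python
import subprocess
import time

def popen_with_retries(cmd, attempts=3, delay=3.0):
    """Hypothetical mitigation: retry a flaky Popen call.

    Spawning mxnet-model-server intermittently fails with
    OSError: [Errno 14] Bad address, so retry a few times with a
    short pause before giving up.
    """
    last_error = None
    for _ in range(attempts):
        try:
            return subprocess.Popen(cmd)
        except OSError as err:
            last_error = err
            time.sleep(delay)
    raise last_error
```

If the failure is transient (e.g. a race during container start-up), a retry like this would mask it; if it is a lower-level Docker or kernel issue, retries may not help, which is the uncertainty noted above.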
As for the second run: I just submitted a PR that stops calling tail: https://github.com/aws/sagemaker-inference-toolkit/pull/11
I'm going to do some more testing before merging that PR.
I'll work on the requirements.txt in between.
Requirements.txt: https://github.com/aws/sagemaker-inference-toolkit/pull/12
It might take a few days before these get reviewed, properly tested, and released.
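The actual contents of PR #12 aren't reproduced here; as a rough sketch under assumed names, requirements.txt support in an inference container generally amounts to something like:

```python
import os
import subprocess
import sys

def maybe_install_requirements(model_dir):
    """Rough sketch (not the actual PR #12 implementation).

    If the unpacked model artifact ships a requirements.txt,
    pip-install it before the handler service loads. Returns True if
    an install was attempted, False otherwise.
    """
    req_path = os.path.join(model_dir, "requirements.txt")
    if not os.path.isfile(req_path):
        return False
    subprocess.check_call(
        [sys.executable, "-m", "pip", "install", "-r", req_path])
    return True
```

With something like this in the container start-up path, bundling a requirements.txt in the model archive would pull in packages such as GluonCV without rebuilding the image.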
Both changes have been merged. The next step is to make sure the containers that depend on this change consume it.
That means updating the following:
Thanks for the timely fix!
Once those additional updates are done, will the changes be automatically reflected in the current SDK?
Opened PRs to update the respective framework images:
These will become part of the latest pre-built PyTorch and MXNet images, which are pulled by the SDK. As long as you're using the latest framework version, yes, the changes will be automatically reflected.
Hi, just a follow-up error after trying again. Please check my additional follow-up comments in https://github.com/aws/sagemaker-inference-toolkit/pull/11
Apologies for the lack of updates; the changes have still not been released to the MXNet and PyTorch images.
Ah okay, thanks! Is there an approximate time for when the next release is scheduled?
We are actively working on the MXNet and PyTorch images. We do not have an ETA yet.
Okay, thanks. Closing the issue.
Describe the problem
I needed to add the GluonCV library to my code environment, and since the default MXNet container does not include the package, I needed to create a custom image with it installed.
I got the default MXNet container from https://github.com/aws/sagemaker-mxnet-serving-container and followed all the instructions. To include GluonCV, I then simply added this to the Dockerfile and built the image.
I built the image, then uploaded it to AWS ECR.
I am able to verify that the docker image has been successfully uploaded and I have a valid URI like so:
552xxxxxxx.dkr.ecr.us-west-2.amazonaws.com/preprod-mxnet-serving:1.4.1-cpu-py3
Then, when instantiating the MXNet model, I added a reference to this image URI like so
But I got an error message. Here is the full log