aws / sagemaker-inference-toolkit

Serve machine learning models within a 🐳 Docker container using 🧠 Amazon SageMaker.
Apache License 2.0

fix: wait for mms server process to finish instead of tailing dev null #11

Closed ChoiByungWook closed 4 years ago

ChoiByungWook commented 4 years ago

Issue #, if available: a user is running into memory issues due to `tail`. For more information, see https://github.com/aws/sagemaker-inference-toolkit/issues/9

Description of changes:

The user ran into an error caused by the command `tail -f /dev/null`.

The `tail` call was meant to keep the container running. Instead, I now wait on the server process to finish or return an error code. The process responsible for starting mxnet-model-server can't be waited on directly, because MMS spawns another subprocess that, for some reason, isn't tracked when calling `children()` on the mms_process; the parent of that child process points to bash.

For this reason, we look up the process by the cmdline it was created with, which comes from here: https://github.com/awslabs/mxnet-model-server/blob/master/mms/model_server.py#L56
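The cmdline lookup can be sketched with a `/proc` scan (a minimal, Linux-only illustration; `fake-mms-marker` and `find_pids_by_cmdline` are invented for this example — the real toolkit matches the actual cmdline that mxnet-model-server is launched with):

```python
import os
import subprocess
import sys
import time

# "fake-mms-marker" is a stand-in; the real code matches the
# mxnet-model-server launch command from model_server.py#L56.
MARKER = "fake-mms-marker"

def find_pids_by_cmdline(marker):
    """Scan /proc (Linux-only) for processes whose cmdline contains `marker`."""
    pids = []
    for pid in os.listdir("/proc"):
        if not pid.isdigit() or int(pid) == os.getpid():
            continue  # skip non-process entries and this process itself
        try:
            with open("/proc/%s/cmdline" % pid, "rb") as f:
                # cmdline args are NUL-separated; join them for matching
                cmdline = f.read().replace(b"\0", b" ").decode()
        except OSError:
            continue  # process exited between listdir and open
        if marker in cmdline:
            pids.append(int(pid))
    return pids

# Spawn a stand-in for the MMS subprocess, locate it by its cmdline, then
# wait on it instead of keeping the container alive with `tail -f /dev/null`.
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(3)  # " + MARKER])
time.sleep(0.5)  # give the child a moment to appear in /proc
found_pids = find_pids_by_cmdline(MARKER)
child.wait()  # blocks until the "server" exits; its status code is propagated
```

Waiting on the located process replaces the `tail -f /dev/null` keep-alive and lets the container exit with the server's status code instead of hanging forever.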

Testing

flake8: commands succeeded
  twine: commands succeeded
  py27: commands succeeded
  py36: commands succeeded
  congratulations :)

MXNet serving

I modified the MXNet 1.4.1 CPU Dockerfile to install the modified version of this package and ran the local and SageMaker integration tests.

local

tox -e py27 test/integration/local -- --docker-base-name preprod-mxnet-serving --tag 1.4.1-cpu-py2-modified-toolkit --py-version 2 --framework-version 1.4.1 --processor cpu

test/integration/local/test_default_model_fn.py::test_default_model_fn[py2-cpu] PASSED [ 20%]
test/integration/local/test_default_model_fn.py::test_default_model_fn_content_type[py2-cpu] PASSED [ 40%]
algo-1-yhd7y_1 | 2019-10-17 00:14:31,625 [WARN ] W-9000-model-stderr com.amazonaws.ml.mms.wlm.WorkerLifeCycle - [00:14:31] src/nnvm/legacy_json_util.cc:209: Loading symbol saved by previous version v0.11.0. Attempting to upgrade...
algo-1-yhd7y_1 | 2019-10-17 00:14:31,628 [WARN ] W-9000-model-stderr com.amazonaws.ml.mms.wlm.WorkerLifeCycle - [00:14:31] src/nnvm/legacy_json_util.cc:217: Symbol successfully upgraded!
test/integration/local/test_gluon_hosting.py::test_gluon_hosting[py2-cpu] PASSED [ 60%]
test/integration/local/test_hosting.py::test_hosting[py2-cpu] PASSED [ 80%]
test/integration/local/test_onnx.py::test_onnx_import[py2-cpu] PASSED [100%]

========== 5 passed in 99.73 seconds ==========

sagemaker

tox -e py36 test/integration/sagemaker -- --aws-id 633083500428 --docker-base-name sagemaker-mxnet-serving --instance-type ml.m4.xlarge --tag 1.4.1-cpu-py2-modified-toolkit

test/integration/sagemaker/test_batch_transform.py::test_batch_transform[py3-cpu] PASSED [ 33%]
test/integration/sagemaker/test_elastic_inference.py::test_elastic_inference[py3-cpu] SKIPPED [ 66%]
test/integration/sagemaker/test_hosting.py::test_hosting[py3-cpu] PASSED [100%]

PyTorch

test/integration/local/test_serving.py::test_serve_json_npy PASSED [ 25%]
test/integration/local/test_serving.py::test_serve_csv PASSED [ 50%]
test/integration/local/test_serving.py::test_serve_cpu_model_on_gpu SKIPPED [ 75%]
test/integration/local/test_serving.py::test_serving_calls_model_fn_once PASSED [100%]

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

velociraptor111 commented 4 years ago

Hi, thanks for the update. I tried again, but I still get the same error! It's weird that it still seems to be referencing the old file.

Traceback (most recent call last):
  File "/usr/local/bin/dockerd-entrypoint.py", line 8, in <module>
    serving.main()
  File "/usr/local/lib/python3.6/site-packages/sagemaker_mxnet_serving_container/serving.py", line 42, in main
    model_server.start_model_server(handler_service=HANDLER_SERVICE)
  File "/usr/local/lib/python3.6/site-packages/sagemaker_inference/model_server.py", line 63, in start_model_server
    '/dev/null'])
  File "/usr/local/lib/python3.6/subprocess.py", line 287, in call
    with Popen(*popenargs, **kwargs) as p:
  File "/usr/local/lib/python3.6/subprocess.py", line 729, in __init__
    restore_signals, start_new_session)
  File "/usr/local/lib/python3.6/subprocess.py", line 1364, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
OSError: [Errno 14] Bad address: 'tail'

Here is the code that I'm running

import sagemaker
from sagemaker.mxnet.model import MXNetModel
from sagemaker import get_execution_role

sagemaker_session = sagemaker.Session()
bucket = sagemaker_session.default_bucket()
role = get_execution_role()

print(sagemaker.__version__) #prints version 1.43.4.post1

batch_input = 's3://{}/test_images'.format(bucket) 
batch_output = 's3://{}/combined_results'.format(bucket) 

sagemaker_model = MXNetModel(model_data='s3://' + sagemaker_session.default_bucket() + '/model/yolo_object_person_detector.tar.gz',
                             role=role,
                             entry_point='combined_entry_point.py',
                             dependencies=['requirements.txt'],
                             py_version='py3',
                             framework_version='1.4.1',
                             sagemaker_session=sagemaker_session)

transformer = sagemaker_model.transformer(instance_count=1, instance_type='ml.m4.xlarge', output_path=batch_output)
transformer.transform(data=batch_input, content_type='application/x-image')

Am I missing some steps here? As far as I know, the changes are all automatic, right?

@laurenyu @ChoiByungWook

Or perhaps the Docker image that the sagemaker-python-sdk is pulling from is not updated with the new code changes yet?
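One way to check whether a given image actually bundles the updated toolkit is to query the installed package version from inside the container (a hedged sketch assuming Python 3.8+ for `importlib.metadata`; `sagemaker-inference` is the toolkit's distribution name on PyPI):

```python
from importlib.metadata import PackageNotFoundError, version

def installed_version(package="sagemaker-inference"):
    """Return the installed version of `package`, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None

# Run inside the serving container: an old version (or None) would mean the
# image was not rebuilt with the fixed toolkit and still runs the tail-based code.
print(installed_version())
```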