aws / sagemaker-pytorch-inference-toolkit

Toolkit for allowing inference and serving with PyTorch on SageMaker. Dockerfiles used for building SageMaker PyTorch containers are at https://github.com/aws/deep-learning-containers.

renaming of mxnet-model-server in sagemaker-inference package 1.5.3 causing entrypoint with command `serve` to fail #88

Status: Open. Opened by RZachLamberty 3 years ago.

RZachLamberty commented 3 years ago

**Describe the bug**

sagemaker-inference recently (10/15) released v1.5.3, which included this commit updating the name of the model server artifact and command from `mxnet-model-server` to `multi-model-server`.

All containers defined in this repository install sagemaker-inference as a dependency of this repo itself, via lines such as:

```dockerfile
RUN pip install --no-cache-dir "sagemaker-pytorch-inference<2"
```

and this repo's setup.py has an install_requires entry that includes `sagemaker-inference>=1.3.1`. As a result, sagemaker-inference==1.5.3 gets installed.
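Until the containers are rebuilt against a fixed release, pinning the transitive dependency below the rename should keep the old executable name in place (a sketch only; the exact pin is my assumption, not a fix from this repo):

```dockerfile
# Hypothetical mitigation: hold sagemaker-inference below 1.5.3 so the
# image still ships the mxnet-model-server command the CMD/ENTRYPOINT expect.
RUN pip install --no-cache-dir "sagemaker-pytorch-inference<2" "sagemaker-inference<1.5.3"
```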

So while the Dockerfile's CMD value (which calls `mxnet-model-server` directly) will still succeed, attempts to use the ENTRYPOINT with `serve` as a runtime argument will fail with this message:

```
Traceback (most recent call last):
  File "/usr/local/bin/dockerd-entrypoint.py", line 22, in <module>
    serving.main()
  File "/opt/conda/lib/python3.6/site-packages/sagemaker_pytorch_serving_container/serving.py", line 39, in main
    _start_model_server()
  File "/opt/conda/lib/python3.6/site-packages/retrying.py", line 49, in wrapped_f
    return Retrying(*dargs, **dkw).call(f, *args, **kw)
  File "/opt/conda/lib/python3.6/site-packages/retrying.py", line 206, in call
    return attempt.get(self._wrap_exception)
  File "/opt/conda/lib/python3.6/site-packages/retrying.py", line 247, in get
    six.reraise(self.value[0], self.value[1], self.value[2])
  File "/opt/conda/lib/python3.6/site-packages/six.py", line 703, in reraise
    raise value
  File "/opt/conda/lib/python3.6/site-packages/retrying.py", line 200, in call
    attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
  File "/opt/conda/lib/python3.6/site-packages/sagemaker_pytorch_serving_container/serving.py", line 35, in _start_model_server
    model_server.start_model_server(handler_service=HANDLER_SERVICE)
  File "/opt/conda/lib/python3.6/site-packages/sagemaker_inference/model_server.py", line 94, in start_model_server
    subprocess.Popen(multi_model_server_cmd)
  File "/opt/conda/lib/python3.6/subprocess.py", line 709, in __init__
    restore_signals, start_new_session)
  File "/opt/conda/lib/python3.6/subprocess.py", line 1344, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'multi-model-server': 'multi-model-server'
```
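The root cause is simply that `start_model_server` now spawns an executable named `multi-model-server`, which is not on `PATH` in images that only install `mxnet-model-server`. A minimal illustration of that failure mode (plain `subprocess` behavior, not the toolkit's code):

```python
import subprocess

# Popen raises FileNotFoundError when the requested executable is not on
# PATH, which is exactly the error the traceback above ends with.
try:
    subprocess.Popen(["multi-model-server", "--start"])
except FileNotFoundError as err:
    print(err)  # [Errno 2] No such file or directory: 'multi-model-server'
```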

**To reproduce**

  1. build any container
  2. mount a model and inference.py (e.g. half_plus_three) into `/opt/ml/model`
  3. `docker run [tag name] serve` (see the shell sketch after this list)
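Spelled out as shell commands (a sketch; the image tag and local model directory are placeholders, not names taken from this repo):

```sh
# 1. Build any of the containers in this repository; the tag is arbitrary.
docker build -t pytorch-inference-test .

# 2 & 3. Mount a model plus inference.py into /opt/ml/model and pass
# "serve" to the ENTRYPOINT; this is the invocation that fails.
docker run --rm \
  -v "$PWD/half_plus_three:/opt/ml/model" \
  pytorch-inference-test serve
```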

**Expected behavior**

The PyTorch model server serves the mounted model / inference.py.


saifvazir commented 10 months ago

Hi @RZachLamberty, I stumbled upon your issue here. I was trying to create a custom Docker image and hit a similar issue. Installing multi-model-server (`pip install multi-model-server`) did away with it for me. You can give it a try :)
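In Dockerfile form, that suggestion might look like the line below (a sketch; placement relative to the existing pip installs is an assumption):

```dockerfile
# Install the renamed server explicitly so the serve entrypoint can find it.
RUN pip install --no-cache-dir multi-model-server
```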