Toolkit for allowing inference and serving with PyTorch on SageMaker. Dockerfiles used for building SageMaker Pytorch Containers are at
breaking: Change Model server to Torchserve for PyTorch Inference #79

Closed dhanainme closed 4 years ago

dhanainme commented 4 years ago

Change Model server to Torchserve for PyTorch Inference

Use TorchServe in place of MMS for Pytorch Inference.
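For readers unfamiliar with the swap, here is a minimal sketch of how a serving layer might assemble a TorchServe launch command. It is not this toolkit's actual code: `build_torchserve_command` is a hypothetical helper and the paths are placeholders. Only the `torchserve` CLI flags (`--start`, `--model-store`, `--models`, `--ts-config`) come from TorchServe itself.

```python
from typing import List, Optional


def build_torchserve_command(
    model_store: str,
    models: str,
    ts_config: Optional[str] = None,
) -> List[str]:
    """Assemble an argv list for starting TorchServe.

    Hypothetical helper for illustration; paths are supplied by the caller.
    """
    cmd = [
        "torchserve",
        "--start",
        "--model-store", model_store,  # directory holding the model archives
        "--models", models,            # which archives to load at startup
    ]
    if ts_config:
        # Optional config.properties file (ports, worker counts, etc.).
        cmd += ["--ts-config", ts_config]
    return cmd


if __name__ == "__main__":
    # Placeholder paths, assumed for the example only.
    print(" ".join(build_torchserve_command("/tmp/model-store", "model.mar")))
```

Building the command as a list rather than a shell string keeps it safe to pass to `subprocess` without shell quoting.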

This PR depends on the sagemaker-inference-toolkit PR #58 being merged first. Until then, the existing integ tests are expected to fail.

Testing :

Tested with SageMaker local based on the buildspec.yaml file.


tox -e py36 -- test/integration/local --build-image -s

This fails with the following error, but still creates a Docker image: sagemaker-pytorch-inference:1.5.0-cpu-py3

Attaching to tmpce8xadei_algo-1-j3bfm_1
algo-1-j3bfm_1  | Traceback (most recent call last):
algo-1-j3bfm_1  |   File "/usr/local/bin/", line 21, in <module>
algo-1-j3bfm_1  |     from sagemaker_pytorch_serving_container import serving
algo-1-j3bfm_1  |   File "/opt/conda/lib/python3.7/site-packages/sagemaker_pytorch_serving_container/", line 18, in <module>
algo-1-j3bfm_1  |     from sagemaker_inference import torchserve
algo-1-j3bfm_1  | ImportError: cannot import name 'torchserve' from 'sagemaker_inference' (/opt/conda/lib/python3.7/site-packages/sagemaker_inference/
tmpce8xadei_algo-1-j3bfm_1 exited with code 1
Aborting on container exit...
Exception in thread Thread-1:
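The `ImportError` above means the installed `sagemaker_inference` package does not yet ship a `torchserve` module. A quick way to confirm that inside the container before re-running the tests is sketched below. `has_submodule` is a hypothetical helper, not part of either toolkit.

```python
import importlib


def has_submodule(package: str, name: str) -> bool:
    """Return True if `package.name` can be imported, False otherwise.

    Note that this really imports the module, so any import-time side
    effects of the package will run.
    """
    try:
        importlib.import_module(f"{package}.{name}")
    except ImportError:  # also covers ModuleNotFoundError
        return False
    return True


if __name__ == "__main__":
    # Expected to print False until the dependent toolkit change is installed.
    print(has_submodule("sagemaker_inference", "torchserve"))
```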

Now manually install the changes from the sagemaker-inference-toolkit PR (#89) into this container and commit it, then run the test again, this time without the --build-image flag.

tox -e py36 -- test/integration/local -s

ubuntu@ip-172-31-65-0:~/ts/sagemaker-pytorch-inference-toolkit$ tox -e py36 -- test/integration/local -s
GLOB sdist-make: /home/ubuntu/ts/sagemaker-pytorch-inference-toolkit/
py36 inst-nodeps: /home/ubuntu/ts/sagemaker-pytorch-inference-toolkit/.tox/dist/
py36 installed: apipkg==1.5,attrs==19.3.0,bcrypt==3.1.7,boto3==1.14.19,botocore==1.17.19,certifi==2020.6.20,cffi==1.14.0,chardet==3.0.4,click==7.1.2,coverage==5.2,cryptography==2.9.2,docutils==0.15.2,execnet==1.7.1,Flask==1.1.1,future==0.18.2,gevent==20.6.2,greenlet==0.4.16,gunicorn==20.0.4,idna==2.7,importlib-metadata==1.7.0,inotify-simple==1.2.1,itsdangerous==1.1.0,Jinja2==2.11.2,jmespath==0.10.0,MarkupSafe==1.1.1,mock==4.0.2,more-itertools==8.4.0,numpy==1.19.0,packaging==20.4,paramiko==2.7.1,Pillow==7.2.0,pkg-resources==0.0.0,pluggy==0.13.1,protobuf==3.12.2,protobuf3-to-dict==0.1.5,psutil==5.7.0,py==1.9.0,pycparser==2.20,PyNaCl==1.4.0,pyparsing==2.4.7,pytest==5.4.3,pytest-cov==2.10.0,pytest-forked==1.2.0,pytest-xdist==1.32.0,python-dateutil==2.8.1,PyYAML==5.3.1,requests==2.20.0,retrying==1.3.3,s3transfer==0.3.3,sagemaker==1.68.0,sagemaker-containers==2.8.6.post2,sagemaker-inference==1.3.2.post1,sagemaker-pytorch-inference @ file:///home/ubuntu/ts/sagemaker-pytorch-inference-toolkit/.tox/dist/,scipy==1.5.1,six==1.15.0,smdebug-rulesconfig==0.1.4,torch==1.5.1,torchvision==0.6.1,typing==,urllib3==1.22,wcwidth==0.2.5,Werkzeug==1.0.1,zipp==3.1.0,zope.event==4.4,zope.interface==5.1.0
py36 runtests: PYTHONHASHSEED='2603046058'
py36 runtests: commands[0] | coverage run --rcfile .coveragerc --source sagemaker_pytorch_serving_container -m pytest test/integration/local -s
WARNING:root:pandas failed to import. Analytics features will be impaired or broken.
=========================================================================================================== test session starts ============================================================================================================
platform linux -- Python 3.6.9, pytest-5.4.3, py-1.9.0, pluggy-0.13.1 -- /home/ubuntu/ts/sagemaker-pytorch-inference-toolkit/.tox/py36/bin/python3.6
cachedir: .pytest_cache
rootdir: /home/ubuntu/ts/sagemaker-pytorch-inference-toolkit, inifile: setup.cfg
plugins: forked-1.2.0, cov-2.10.0, xdist-1.32.0
collected 4 items

test/integration/local/ WARNING:sagemaker:Parameter image will be renamed to image_uri in SageMaker Python SDK v2.
WARNING:sagemaker:No framework_version specified, defaulting to version 0.4. framework_version will be required in SageMaker Python SDK v2. This is not the latest supported version. If you would like to use version 1.5.0, please add framework_version=1.5.0 to your constructor.
INFO:botocore.credentials:Found credentials in environment variables.
WARNING:sagemaker.local.image:Using the short-lived AWS credentials found in session. They might expire while running.
WARNING:urllib3.connectionpool:Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f20ef1a18d0>: Failed to establish a new connection: [Errno 111] Connection refused',)': /ping
WARNING:urllib3.connectionpool:Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f20ef1a1e10>: Failed to establish a new connection: [Errno 111] Connection refused',)': /ping
WARNING:urllib3.connectionpool:Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f20ef1a17b8>: Failed to establish a new connection: [Errno 111] Connection refused',)': /ping
Attaching to tmphbv2jo9s_algo-1-xr0n7_1
algo-1-xr0n7_1  | Model server started.
algo-1-xr0n7_1  | 2020-07-10 13:03:33,174 [INFO ] pool-1-thread-17 ACCESS_LOG - / "GET /ping HTTP/1.1" 200 4
!algo-1-xr0n7_1  | 2020-07-10 13:03:33,674 [INFO ] W-9005-model_1 ACCESS_LOG - / "POST /invocations HTTP/1.1" 200 132
algo-1-xr0n7_1  | 2020-07-10 13:03:34,124 [INFO ] W-9006-model_1 ACCESS_LOG - / "POST /invocations HTTP/1.1" 200 125
algo-1-xr0n7_1  | 2020-07-10 13:03:34,551 [INFO ] W-9009-model_1 ACCESS_LOG - / "POST /invocations HTTP/1.1" 200 122
algo-1-xr0n7_1  | 2020-07-10 13:03:34,718 [INFO ] W-9015-model_1 ACCESS_LOG - / "POST /invocations HTTP/1.1" 200 37
algo-1-xr0n7_1  | 2020-07-10 13:03:34,889 [INFO ] W-9000-model_1 ACCESS_LOG - / "POST /invocations HTTP/1.1" 200 43
algo-1-xr0n7_1  | 2020-07-10 13:03:35,052 [INFO ] W-9010-model_1 ACCESS_LOG - / "POST /invocations HTTP/1.1" 200 28
Gracefully stopping... (press Ctrl+C again to force)
test/integration/local/ WARNING:sagemaker:Parameter image will be renamed to image_uri in SageMaker Python SDK v2.
WARNING:sagemaker:No framework_version specified, defaulting to version 0.4. framework_version will be required in SageMaker Python SDK v2. This is not the latest supported version. If you would like to use version 1.5.0, please add framework_version=1.5.0 to your constructor.
WARNING:sagemaker.local.image:Using the short-lived AWS credentials found in session. They might expire while running.
WARNING:urllib3.connectionpool:Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f20ec9c8080>: Failed to establish a new connection: [Errno 111] Connection refused',)': /ping
WARNING:urllib3.connectionpool:Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f20ec9e3240>: Failed to establish a new connection: [Errno 111] Connection refused',)': /ping
WARNING:urllib3.connectionpool:Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f20ec9e3c50>: Failed to establish a new connection: [Errno 111] Connection refused',)': /ping
Attaching to tmp1evpzg0n_algo-1-1us1t_1
algo-1-1us1t_1  | Model server started.
algo-1-1us1t_1  | 2020-07-10 13:03:56,114 [INFO ] pool-1-thread-17 ACCESS_LOG - / "GET /ping HTTP/1.1" 200 3
!algo-1-1us1t_1  | 2020-07-10 13:03:56,305 [INFO ] W-9009-model_1 ACCESS_LOG - / "POST /invocations HTTP/1.1" 200 30
algo-1-1us1t_1  | 2020-07-10 13:03:56,493 [INFO ] W-9000-model_1 ACCESS_LOG - / "POST /invocations HTTP/1.1" 200 47
algo-1-1us1t_1  | 2020-07-10 13:03:56,678 [INFO ] W-9013-model_1 ACCESS_LOG - / "POST /invocations HTTP/1.1" 200 23
Gracefully stopping... (press Ctrl+C again to force)
test/integration/local/ WARNING:sagemaker:Parameter image will be renamed to image_uri in SageMaker Python SDK v2.
WARNING:sagemaker:No framework_version specified, defaulting to version 0.4. framework_version will be required in SageMaker Python SDK v2. This is not the latest supported version. If you would like to use version 1.5.0, please add framework_version=1.5.0 to your constructor.
WARNING:sagemaker.local.image:Using the short-lived AWS credentials found in session. They might expire while running.
WARNING:urllib3.connectionpool:Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f20ec9e6b38>: Failed to establish a new connection: [Errno 111] Connection refused',)': /ping
WARNING:urllib3.connectionpool:Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f20ec9e6f28>: Failed to establish a new connection: [Errno 111] Connection refused',)': /ping
WARNING:urllib3.connectionpool:Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f20ec9e20b8>: Failed to establish a new connection: [Errno 111] Connection refused',)': /ping
Attaching to tmpwuf8pafw_algo-1-eqkyb_1
algo-1-eqkyb_1  | Model server started.
algo-1-eqkyb_1  | 2020-07-10 13:04:17,696 [INFO ] pool-1-thread-17 ACCESS_LOG - / "GET /ping HTTP/1.1" 200 4
!algo-1-eqkyb_1  | 2020-07-10 13:04:17,886 [INFO ] W-9015-model_1 ACCESS_LOG - / "POST /invocations HTTP/1.1" 200 30
Gracefully stopping... (press Ctrl+C again to force)
test/integration/local/ WARNING:sagemaker:Parameter image will be renamed to image_uri in SageMaker Python SDK v2.
WARNING:sagemaker:No framework_version specified, defaulting to version 0.4. framework_version will be required in SageMaker Python SDK v2. This is not the latest supported version. If you would like to use version 1.5.0, please add framework_version=1.5.0 to your constructor.
WARNING:sagemaker.local.image:Using the short-lived AWS credentials found in session. They might expire while running.
WARNING:urllib3.connectionpool:Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f20ec870048>: Failed to establish a new connection: [Errno 111] Connection refused',)': /ping
WARNING:urllib3.connectionpool:Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f20ec870320>: Failed to establish a new connection: [Errno 111] Connection refused',)': /ping
WARNING:urllib3.connectionpool:Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f20ec8700f0>: Failed to establish a new connection: [Errno 111] Connection refused',)': /ping
Attaching to tmpzfzvz320_algo-1-61bac_1
algo-1-61bac_1  | Model server started.
algo-1-61bac_1  | 2020-07-10 13:04:38,944 [INFO ] pool-1-thread-3 ACCESS_LOG - / "GET /ping HTTP/1.1" 200 4
!algo-1-61bac_1  | 2020-07-10 13:04:38,968 [INFO ] W-9000-model_1 ACCESS_LOG - / "POST /invocations HTTP/1.1" 200 3
algo-1-61bac_1  | 2020-07-10 13:04:38,975 [INFO ] W-9001-model_1 ACCESS_LOG - / "POST /invocations HTTP/1.1" 200 1
algo-1-61bac_1  | 2020-07-10 13:04:38,981 [INFO ] W-9000-model_1 ACCESS_LOG - / "POST /invocations HTTP/1.1" 200 1
Gracefully stopping... (press Ctrl+C again to force)

============================================================================================================= warnings summary =============================================================================================================
  /home/ubuntu/ts/sagemaker-pytorch-inference-toolkit/test/integration/local/ PytestUnknownMarkWarning: Unknown pytest.mark.skip_cpu - is this a typo?  You can register custom marks to avoid this warning - for details, see

-- Docs:
================================================================================================= 4 passed, 1 warning in 91.55s (0:01:31) ==================================================================================================
warning: Module sagemaker_pytorch_serving_container was never imported. (module-not-imported)
warning: No data was collected. (no-data-collected)
py36 runtests: commands[1] | coverage report --fail-under=90 --include *sagemaker_pytorch_serving_container*
No data to report.
ERROR: InvocationError: '/home/ubuntu/ts/sagemaker-pytorch-inference-toolkit/.tox/py36/bin/coverage report --fail-under=90 --include *sagemaker_pytorch_serving_container*'
_________________________________________________________________________________________________________________ summary __________________________________________________________________________________________________________________
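The `Connection refused` retries against `/ping` in the run above appear to be the test client polling until the model server inside the container is up. The same wait-for-ready pattern can be sketched as follows. The helper names, the URL, and port 8080 are assumptions for illustration, not values taken from this PR.

```python
import time
import urllib.request
from typing import Callable


def wait_until_ready(
    probe: Callable[[], bool],
    attempts: int = 10,
    delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call `probe` until it returns True or `attempts` is exhausted."""
    for attempt in range(attempts):
        try:
            if probe():
                return True
        except OSError:
            # Connection refused and urllib's URLError are both OSError
            # subclasses; treat them as "server not up yet".
            pass
        if attempt < attempts - 1:
            sleep(delay)
    return False


def ping(url: str = "http://localhost:8080/ping", timeout: float = 2.0) -> bool:
    """Return True if the health endpoint answers 200 (URL is an assumption)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.status == 200


if __name__ == "__main__":
    print("ready" if wait_until_ready(ping, attempts=3) else "not ready")
```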

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

dhanainme commented 4 years ago

> I am not sure torchserve should be part of sagemaker-inference: aws/sagemaker-inference-toolkit#58
> Shouldn't that logic be moved here instead?

Have moved the logic here.

nadiaya commented 4 years ago

Have fixed this.

copying src/sagemaker_pytorch_serving_container/ -> sagemaker_pytorch_inference-1.5.2.dev0/src/sagemaker_pytorch_serving_container
copying src/sagemaker_pytorch_serving_container/ -> sagemaker_pytorch_inference-1.5.2.dev0/src/sagemaker_pytorch_serving_container
copying src/sagemaker_pytorch_serving_container/ -> sagemaker_pytorch_inference-1.5.2.dev0/src/sagemaker_pytorch_serving_container
copying src/sagemaker_pytorch_serving_container/ -> sagemaker_pytorch_inference-1.5.2.dev0/src/sagemaker_pytorch_serving_container
copying src/sagemaker_pytorch_serving_container/etc/ -> sagemaker_pytorch_inference-1.5.2.dev0/src/sagemaker_pytorch_serving_container/etc
copying src/sagemaker_pytorch_serving_container/etc/ -> sagemaker_pytorch_inference-1.5.2.dev0/src/sagemaker_pytorch_serving_container/etc
Writing sagemaker_pytorch_inference-1.5.2.dev0/setup.cfg
Creating tar archive
removing 'sagemaker_pytorch_inference-1.5.2.dev0' (and everything under it)
twine runtests: commands[1] | twine check dist/*.tar.gz
Checking dist/sagemaker_pytorch_inference-1.5.2.dev0.tar.gz: PASSED, with warnings
  warning: `long_description_content_type` missing. defaulting to `text/x-rst`.
_______________________________________________________________________________________________________________________________________________________________________________ summary ________________________________________________________________________________________________________________________________________________________________________________
  flake8: commands succeeded
  twine: commands succeeded
  congratulations :)
dhanainme commented 4 years ago
algo-1-doyas_1  | 2020-07-17 20:07:21,364 [INFO ] W-9001-model_1 org.pytorch.serve.wlm.WorkerThread - Backend response time: 569
algo-1-doyas_1  | 2020-07-17 20:07:21,364 [INFO ] W-9000-model_1 TS_METRICS -|#Level:Host|#hostname:2c8a2bc735be,timestamp:1595016441
algo-1-doyas_1  | 2020-07-17 20:07:21,364 [INFO ] W-9001-model_1 TS_METRICS -|#Level:Host|#hostname:2c8a2bc735be,timestamp:1595016441
algo-1-doyas_1  | 2020-07-17 20:07:23,333 [INFO ] pool-1-thread-3 ACCESS_LOG - / "GET /ping HTTP/1.1" 200 3
algo-1-doyas_1  | 2020-07-17 20:07:23,334 [INFO ] pool-1-thread-3 TS_METRICS - Requests2XX.Count:1|#Level:Host|#hostname:2c8a2bc735be,timestamp:null
!algo-1-doyas_1  | 2020-07-17 20:07:23,357 [INFO ] W-9000-model_1 org.pytorch.serve.wlm.WorkerThread - Backend response time: 1
algo-1-doyas_1  | 2020-07-17 20:07:23,357 [INFO ] W-9000-model_1-stdout MODEL_METRICS - PredictionTime.Milliseconds:0.05|#ModelName:model,Level:Model|#hostname:2c8a2bc735be,requestID:f08bfa54-47bd-444a-a91a-b6aa56819ce7,timestamp:1595016443
algo-1-doyas_1  | 2020-07-17 20:07:23,358 [INFO ] W-9000-model_1 ACCESS_LOG - / "POST /invocations HTTP/1.1" 200 5
algo-1-doyas_1  | 2020-07-17 20:07:23,358 [INFO ] W-9000-model_1 TS_METRICS - Requests2XX.Count:1|#Level:Host|#hostname:2c8a2bc735be,timestamp:null
algo-1-doyas_1  | 2020-07-17 20:07:23,365 [INFO ] W-9001-model_1 org.pytorch.serve.wlm.WorkerThread - Backend response time: 1
algo-1-doyas_1  | 2020-07-17 20:07:23,365 [INFO ] W-9001-model_1 ACCESS_LOG - / "POST /invocations HTTP/1.1" 200 1
algo-1-doyas_1  | 2020-07-17 20:07:23,365 [INFO ] W-9001-model_1-stdout MODEL_METRICS - PredictionTime.Milliseconds:0.04|#ModelName:model,Level:Model|#hostname:2c8a2bc735be,requestID:bbeb14f3-c0c4-4de2-bf58-faf948458a31,timestamp:1595016443
algo-1-doyas_1  | 2020-07-17 20:07:23,365 [INFO ] W-9001-model_1 TS_METRICS - Requests2XX.Count:1|#Level:Host|#hostname:2c8a2bc735be,timestamp:null
algo-1-doyas_1  | 2020-07-17 20:07:23,371 [INFO ] W-9000-model_1 org.pytorch.serve.wlm.WorkerThread - Backend response time: 0
algo-1-doyas_1  | 2020-07-17 20:07:23,371 [INFO ] W-9000-model_1 ACCESS_LOG - / "POST /invocations HTTP/1.1" 200 1
algo-1-doyas_1  | 2020-07-17 20:07:23,371 [INFO ] W-9000-model_1-stdout MODEL_METRICS - PredictionTime.Milliseconds:0.02|#ModelName:model,Level:Model|#hostname:2c8a2bc735be,requestID:407df61c-82c0-4059-b590-288551436fb6,timestamp:1595016443
algo-1-doyas_1  | 2020-07-17 20:07:23,371 [INFO ] W-9000-model_1 TS_METRICS - Requests2XX.Count:1|#Level:Host|#hostname:2c8a2bc735be,timestamp:null
Gracefully stopping... (press Ctrl+C again to force)

=========================================================================================================================================================================== warnings summary ===========================================================================================================================================================================
  /home/ubuntu/ts/sagemaker-pytorch-inference-toolkit/test/integration/local/ PytestUnknownMarkWarning: Unknown pytest.mark.skip_cpu - is this a typo?  You can register custom marks to avoid this warning - for details, see

-- Docs:
=============================================================================================================================================================== 4 passed, 1 warning in 136.40s (0:02:16) ===============================================================================================================================================================

Logs from a more recent run.
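The `PytestUnknownMarkWarning` for `skip_cpu` in both runs is unrelated to the TorchServe change. As the warning itself suggests, it can be silenced by registering the mark. A minimal sketch for a `conftest.py` is shown below. The mark description is an assumption about what `skip_cpu` is meant to do.

```python
# conftest.py (sketch)


def pytest_configure(config):
    """Register the custom mark so pytest stops warning about it."""
    config.addinivalue_line(
        "markers",
        # Description text is assumed, not taken from the repository.
        "skip_cpu: skip this test when running against a CPU image",
    )
```

Registering the mark only documents it. Any actual skipping logic still has to live in the test suite's own hooks or fixtures.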

laurenyu commented 4 years ago

As for why the file is named as it is: it contains a line to accommodate the fact that MMS exits right away:

if Torchserve doesn't need that, then we should be able to remove the file altogether (rather than just renaming it)

dhanainme commented 4 years ago

> for why is named such - it contains a line to accommodate for the fact that MMS exits right away:
>
> if Torchserve doesn't need that, then we should be able to remove the file altogether (rather than just renaming it)

There may not be any differences between MMS & TS for this as the behaviour would not have changed.
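To make the point under discussion concrete: when a server launcher daemonizes and returns immediately, a container entrypoint has to block afterwards or the container exits. The sketch below illustrates that pattern in general terms. It is not the file being discussed, `serve_and_block` is a hypothetical name, and whether TorchServe needs this is exactly the question raised in this thread.

```python
import subprocess
from typing import Callable, List, Optional


def serve_and_block(
    start_cmd: List[str],
    run: Callable[[List[str]], object] = subprocess.check_call,
    block: Optional[Callable[[], object]] = None,
) -> None:
    """Start the server, then keep the foreground process alive."""
    run(start_cmd)  # returns right away if the server daemonizes
    if block is None:
        # A common idiom for holding a container open indefinitely.
        def block() -> object:
            return subprocess.call(["tail", "-f", "/dev/null"])
    block()
```

The `run` and `block` parameters exist only so the behavior can be exercised without starting a real server.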

dhanainme commented 4 years ago

It looks like one of the integ tests uses the DLC container.

From test logs :

[Container] 2020/07/27 18:46:51 Running command test_cmd="IGNORE_COVERAGE=- tox -e py36 -- test/integration/local --build-image --push-image --dockerfile-type pytorch --region $AWS_DEFAULT_REGION --docker-base-name $ECR_REPO --aws-id $ACCOUNT --framework-version $FRAMEWORK_VERSION --processor cpu --tag $GENERIC_TAG" - ✅ PASSING

[Container] 2020/07/27 18:54:25 Running command test_cmd="IGNORE_COVERAGE=- tox -e py36 -- test/integration/local --build-image --push-image --dockerfile-type dlc.cpu --region $AWS_DEFAULT_REGION --docker-base-name $ECR_REPO --aws-id $ACCOUNT --framework-version $FRAMEWORK_VERSION --processor cpu --tag $DLC_CPU_TAG" - ❌ NOT PASSING

This may not pass until the DLC PR is merged, and that PR in turn depends on this one.

