Toolkit for inference and serving with PyTorch on SageMaker. Dockerfiles used for building SageMaker PyTorch Containers are at https://github.com/aws/deep-learning-containers.
Apache License 2.0
Fix integration tests and update Python versions #154
Updated the classifiers in setup to update the Python versions to Python 3.8, 3.9 and 3.10.
Updated the test dependencies to use later versions and removed sagemaker-containers.
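As a sketch of what the setup metadata change might look like (only the Python versions and the removal of sagemaker-containers come from this PR; the remaining names and values are placeholders):

```python
# Illustrative fragment of the setup() keyword arguments; the test
# dependency names shown are placeholders, not the PR's actual pins.
SETUP_KWARGS = {
    "classifiers": [
        "Programming Language :: Python :: 3.8",
        "Programming Language :: Python :: 3.9",
        "Programming Language :: Python :: 3.10",
    ],
    # test extras: 'sagemaker-containers' removed, remaining deps updated
    "extras_require": {"test": ["pytest", "sagemaker"]},
}
```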
test/container/2.0.1/:
Created a CPU DLC Dockerfile with the PyTorch 2.0.1 TorchServe SageMaker DLC image and installed the PyTorch inference toolkit binary (similar to test/container/1.10.2/Dockerfile.dlc.cpu).
Created a GPU DLC Dockerfile with the PyTorch 2.0.1 TorchServe SageMaker DLC image and installed the PyTorch inference toolkit binary (similar to test/container/1.10.2/Dockerfile.dlc.gpu).
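A minimal sketch of what such a DLC-based test Dockerfile typically contains (the exact base image tag and the binary path are assumptions, not taken from the PR):

```dockerfile
# Assumed base image tag; the PR only states it is the PyTorch 2.0.1
# TorchServe SageMaker DLC image.
FROM 763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-inference:2.0.1-cpu-py310-ubuntu20.04-sagemaker

# Install the locally built PyTorch inference toolkit binary
# (the path is illustrative).
COPY dist/sagemaker_pytorch_inference-*.tar.gz /sagemaker_pytorch_inference.tar.gz
RUN pip install --no-cache-dir /sagemaker_pytorch_inference.tar.gz \
    && rm /sagemaker_pytorch_inference.tar.gz
```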
Updated default_output_fn to check the type of prediction with is instead of == to pass the flake8 check.
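The is/== change addresses flake8's E721 warning ("do not compare types"); a minimal illustration (not the toolkit's actual code):

```python
# flake8 E721 flags equality comparison of types; identity comparison
# with `is` is the accepted form for exact type checks.
def describe(prediction):
    # Flagged by flake8 (E721): `if type(prediction) == list:`
    if type(prediction) is list:  # passes the E721 check
        return "list"
    return "other"
```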
UPDATE:
Updated GPU instance type to g4dn.12xlarge from p3.8xlarge since the latter is often unavailable in the us-west-2 region.
Added GPU and CPU Dockerfiles for PyTorch 2.0.0 as well, since this version is also currently supported by the DLC team.
Updated buildspec.yml to run integration tests for both framework versions, i.e. 2.0.0 and 2.0.1.
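Conceptually, the buildspec change amounts to running the same test stage once per framework version; a sketch (the pytest command shown is illustrative, not the exact buildspec content):

```shell
#!/bin/sh
# Run the integration test suite once for each supported framework version.
for FRAMEWORK_VERSION in 2.0.0 2.0.1; do
  echo "pytest test/integration --framework-version ${FRAMEWORK_VERSION}"
done
```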
EIA is no longer an active project for the DLC team (https://github.com/aws/deep-learning-containers/pull/2466); the last image they released is 763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-inference-eia:1.5.1-cpu-py38-ubuntu20.04. Commented out the EIA test commands in buildspec.yml, since they fail while building the EIA image (the pinned numpy version is not available in the base image used to build it). Building this image serves no purpose anyway, since the EIA tests are skipped due to this condition:
@pytest.mark.skip(
    reason="Latest EIA version - 1.5.1 uses mms. Enable when EIA images use torchserve"
)
Set the environment variable NCCL_SHM_DISABLE=1 when creating the SageMaker endpoint for the GPU tests to avoid this error:
NCCL Error 2: unhandled system error
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/sagemaker_inference/transformer.py", line 142, in transform
    result = self._run_handler_function(
  File "/opt/conda/lib/python3.8/site-packages/sagemaker_inference/transformer.py", line 276, in _run_handler_function
    result = func(*argv_context)
  File "/opt/conda/lib/python3.8/site-packages/sagemaker_inference/transformer.py", line 260, in _default_transform_fn
    prediction = self._run_handler_function(self._predict_fn, *(data, model))
  File "/opt/conda/lib/python3.8/site-packages/sagemaker_inference/transformer.py", line 272, in _run_handler_function
    result = func(*argv)
  File "/opt/conda/lib/python3.8/site-packages/sagemaker_pytorch_serving_container/default_pytorch_inference_handler.py", line 125, in default_predict_fn
    output = model(input_data)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 167, in forward
    replicas = self.replicate(self.module, self.device_ids[:len(inputs)])
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 172, in replicate
    return replicate(module, device_ids, not torch.is_grad_enabled())
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/parallel/replicate.py", line 91, in replicate
    param_copies = _broadcast_coalesced_reshape(params, devices, detach)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/parallel/replicate.py", line 67, in _broadcast_coalesced_reshape
    return comm.broadcast_coalesced(tensors, devices)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/parallel/comm.py", line 58, in broadcast_coalesced
    return torch._C._broadcast_coalesced(tensors, devices, buffer_size)
RuntimeError: NCCL Error 2: unhandled system error
Note that this error is specific to torch.nn.DataParallel (https://github.com/pytorch/pytorch/issues/73775), which is used when creating the MNIST model; it causes the worker to die and leads to an error code 500. The issue does not occur when running the same tests on a g4dn.xlarge EC2 instance.
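As a hedged sketch of how such an environment variable reaches NCCL: it is read from the process environment at initialization, so it must be set in the serving container before the first NCCL call; the SageMaker Python SDK lets you pass it through the model's env argument (the commented line is illustrative, not the actual test code, and the other PyTorchModel arguments are omitted):

```python
import os

# NCCL reads NCCL_SHM_DISABLE from the environment at initialization,
# so it must be set before the first CUDA/NCCL call in the worker.
os.environ["NCCL_SHM_DISABLE"] = "1"

# With the SageMaker Python SDK, the tests instead pass it to the
# endpoint's container (illustrative):
# model = PyTorchModel(..., env={"NCCL_SHM_DISABLE": "1"})
```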
Pinned the versions of the dependencies in setup.py.
Changed default versions of --dockerfile-type and --framework-version in conftest.py.
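The conftest.py defaults can be sketched with pytest's standard pytest_addoption hook; the option names come from this PR, while the dockerfile-type choices and its default are assumptions:

```python
# Sketch of conftest.py option defaults. The `--dockerfile-type`
# choices and default are assumed, not taken from the PR.
def pytest_addoption(parser):
    parser.addoption("--framework-version", default="2.0.1")
    parser.addoption("--dockerfile-type", default="dlc.cpu",
                     choices=["dlc.cpu", "dlc.gpu"])
```

pytest calls this hook at startup; tests then read the values with request.config.getoption("--framework-version").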
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.
Issue #, if available:
Description of changes:
Note: For more details about changes in SageMaker Python SDK from v1 to v2, please refer to this link.
The following files were updated:
Updated the classifiers in setup to update the Python versions to Python 3.8, 3.9 and 3.10.
Removed sagemaker-containers.
Updated the envlist to remove py36, py37 and add py38, py39, py310.
Removed sagemaker-containers.
Added # to pass the flake8 check.
Used serializers and deserializers from sagemaker.
Removed content_types from sagemaker_inference.
Removed CONTENT_TYPE_TO_SERIALIZER_MAP and ACCEPT_TYPE_TO_SERIALIZER_MAP.
In PyTorchModel, replaced image with image_uri (SageMaker Python SDK v2 has replaced the argument name image with image_uri).
In PyTorchModel, replaced image with image_uri.
In PyTorchModel, replaced image with image_uri.
Updated FRAMEWORK_VERSION to 2.0.1.
Updated SETUP_CMDS to use Python 3.8 instead of Python 3.6.
Changed py36, py37 to py38, py39, py310.
Changed py36 to py38.
Updated the AMI to ami-03e3ef8c92fdb39ad.
Updated model_fn to load the model on GPU.
Changed py36, py37 to py38, py39, py310.
Updated default_output_fn to check the type of prediction with is instead of == to pass the flake8 check.
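The envlist change described above corresponds to a tox.ini fragment along these lines (only the py* environments are taken from the PR; any other envs that tox.ini may list are omitted here):

```ini
[tox]
envlist = py38,py39,py310
```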