Closed: dylanhellems closed this issue 3 months ago.
We are facing the same error too. The traceback is:
```
['torchserve', '--start', '--model-store', '/.sagemaker/ts/models', '--ts-config', '/etc/sagemaker-ts.properties', '--log-config', '/opt/conda/lib/python3.9/site-packages/sagemaker_pytorch_serving_container/etc/log4j2.xml', '--models', 'model=/opt/ml/model']
    serving.main()
  File "/opt/conda/lib/python3.9/site-packages/sagemaker_pytorch_serving_container/serving.py", line 38, in main
    _start_torchserve()
  File "/opt/conda/lib/python3.9/site-packages/retrying.py", line 56, in wrapped_f
    return Retrying(*dargs, **dkw).call(f, *args, **kw)
  File "/opt/conda/lib/python3.9/site-packages/retrying.py", line 257, in call
    return attempt.get(self._wrap_exception)
  File "/opt/conda/lib/python3.9/site-packages/retrying.py", line 301, in get
    six.reraise(self.value[0], self.value[1], self.value[2])
  File "/opt/conda/lib/python3.9/site-packages/six.py", line 719, in reraise
    raise value
  File "/opt/conda/lib/python3.9/site-packages/retrying.py", line 251, in call
    attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
  File "/opt/conda/lib/python3.9/site-packages/sagemaker_pytorch_serving_container/serving.py", line 34, in _start_torchserve
    torchserve.start_torchserve(handler_service=HANDLER_SERVICE)
  File "/opt/conda/lib/python3.9/site-packages/sagemaker_pytorch_serving_container/torchserve.py", line 102, in start_torchserve
    ts_process = _retrieve_ts_server_process()
  File "/opt/conda/lib/python3.9/site-packages/retrying.py", line 56, in wrapped_f
    return Retrying(*dargs, **dkw).call(f, *args, **kw)
  File "/opt/conda/lib/python3.9/site-packages/retrying.py", line 266, in call
    raise attempt.get()
  File "/opt/conda/lib/python3.9/site-packages/retrying.py", line 301, in get
    six.reraise(self.value[0], self.value[1], self.value[2])
  File "/opt/conda/lib/python3.9/site-packages/six.py", line 719, in reraise
    raise value
  File "/opt/conda/lib/python3.9/site-packages/retrying.py", line 251, in call
    attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
  File "/opt/conda/lib/python3.9/site-packages/sagemaker_pytorch_serving_container/torchserve.py", line 187, in _retrieve_ts_server_process
    if TS_NAMESPACE in process.cmdline():
  File "/opt/conda/lib/python3.9/site-packages/psutil/__init__.py", line 719, in cmdline
    return self._proc.cmdline()
  File "/opt/conda/lib/python3.9/site-packages/psutil/_pslinux.py", line 1714, in wrapper
    return fun(self, *args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/psutil/_pslinux.py", line 1853, in cmdline
    self._raise_if_zombie()
  File "/opt/conda/lib/python3.9/site-packages/psutil/_pslinux.py", line 1758, in _raise_if_zombie
    raise ZombieProcess(self.pid, self._name, self._ppid)
psutil.ZombieProcess: PID still exists but it's a zombie
```
The base image that we are using is 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-inference:1.13.1-cpu-py39
Hi, we are tracking this issue internally. A fix is currently in progress in https://github.com/aws/sagemaker-pytorch-inference-toolkit/pull/166. Alternatively, a quick workaround when running the DLC manually is to add the `--init` flag to the `docker run` command.
E.g.:

```shell
docker run --init --name sagemaker_pt_dlc 763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-inference-graviton:2.1.0-cpu-py310-ubuntu20.04-sagemaker serve
```
Hi @sirutBuasai,
I am working on a Batch Transform job using a PyTorch model and the `763104351884.dkr.ecr.eu-central-1.amazonaws.com/pytorch-inference:2.2.0-gpu-py310` image, and I am experiencing the same error.
I saw the new release 2.0.24 of the sagemaker-pytorch-inference-toolkit package and tried installing it on the image via requirements.txt, but I got the same error.
Hi @sirutBuasai,
I'm currently facing this exact issue when trying to deploy a PyTorch model on AWS SageMaker using torch==2.2.0.
I saw that the fix was merged into aws:master two days ago (https://github.com/aws/sagemaker-pytorch-inference-toolkit/pull/166), but my latest deployment today still fails during TorchServe startup with the error: `psutil.ZombieProcess: PID still exists but it's a zombie`.
When will this fix be available for deploying models? Regards
@conti748 @adrien-code-it I've tried putting `git+https://github.com/aws/sagemaker-pytorch-inference-toolkit.git` in requirements.txt, and it works.
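For reference, the requirements.txt packaged with the model archive would then contain a single line (the file name and placement follow the usual SageMaker inference convention; the install target is exactly the one quoted above):

```text
# Pull the toolkit straight from the repo so the merged fix is picked up
# before a release containing it reaches PyPI.
git+https://github.com/aws/sagemaker-pytorch-inference-toolkit.git
```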
Same situation as @conti748. I tried adding it to the model inference requirements.txt as suggested by @alan1420, but it didn't work.
I'm using `763104351884.dkr.ecr.eu-central-1.amazonaws.com/pytorch-inference:2.2.0-gpu-py310-cu118-ubuntu20.04-sagemaker` as the base image.
I get the same traceback as @dylanhellems: I've compared it by filenames and line numbers.
The base image in our case: `763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-inference:2.1.0-cpu-py310-ubuntu20.04-sagemaker`
We are experiencing intermittent errors on inference endpoints during cold container starts (scaling). Usually the next several requests to the endpoint succeed, but yeah, it's not stable behaviour.
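As a client-side stopgap while the container fix lands, transient cold-start failures like this can be absorbed by retrying the invocation with exponential backoff. A minimal sketch (the helper name and parameters are invented for illustration and are not part of boto3 or the SageMaker SDK; the real `invoke` callable would wrap e.g. a boto3 `invoke_endpoint` call):

```python
import time


def invoke_with_retry(invoke, max_attempts=5, base_delay=1.0):
    """Call `invoke` and retry with exponential backoff on failure.

    Intended to wrap an endpoint invocation so that intermittent
    cold-start errors are retried instead of surfacing to the caller.
    """
    for attempt in range(max_attempts):
        try:
            return invoke()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)


# Demo with a stand-in for the real endpoint call: fails twice, then succeeds.
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("cold start")
    return "ok"

print(invoke_with_retry(flaky, base_delay=0.01))  # -> ok
```

This doesn't fix the zombie-process bug itself, only smooths over the window where a freshly scaled container fails its first few requests.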
Our Dockerfile is:

```dockerfile
FROM 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-inference:2.1.0-cpu-py310-ubuntu20.04-sagemaker

# Update torch version to resolve the issue with `mmcv` and `import mmdet.apis`.
# Similar issue: https://github.com/open-mmlab/mmdetection/issues/4291
RUN pip install torch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 --index-url https://download.pytorch.org/whl/cpu

# Install the MMDetection framework.
RUN pip install -U openmim && \
    mim install mmengine && \
    mim install mmcv && \
    mim install mmdet

# Install some extra pip packages.
RUN pip install imutils sagemaker flask

# An attempt to fix a permissions issue.
RUN mkdir -p /logs && chmod -R 777 /logs

# NOTE: SageMaker in local mode overrides the SAGEMAKER_* variables.
ENV AWS_DEFAULT_REGION us-east-1

# Use a single worker for serverless mode.
ENV SAGEMAKER_MODEL_SERVER_WORKERS 1

# Cleanup.
RUN pip cache purge \
    && rm -rf /tmp/tmp* \
    && rm -rf /root/.cache

EXPOSE 8080 8081

ENTRYPOINT ["python", "/usr/local/bin/dockerd-entrypoint.py"]
CMD ["torchserve", "--start", "--ts-config", "/home/model-server/config.properties", "--model-store", "/home/model-server/"]
```
The recent releases would have solved all these issues if `sagemaker-pytorch-inference` had been updated to include the fix; instead it is still stuck at 2.0.23 :/
Hi, we are in the process of upgrading the toolkit versions in the PyTorch Inference DLCs. Please track the progress for each image here:

- PyTorch 2.2 SageMaker Inference DLC: https://github.com/aws/deep-learning-containers/pull/3984
- PyTorch 2.2 SageMaker Graviton Inference DLC: https://github.com/aws/deep-learning-containers/pull/3985
- PyTorch 2.1 SageMaker Inference DLC: https://github.com/aws/deep-learning-containers/pull/3986
- PyTorch 2.1 SageMaker Graviton Inference DLC: https://github.com/aws/deep-learning-containers/pull/3987
- PyTorch 1.13 SageMaker Inference DLC: https://github.com/aws/deep-learning-containers/pull/3988

Once the PRs are merged, I will post an update when the images are publicly released again.
@5agado @angarsky @adrien-code-it
The only solution I found was to roll back to the image `763104351884.dkr.ecr.eu-central-1.amazonaws.com/pytorch-inference:2.1.0-gpu-py310-cu118-ubuntu20.04-sagemaker-v1.8`
Hi all, patched images for PT 2.1 and PT 2.2 are released. See the linked release tags:

- PyTorch 2.2 SageMaker Inference DLC: https://github.com/aws/deep-learning-containers/releases/tag/v1.13-pt-sagemaker-2.2.0-inf-py310
- PyTorch 2.2 SageMaker Graviton Inference DLC: https://github.com/aws/deep-learning-containers/releases/tag/v1.9-pt-graviton-sagemaker-2.2.1-inf-cpu-py310
- PyTorch 2.1 SageMaker Inference DLC: https://github.com/aws/deep-learning-containers/releases/tag/v1.12-pt-sagemaker-2.1.0-inf-py310
- PyTorch 2.1 SageMaker Graviton Inference DLC: https://github.com/aws/deep-learning-containers/releases/tag/v1.10-pt-graviton-sagemaker-2.1.0-inf-cpu-py310
PT 1.13 is still WIP; I will update the release status once it is merged and built.
PT 1.13 has been released: https://github.com/aws/deep-learning-containers/releases/tag/v1.26-pt-sagemaker-1.13.1-inf-cpu-py39
All images are patched, closing issue.
Checklist
Concise Description: As of the May 22nd release of the PyTorch 2.1.0 images, our SageMaker Endpoints and Batch Transform Jobs using the new images have been failing. No obvious errors are thrown other than a `psutil.ZombieProcess: PID still exists but it's a zombie` from the `pytorch_serving` entrypoint.

DLC image/dockerfile: `763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-inference:2.1.0-cpu-py310-ubuntu20.04-sagemaker-v1.8`

Current behavior: SageMaker Endpoints and Batch Transform Jobs are failing with a `psutil.ZombieProcess: PID still exists but it's a zombie` error from the `pytorch_serving` entrypoint.

Expected behavior: SageMaker Endpoints and Batch Transform Jobs work as expected.

Additional context: We had previously been using the `2.1.0-cpu-py310` and `2.1.0-gpu-py310` images but have had to pin the images back to their May 14th releases. The error is present in both `pytorch-training` and `pytorch-inference`. We made no changes to our deployments during this time; they simply started to fail out of the blue once the new image was released. Here is the full stacktrace from a failed Batch Transform Job: