aws / deep-learning-containers

AWS Deep Learning Containers (DLCs) are a set of Docker images for training and serving models in TensorFlow, TensorFlow 2, PyTorch, and MXNet.
https://docs.aws.amazon.com/deep-learning-containers/latest/devguide/deep-learning-containers-images.html

[bug] Recent PyTorch images causing Zombie Process #3965

Closed dylanhellems closed 3 months ago

dylanhellems commented 3 months ago

Concise Description: As of the May 22nd release of the PyTorch 2.1.0 images, our SageMaker Endpoints and Batch Transform Jobs using the new images have been failing. No obvious errors are thrown other than a psutil.ZombieProcess: PID still exists but it's a zombie error raised from the pytorch_serving entrypoint.

DLC image/dockerfile: 763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-inference:2.1.0-cpu-py310-ubuntu20.04-sagemaker-v1.8

Current behavior: SageMaker Endpoints and Batch Transform Jobs are failing with a psutil.ZombieProcess: PID still exists but it's a zombie error from the pytorch_serving entrypoint.

Expected behavior: SageMaker Endpoints and Batch Transform Jobs work as expected.

Additional context: We had previously been using the 2.1.0-cpu-py310 and 2.1.0-gpu-py310 images but have had to pin the images back to their May 14th releases (see the pinning sketch below). The error is present in both pytorch-training and pytorch-inference. We made no changes to our deployments during this time; they simply started failing out of the blue once the new image was released.
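For reference, here is a rough sketch of how the image can be pinned with the SageMaker Python SDK (v2) so a job keeps using a verified tag instead of silently picking up the newest release. The S3 paths, role ARN, entry point, and the exact pinned tag are all placeholders:

from sagemaker.pytorch import PyTorchModel

# Placeholders throughout: bucket, role ARN, entry point, and the pinned tag.
model = PyTorchModel(
    model_data="s3://my-bucket/model.tar.gz",
    role="arn:aws:iam::111122223333:role/MySageMakerRole",
    entry_point="inference.py",
    image_uri=(
        "763104351884.dkr.ecr.us-west-2.amazonaws.com/"
        "pytorch-inference:2.1.0-cpu-py310-ubuntu20.04-sagemaker-v1.X"  # substitute the verified May 14th tag
    ),
)

# Batch Transform against the pinned image.
transformer = model.transformer(instance_count=1, instance_type="ml.m5.xlarge")
transformer.transform(data="s3://my-bucket/batch-input/", content_type="application/json")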

Here is the full stacktrace from a failed Batch Transform Job:

Traceback (most recent call last):
  File "/usr/local/bin/dockerd-entrypoint.py", line 23, in <module>
    serving.main()
  File "/opt/conda/lib/python3.10/site-packages/sagemaker_pytorch_serving_container/serving.py", line 38, in main
    _start_torchserve()
  File "/opt/conda/lib/python3.10/site-packages/retrying.py", line 56, in wrapped_f
    return Retrying(*dargs, **dkw).call(f, *args, **kw)
  File "/opt/conda/lib/python3.10/site-packages/retrying.py", line 257, in call
    return attempt.get(self._wrap_exception)
  File "/opt/conda/lib/python3.10/site-packages/retrying.py", line 301, in get
    six.reraise(self.value[0], self.value[1], self.value[2])
  File "/opt/conda/lib/python3.10/site-packages/six.py", line 719, in reraise
    raise value
  File "/opt/conda/lib/python3.10/site-packages/retrying.py", line 251, in call
    attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
  File "/opt/conda/lib/python3.10/site-packages/sagemaker_pytorch_serving_container/serving.py", line 34, in _start_torchserve
    torchserve.start_torchserve(handler_service=HANDLER_SERVICE)
  File "/opt/conda/lib/python3.10/site-packages/sagemaker_pytorch_serving_container/torchserve.py", line 102, in start_torchserve
    ts_process = _retrieve_ts_server_process()
  File "/opt/conda/lib/python3.10/site-packages/retrying.py", line 56, in wrapped_f
    return Retrying(*dargs, **dkw).call(f, *args, **kw)
  File "/opt/conda/lib/python3.10/site-packages/retrying.py", line 266, in call
    raise attempt.get()
  File "/opt/conda/lib/python3.10/site-packages/retrying.py", line 301, in get
    six.reraise(self.value[0], self.value[1], self.value[2])
  File "/opt/conda/lib/python3.10/site-packages/six.py", line 719, in reraise
    raise value
  File "/opt/conda/lib/python3.10/site-packages/retrying.py", line 251, in call
    attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
  File "/opt/conda/lib/python3.10/site-packages/sagemaker_pytorch_serving_container/torchserve.py", line 187, in _retrieve_ts_server_process
    if TS_NAMESPACE in process.cmdline():
  File "/opt/conda/lib/python3.10/site-packages/psutil/__init__.py", line 719, in cmdline
    raise ImportError(msg)
  File "/opt/conda/lib/python3.10/site-packages/psutil/_pslinux.py", line 1714, in wrapper
    ret['name'] = name
  File "/opt/conda/lib/python3.10/site-packages/psutil/_pslinux.py", line 1853, in cmdline
    stime = float(values['stime']) / CLOCK_TICKS
  File "/opt/conda/lib/python3.10/site-packages/psutil/_pslinux.py", line 1758, in _raise_if_zombie
    name = decode(name)

psutil.ZombieProcess: PID still exists but it's a zombie (pid=104)
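For anyone digging into the exception itself: psutil raises ZombieProcess when it inspects a process that has exited but was never reaped by its parent. A minimal standalone repro, assuming Linux with psutil installed and nothing SageMaker-specific:

import os
import time

import psutil

pid = os.fork()
if pid == 0:
    os._exit(0)  # child exits immediately; the parent deliberately does not reap it yet

time.sleep(1)  # the child is now a <defunct> zombie entry in the process table
try:
    psutil.Process(pid).cmdline()
except psutil.ZombieProcess as exc:
    print(exc)  # PID still exists but it's a zombie (pid=...)

os.waitpid(pid, 0)  # reap the child so the zombie disappears

In the container, the process that should be doing the reaping is PID 1 (the entrypoint), which is consistent with the --init workaround suggested further down the thread.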
greeshmaPr commented 3 months ago

We too are facing the same error. The traceback is:

['torchserve', '--start', '--model-store', '/.sagemaker/ts/models', '--ts-config', '/etc/sagemaker-ts.properties', '--log-config', '/opt/conda/lib/python3.9/site-packages/sagemaker_pytorch_serving_container/etc/log4j2.xml', '--models', 'model=/opt/ml/model']
    serving.main()
  File "/opt/conda/lib/python3.9/site-packages/sagemaker_pytorch_serving_container/serving.py", line 38, in main
    _start_torchserve()
  File "/opt/conda/lib/python3.9/site-packages/retrying.py", line 56, in wrapped_f
    return Retrying(*dargs, **dkw).call(f, *args, **kw)
  File "/opt/conda/lib/python3.9/site-packages/retrying.py", line 257, in call
    return attempt.get(self._wrap_exception)
  File "/opt/conda/lib/python3.9/site-packages/retrying.py", line 301, in get
    six.reraise(self.value[0], self.value[1], self.value[2])
  File "/opt/conda/lib/python3.9/site-packages/six.py", line 719, in reraise
    raise value
  File "/opt/conda/lib/python3.9/site-packages/retrying.py", line 251, in call
    attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
  File "/opt/conda/lib/python3.9/site-packages/sagemaker_pytorch_serving_container/serving.py", line 34, in _start_torchserve
    torchserve.start_torchserve(handler_service=HANDLER_SERVICE)
  File "/opt/conda/lib/python3.9/site-packages/sagemaker_pytorch_serving_container/torchserve.py", line 102, in start_torchserve
    ts_process = _retrieve_ts_server_process()
  File "/opt/conda/lib/python3.9/site-packages/retrying.py", line 56, in wrapped_f
    return Retrying(*dargs, **dkw).call(f, *args, **kw)
  File "/opt/conda/lib/python3.9/site-packages/retrying.py", line 266, in call
    raise attempt.get()
  File "/opt/conda/lib/python3.9/site-packages/retrying.py", line 301, in get
    six.reraise(self.value[0], self.value[1], self.value[2])
  File "/opt/conda/lib/python3.9/site-packages/six.py", line 719, in reraise
    raise value
  File "/opt/conda/lib/python3.9/site-packages/retrying.py", line 251, in call
    attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
  File "/opt/conda/lib/python3.9/site-packages/sagemaker_pytorch_serving_container/torchserve.py", line 187, in _retrieve_ts_server_process
    if TS_NAMESPACE in process.cmdline():
  File "/opt/conda/lib/python3.9/site-packages/psutil/__init__.py", line 719, in cmdline
    return self._proc.cmdline()
  File "/opt/conda/lib/python3.9/site-packages/psutil/_pslinux.py", line 1714, in wrapper
    return fun(self, *args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/psutil/_pslinux.py", line 1853, in cmdline
    self._raise_if_zombie()
  File "/opt/conda/lib/python3.9/site-packages/psutil/_pslinux.py", line 1758, in _raise_if_zombie
    raise ZombieProcess(self.pid, self._name, self._ppid)

The base image that we are using is 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-inference:1.13.1-cpu-py39

sirutBuasai commented 3 months ago

Hi, we are tracking this issue internally. The current fix is in progress in https://github.com/aws/sagemaker-pytorch-inference-toolkit/pull/166. Alternatively, a quick workaround when running the DLC manually is to add the --init flag to the docker run command, e.g.:

docker run --init --name sagemaker_pt_dlc 763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-inference-graviton:2.1.0-cpu-py310-ubuntu20.04-sagemaker serve
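For context, --init runs a small init process as PID 1 that reaps orphaned children, so exited processes never linger as zombies for psutil to trip over. A rough, purely illustrative Python sketch of that reaping behaviour (not the DLC's actual entrypoint; the launched command is only a stand-in):

import os
import subprocess
import sys

# Launch the real workload; the command here is just a stand-in.
main = subprocess.Popen(sys.argv[1:] or ["torchserve", "--start"])

# Acting as PID 1: keep reaping every child that exits so none become zombies,
# and remember the exit status of the main child.
exit_code = 0
while True:
    try:
        pid, status = os.wait()
    except ChildProcessError:
        break  # no children left to wait for
    if pid == main.pid:
        exit_code = os.waitstatus_to_exitcode(status)
sys.exit(exit_code)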
conti748 commented 3 months ago

Hi @sirutBuasai,

I am working on a Batch Transform job using a PyTorch model with the 763104351884.dkr.ecr.eu-central-1.amazonaws.com/pytorch-inference:2.2.0-gpu-py310 image, and I am experiencing the same error.

I saw the new 2.0.24 release of the sagemaker-pytorch-inference-toolkit package and tried installing it on the image via requirements.txt, but I got the same error.

adrien-code-it commented 3 months ago

Hi @sirutBuasai,

I'm currently facing this exact issue when trying to deploy a pytorch model in AWS Sagemaker using torch==2.2.0.

I saw here that the fix was merged into aws:master two days ago (https://github.com/aws/sagemaker-pytorch-inference-toolkit/pull/166); however, my latest deployment today still fails during TorchServe startup with the error: psutil.ZombieProcess: PID still exists but it's a zombie.

When will this fix be available for deploying models? Regards

alan1420 commented 3 months ago

@conti748 @adrien-code-it I've tried putting git+https://github.com/aws/sagemaker-pytorch-inference-toolkit.git in requirements.txt and it works

5agado commented 3 months ago

Same situation as @conti748. I tried adding it to the model inference requirements as suggested by @alan1420, but it didn't work. Using 763104351884.dkr.ecr.eu-central-1.amazonaws.com/pytorch-inference:2.2.0-gpu-py310-cu118-ubuntu20.04-sagemaker as the base image.

angarsky commented 3 months ago

I get the same traceback as @dylanhellems; I compared them by file names and line numbers.

Base image in our case: 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-inference:2.1.0-cpu-py310-ubuntu20.04-sagemaker

We are experiencing intermittent errors on inference endpoints during cold container starts (scaling). Usually the next few requests to the endpoint go through, but it's not stable behaviour.

Our Dockerfile is:

FROM 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-inference:2.1.0-cpu-py310-ubuntu20.04-sagemaker

# Update torch version to resolve the issue with `mmcv` and `import mmdet.apis`.
# Similar issue: https://github.com/open-mmlab/mmdetection/issues/4291
RUN pip install torch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 --index-url https://download.pytorch.org/whl/cpu

# Install MMDetection framework.
RUN pip install -U openmim && \
  mim install mmengine && \
  mim install mmcv && \
  mim install mmdet

# Install some extra pip packages.
RUN pip install imutils sagemaker flask

# An attempt to fix permissions issue.
RUN mkdir -p /logs && chmod -R 777 /logs

# NOTE: SageMaker in a local mode overrides the SAGEMAKER_* variables.
ENV AWS_DEFAULT_REGION us-east-1

# Use single worker for a serverless mode.
ENV SAGEMAKER_MODEL_SERVER_WORKERS 1

# Cleanup
RUN pip cache purge \
  && rm -rf /tmp/tmp* \
  && rm -iRf /root/.cache

EXPOSE 8080 8081
ENTRYPOINT ["python", "/usr/local/bin/dockerd-entrypoint.py"]
CMD ["torchserve", "--start", "--ts-config", "/home/model-server/config.properties", "--model-store", "/home/model-server/"]
5agado commented 3 months ago

The recent releases would have solved all the issues if sagemaker-pytorch-inference-toolkit had been updated to include the fix; instead it is still stuck at 2.0.23 :/

sirutBuasai commented 3 months ago

Hi, we are in the process of upgrading the toolkit version in the PyTorch Inference DLCs. Please track the progress for each image here:

PyTorch 2.2 SageMaker Inference DLC: https://github.com/aws/deep-learning-containers/pull/3984
PyTorch 2.2 SageMaker Graviton Inference DLC: https://github.com/aws/deep-learning-containers/pull/3985
PyTorch 2.1 SageMaker Inference DLC: https://github.com/aws/deep-learning-containers/pull/3986
PyTorch 2.1 SageMaker Graviton Inference DLC: https://github.com/aws/deep-learning-containers/pull/3987
PyTorch 1.13 SageMaker Inference DLC: https://github.com/aws/deep-learning-containers/pull/3988

Once PRs are merged, I will update when the images are publicly released again.

conti748 commented 3 months ago

@5agado @angarsky @adrien-code-it The only solution I found was to roll back to the image 763104351884.dkr.ecr.eu-central-1.amazonaws.com/pytorch-inference:2.1.0-gpu-py310-cu118-ubuntu20.04-sagemaker-v1.8

sirutBuasai commented 3 months ago

Hi all, patched images for PT 2.1 and PT 2.2 are released. See the linked release tags:

PyTorch 2.2 SageMaker Inference DLC: https://github.com/aws/deep-learning-containers/releases/tag/v1.13-pt-sagemaker-2.2.0-inf-py310
PyTorch 2.2 SageMaker Graviton Inference DLC: https://github.com/aws/deep-learning-containers/releases/tag/v1.9-pt-graviton-sagemaker-2.2.1-inf-cpu-py310
PyTorch 2.1 SageMaker Inference DLC: https://github.com/aws/deep-learning-containers/releases/tag/v1.12-pt-sagemaker-2.1.0-inf-py310
PyTorch 2.1 SageMaker Graviton Inference DLC: https://github.com/aws/deep-learning-containers/releases/tag/v1.10-pt-graviton-sagemaker-2.1.0-inf-cpu-py310

PT 1.13 is still WIP; I will update the release status once it is merged and built.

sirutBuasai commented 3 months ago

PT 1.13 has been released: https://github.com/aws/deep-learning-containers/releases/tag/v1.26-pt-sagemaker-1.13.1-inf-cpu-py39

All images are patched, closing issue.