aws / sagemaker-pytorch-inference-toolkit

Toolkit for inference and serving with PyTorch on SageMaker. Dockerfiles used for building SageMaker PyTorch containers are at https://github.com/aws/deep-learning-containers.

Zombie process exception #165

Open 5agado opened 1 month ago

5agado commented 1 month ago

Describe the bug
Getting a zombie process exception, as already reported for the sagemaker-inference-toolkit.

To reproduce
Using 763104351884.dkr.ecr.eu-central-1.amazonaws.com/pytorch-inference:2.2.0-gpu-py310-cu118-ubuntu20.04-sagemaker with a custom inference script in a batch transform triggers this error. Even a simple initial time.sleep(60) in the inference.py script is enough to trigger it. A custom requirements.txt file also needs to be provided alongside the custom inference script.
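For illustration, a minimal inference.py along these lines reproduces the setup described above (a sketch only; the handler names follow the toolkit's default model_fn/predict_fn contract and the bodies are placeholders):

```python
# inference.py -- minimal sketch reproducing the reported setup:
# a module-level delay at import time is enough to trigger the error.
import time

time.sleep(60)  # simulate slow startup, as described above


def model_fn(model_dir):
    # placeholder: a real script would load and return a torch model here
    return None


def predict_fn(input_data, model):
    # placeholder: echo the input back; the body is irrelevant to the bug
    return input_data
```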

Here is the full traceback:

Traceback (most recent call last):
  File "/usr/local/bin/dockerd-entrypoint.py", line 23, in <module>
    serving.main()
  File "/opt/conda/lib/python3.10/site-packages/sagemaker_pytorch_serving_container/serving.py", line 38, in main
    _start_torchserve()
  File "/opt/conda/lib/python3.10/site-packages/retrying.py", line 56, in wrapped_f
    return Retrying(*dargs, **dkw).call(f, *args, **kw)
  File "/opt/conda/lib/python3.10/site-packages/retrying.py", line 257, in call
    return attempt.get(self._wrap_exception)
  File "/opt/conda/lib/python3.10/site-packages/retrying.py", line 301, in get
    six.reraise(self.value[0], self.value[1], self.value[2])
  File "/opt/conda/lib/python3.10/site-packages/six.py", line 719, in reraise
    raise value
  File "/opt/conda/lib/python3.10/site-packages/retrying.py", line 251, in call
    attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
  File "/opt/conda/lib/python3.10/site-packages/sagemaker_pytorch_serving_container/serving.py", line 34, in _start_torchserve
    torchserve.start_torchserve(handler_service=HANDLER_SERVICE)
  File "/opt/conda/lib/python3.10/site-packages/sagemaker_pytorch_serving_container/torchserve.py", line 102, in start_torchserve
    ts_process = _retrieve_ts_server_process()
  File "/opt/conda/lib/python3.10/site-packages/retrying.py", line 56, in wrapped_f
    return Retrying(*dargs, **dkw).call(f, *args, **kw)
  File "/opt/conda/lib/python3.10/site-packages/retrying.py", line 266, in call
    raise attempt.get()
  File "/opt/conda/lib/python3.10/site-packages/retrying.py", line 301, in get
    six.reraise(self.value[0], self.value[1], self.value[2])
  File "/opt/conda/lib/python3.10/site-packages/six.py", line 719, in reraise
    raise value
  File "/opt/conda/lib/python3.10/site-packages/retrying.py", line 251, in call
    attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
  File "/opt/conda/lib/python3.10/site-packages/sagemaker_pytorch_serving_container/torchserve.py", line 187, in _retrieve_ts_server_process
    if TS_NAMESPACE in process.cmdline():
  File "/opt/conda/lib/python3.10/site-packages/psutil/__init__.py", line 719, in cmdline
    return self._proc.cmdline()
  File "/opt/conda/lib/python3.10/site-packages/psutil/_pslinux.py", line 1714, in wrapper
    return fun(self, *args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/psutil/_pslinux.py", line 1853, in cmdline
    self._raise_if_zombie()
  File "/opt/conda/lib/python3.10/site-packages/psutil/_pslinux.py", line 1758, in _raise_if_zombie
    raise ZombieProcess(self.pid, self._name, self._ppid)

System information
Not provided.

rauldiaz commented 1 month ago

Hi! Any luck solving this? I am in the same situation with the pytorch-inference:2.2.0-cpu-py310 image. In serverless mode I get the same error; I can, however, deploy it to a real-time endpoint.

5agado commented 1 month ago

@rauldiaz this was the fix, but it wasn't properly propagated to the instances.
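For context, the traceback shows _retrieve_ts_server_process crashing because process.cmdline() is called on a process that has already become a zombie. The fix referenced above is of this general shape (a sketch only, not the toolkit's actual patch; the TS_NAMESPACE value is assumed from the toolkit source):

```python
import psutil

# TorchServe's Java entry point; value assumed from the toolkit source.
TS_NAMESPACE = "org.pytorch.serve.ModelServer"


def retrieve_ts_server_process():
    # Skip processes that disappear or turn into zombies while scanning,
    # instead of letting psutil.ZombieProcess propagate and abort startup.
    for process in psutil.process_iter():
        try:
            if TS_NAMESPACE in process.cmdline():
                return process
        except (psutil.ZombieProcess, psutil.NoSuchProcess, psutil.AccessDenied):
            continue
    raise RuntimeError("TorchServe model server process not found")
```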

The recent new releases would have solved all the issues if they had updated sagemaker-pytorch-inference; instead it is still stuck at 2.0.23 :/

See also the ongoing conversation here.

Itto1992 commented 1 month ago

I am using 763104351884.dkr.ecr.ap-northeast-1.amazonaws.com/pytorch-inference:2.2.0-cpu-py310-ubuntu20.04-sagemaker-v1.12 in serverless mode and I encountered this error. I also tried an older image (763104351884.dkr.ecr.ap-northeast-1.amazonaws.com/pytorch-inference:2.1.0-cpu-py310-ubuntu20.04-sagemaker-v1.8), but invoking inference still failed. Has anyone solved this problem?

Itto1992 commented 1 month ago

> I am using 763104351884.dkr.ecr.ap-northeast-1.amazonaws.com/pytorch-inference:2.2.0-cpu-py310-ubuntu20.04-sagemaker-v1.12 in serverless mode and I encountered this error. I also tried an older image (763104351884.dkr.ecr.ap-northeast-1.amazonaws.com/pytorch-inference:2.1.0-cpu-py310-ubuntu20.04-sagemaker-v1.8), but invoking inference still failed. Has anyone solved this problem?

Hi there,

I wanted to report that after a night, the issue seemed to resolve itself, and everything was working fine. However, when I updated the endpoint, the same error occurred again. Is this a known issue that tends to happen shortly after deployment?

Thanks!