aws / sagemaker-inference-toolkit

Serve machine learning models within a 🐳 Docker container using 🧠 Amazon SageMaker.
Apache License 2.0

psutil 5.9.6 seems to be throwing ZombieProcess when retrieving the mms process #132

Open charlietruong-wk opened 8 months ago

charlietruong-wk commented 8 months ago

Describe the bug

We use a custom image for our SageMaker endpoint, and on Friday, Oct 20, 2023, we experienced instability in our endpoint after re-deploying. It seems that the latest version of psutil, 5.9.6, throws ZombieProcess more frequently, causing the server to restart. This causes the endpoint to occasionally return non-200 responses when predictions are requested.

The change in psutil may come from this fix on their end, which affects what they recognize as a ZombieProcess: https://github.com/giampaolo/psutil/pull/2288

We were able to resolve our issue by rolling back to psutil 5.9.5. So, I'm unsure if sagemaker-inference should pin the version of psutil in your package or if the fix needs to be done here:

https://github.com/aws/sagemaker-inference-toolkit/blob/master/src/sagemaker_inference/model_server.py#L276
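
If the fix were made on the toolkit side, one option would be to skip processes that cannot be inspected instead of letting the exception abort the whole scan. The following is only a minimal sketch of that idea, not the toolkit's actual code: `find_processes_by_cmdline`, `StubProcess` and `StubZombie` are hypothetical names, the stubs stand in for psutil objects so the example runs without psutil installed, and the `MMS_NAMESPACE` value is an assumption that should be checked against model_server.py.

```python
def find_processes_by_cmdline(processes, needle, skippable):
    """Return the processes whose command line contains `needle`.

    A process whose cmdline() raises one of the `skippable` exception
    types (for example because it became a zombie mid-scan) is ignored
    instead of aborting the whole scan. Other exceptions still propagate.
    """
    matches = []
    for process in processes:
        try:
            if needle in process.cmdline():
                matches.append(process)
        except skippable:
            continue
    return matches


# Hypothetical stand-ins for psutil.ZombieProcess and psutil.Process.
class StubZombie(Exception):
    pass


class StubProcess:
    def __init__(self, pid, cmdline=None, error=None):
        self.pid = pid
        self._cmdline = cmdline or []
        self._error = error

    def cmdline(self):
        # Mimic psutil: either return the argument list or raise.
        if self._error is not None:
            raise self._error
        return self._cmdline


# Assumed value, used here for illustration only.
MMS_NAMESPACE = "com.amazonaws.ml.mms.ModelServer"

processes = [
    StubProcess(1, ["sh", "-c", "sleep 1"]),
    StubProcess(2, error=StubZombie()),  # raises when inspected
    StubProcess(3, ["java", "-cp", "server.jar", MMS_NAMESPACE]),
]
found = find_processes_by_cmdline(processes, MMS_NAMESPACE, (StubZombie,))
print([p.pid for p in found])
```

With psutil itself, the call would presumably look like `find_processes_by_cmdline(psutil.process_iter(), MMS_NAMESPACE, (psutil.ZombieProcess, psutil.NoSuchProcess, psutil.AccessDenied))`. Whether skipping zombies is the right behavior here, as opposed to pinning psutil, is the open question.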

To reproduce

Create a custom SageMaker endpoint image with psutil 5.9.6 and deploy it.

Expected behavior

The model endpoint is stable and consistently returns successful predictions, and the ZombieProcess exception is not raised frequently.

Screenshots or logs

Here is the traceback we are seeing:

  File "/usr/local/lib/python3.8/site-packages/sagemaker_inference/model_server.py", line 99, in start_model_server
    mms_process = _retry_retrieve_mms_server_process(env.startup_timeout)
  File "/usr/local/lib/python3.8/site-packages/sagemaker_inference/model_server.py", line 199, in _retry_retrieve_mms_server_process
    return retrieve_mms_server_process()
  File "/usr/local/lib/python3.8/site-packages/retrying.py", line 49, in wrapped_f
    return Retrying(*dargs, **dkw).call(f, *args, **kw)
  File "/usr/local/lib/python3.8/site-packages/retrying.py", line 212, in call
    raise attempt.get()
  File "/usr/local/lib/python3.8/site-packages/retrying.py", line 247, in get
    six.reraise(self.value[0], self.value[1], self.value[2])
  File "/usr/local/lib/python3.8/site-packages/six.py", line 719, in reraise
    raise value
  File "/usr/local/lib/python3.8/site-packages/retrying.py", line 200, in call
    attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
  File "/usr/local/lib/python3.8/site-packages/sagemaker_inference/model_server.py", line 206, in _retrieve_mms_server_process
    if MMS_NAMESPACE in process.cmdline():
  File "/usr/local/lib64/python3.8/site-packages/psutil/__init__.py", line 702, in cmdline
    return self._proc.cmdline()
  File "/usr/local/lib64/python3.8/site-packages/psutil/_pslinux.py", line 1650, in wrapper
    return fun(self, *args, **kwargs)
  File "/usr/local/lib64/python3.8/site-packages/psutil/_pslinux.py", line 1788, in cmdline
    self._raise_if_zombie()
  File "/usr/local/lib64/python3.8/site-packages/psutil/_pslinux.py", line 1693, in _raise_if_zombie
    raise ZombieProcess(self.pid, self._name, self._ppid)

System information

Additional context

n/a

parthvadhadiya commented 8 months ago

I am having the same issue with sagemaker-inference 1.10.1 and multi-model-server 1.1.11.

andre-marcos-perez commented 8 months ago

Same problem with sagemaker-inference 1.7.1 and multi-model-server 1.1.8.

parthvadhadiya commented 8 months ago

Try updating the Python version as well; I also updated the Ubuntu version of my Docker image. @andre-marcos-perez

andre-marcos-perez commented 8 months ago

Hey, installing psutil version 5.9.5 first worked.

RUN pip3 install --upgrade pip && \
    pip3 install multi-model-server==1.1.8 && \
    pip3 install psutil==5.9.5 && \
    pip3 install sagemaker-inference==1.7.1

andre-marcos-perez commented 8 months ago

Likely solved by https://github.com/aws/sagemaker-inference-toolkit/pull/133