Open charlietruong-wk opened 8 months ago
I am having the same issue with sagemaker-inference 1.10.1 and multi-model-server 1.1.11.
Same problem with sagemaker-inference 1.7.1 and multi-model-server 1.1.8.
Try updating the Python version as well; I also updated the Ubuntu version of my Docker image. @andre-marcos-perez
Hey, installing psutil version 5.9.5 first worked.
RUN pip3 install --upgrade pip && \
pip3 install multi-model-server==1.1.8 && \
pip3 install psutil==5.9.5 && \
pip3 install sagemaker-inference==1.7.1
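To confirm that the pins above actually took effect inside the built image, a small check along these lines could be run at container start. This is only a sketch: `EXPECTED_PINS` and `find_pin_mismatches` are hypothetical names introduced here, and the version numbers are simply copied from the Dockerfile snippet above.

```python
from importlib import metadata

# Pins copied from the Dockerfile snippet above (adjust to your own image).
EXPECTED_PINS = {
    "multi-model-server": "1.1.8",
    "psutil": "5.9.5",
    "sagemaker-inference": "1.7.1",
}


def find_pin_mismatches(expected, lookup=metadata.version):
    """Return {package: (expected, installed)} for every unsatisfied pin.

    `installed` is None when the package is not installed at all.
    `lookup` defaults to importlib.metadata.version and can be replaced
    with a stub when testing.
    """
    mismatches = {}
    for package, wanted in expected.items():
        try:
            installed = lookup(package)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != wanted:
            mismatches[package] = (wanted, installed)
    return mismatches
```

Calling `find_pin_mismatches(EXPECTED_PINS)` in the image should return an empty dict if pip did not upgrade psutil as a side effect of a later install.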
Likely solved by https://github.com/aws/sagemaker-inference-toolkit/pull/133
Describe the bug We use a custom image for our SageMaker endpoint, and on Friday, Oct 20, 2023, we experienced instability in our endpoint after re-deploying. It seems that the latest version of psutil, 5.9.6, raises ZombieProcess more frequently, causing the server to restart. As a result, the endpoint occasionally returns non-200 responses when predictions are requested.
The change in psutil may be this fix to what it recognizes as a ZombieProcess: https://github.com/giampaolo/psutil/pull/2288
We were able to resolve our issue by rolling back to psutil 5.9.5. So, I'm unsure if sagemaker-inference should pin the version of psutil in your package or if the fix needs to be done here:
https://github.com/aws/sagemaker-inference-toolkit/blob/master/src/sagemaker_inference/model_server.py#L276
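If the fix were made on the toolkit side rather than by pinning, one possible shape is to skip processes whose command line cannot be read instead of letting the exception propagate. The sketch below is an illustration under assumptions, not the toolkit's actual code: `find_pids_by_cmdline` is a hypothetical helper, and the process objects are only assumed to expose `cmdline()` and `pid` the way `psutil.Process` does.

```python
def find_pids_by_cmdline(processes, marker, ignored_errors):
    """Return the PIDs of processes whose command line contains `marker`.

    processes      -- iterable of objects with a `cmdline()` method and a
                      `pid` attribute (for example psutil.process_iter()).
    marker         -- a single command-line argument to look for.
    ignored_errors -- tuple of exception types that mean "this process
                      cannot be inspected"; such processes are skipped.
    """
    pids = []
    for process in processes:
        try:
            cmdline = process.cmdline()
        except ignored_errors:
            # Zombie or already-exited processes cannot be the live server,
            # so skip them rather than failing the whole scan.
            continue
        if marker in cmdline:
            pids.append(process.pid)
    return pids
```

With psutil this could be called as `find_pids_by_cmdline(psutil.process_iter(), marker, (psutil.ZombieProcess, psutil.NoSuchProcess, psutil.AccessDenied))`, where `marker` stands for whatever argument identifies the model server process in your image.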
To reproduce Create a custom SageMaker endpoint image with psutil 5.9.6 and deploy it.
Expected behavior The model endpoint is stable and consistently returns successful predictions, and the ZombieProcess exception is not raised frequently.
Screenshots or logs Here is a traceback we are seeing:
System information
Additional context n/a