namannandan closed this 2 months ago
@namannandan should we just check the process status rather than swallowing the exception? See reference: https://github.com/aws/sagemaker-inference-toolkit/blob/master/src/sagemaker_inference/model_server.py#L276-L277
Thanks @visinfo that makes sense, updated the PR.
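The status-check approach @visinfo references could look roughly like the sketch below. This is illustrative only, not the toolkit's actual code; the helper name `find_torchserve_process` is made up for this example.

```python
import psutil

def find_torchserve_process():
    """Return the first live (non-zombie) process whose command line
    mentions 'torchserve', or None if no such process exists.

    Checks the process status explicitly instead of relying on an
    exception being raised later by cmdline().
    """
    for proc in psutil.process_iter():
        try:
            # Skip defunct processes up front rather than letting a
            # later cmdline() call raise psutil.ZombieProcess.
            if proc.status() == psutil.STATUS_ZOMBIE:
                continue
            if any("torchserve" in part for part in proc.cmdline()):
                return proc
        except (psutil.NoSuchProcess, psutil.AccessDenied, psutil.ZombieProcess):
            # The process exited, became a zombie between the two calls,
            # or is inaccessible; keep scanning the rest of the list.
            continue
    return None
```

The explicit `status()` check makes the intent clearer than a bare try/except, though the except clause is still needed to handle processes that disappear mid-iteration.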
I'm currently facing this exact issue when trying to deploy a PyTorch model in AWS SageMaker using torch==2.2.0.
I saw here that the fix was merged into aws:master two days ago, but my latest deployment still fails during torchserve startup with the error: psutil.ZombieProcess: PID still exists but it's a zombie.
When will this fix be available for deploying models? Regards
Like @adrien-code-it, I also tried with a new model on both 763104351884.dkr.ecr.eu-central-1.amazonaws.com/pytorch-inference:2.1.0-gpu-py310-cu118-ubuntu20.04-sagemaker and 763104351884.dkr.ecr.eu-central-1.amazonaws.com/pytorch-inference:2.2.0-gpu-py310-cu118-ubuntu20.04-sagemaker, and I still get the error.
@namannandan, @visinfo: is there something we need to do to deploy using the update? Or when will it be distributed to all instances?
@5agado I was able to deploy my model by adding a requirements.txt file alongside my inference.py file, specifying that pip should install the latest sagemaker-pytorch-inference-toolkit:
git+https://github.com/aws/sagemaker-pytorch-inference-toolkit.git
Although it's not a permanent solution (I would prefer pulling a pinned version rather than the latest), it's working as of now. Moreover, when deploying, it still fails once but then succeeds the second time.
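For what it's worth, pip requirement specifiers can pin a git dependency to a specific ref, which would address the concern about pulling the latest. A requirements.txt line of this shape (where `<commit-sha>` is a placeholder for the commit you want, not a value from this thread) should work:

```
git+https://github.com/aws/sagemaker-pytorch-inference-toolkit.git@<commit-sha>
```

A tag or branch name can be used after the `@` as well.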
@adrien-code-it are you deploying the model as an endpoint, or using it in batch transform? I tried the same with the latter, but it doesn't work for me (I think it's related to the "succeeds the second time" aspect you mention).
@5agado the fix in requirements.txt seems to only work when deploying the model as an endpoint (for inference, in my case) :(
For batch transform, unfortunately, I haven't seen any fix work... Maybe @namannandan has a solution?
Description of changes: When checking whether the TorchServe process is running, we iterate through the current list of running processes using psutil: https://github.com/aws/sagemaker-pytorch-inference-toolkit/blob/36a842e374766a088f21906ce17496e88a140a1b/src/sagemaker_pytorch_serving_container/torchserve.py#L183-L188
Calling the cmdline() psutil API on a zombie process raises the psutil.ZombieProcess exception. This unhandled exception causes TorchServe to be stopped, which is not the expected behavior in DLC: https://github.com/aws/deep-learning-containers/tree/master/pytorch/inference
We can ignore zombie processes when detecting the presence of a running TorchServe process. Reference: https://psutil.readthedocs.io/en/latest/#psutil.ZombieProcess
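The fix described above amounts to skipping zombie processes while scanning for a live TorchServe process. A minimal sketch of that pattern is below; the function name `torchserve_is_running` is illustrative and not the toolkit's actual API.

```python
import psutil

def torchserve_is_running():
    """Return True if a live process with 'torchserve' in its command
    line exists, ignoring zombie and vanished processes."""
    for proc in psutil.process_iter():
        try:
            if "torchserve" in " ".join(proc.cmdline()):
                return True
        except psutil.ZombieProcess:
            # PID still exists but the process is defunct; it is not a
            # live server, so skip it instead of crashing the health check.
            continue
        except (psutil.NoSuchProcess, psutil.AccessDenied):
            # Process exited mid-iteration or is unreadable; keep scanning.
            continue
    return False
```

Without the `psutil.ZombieProcess` handler, a single defunct child anywhere on the system would raise out of the loop and make the health check fail, which is exactly the container-termination behavior this PR addresses.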
Tests:
TorchServe continues to run and the container does not get terminated.
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.