aws / sagemaker-pytorch-inference-toolkit

Toolkit for inference and serving with PyTorch on SageMaker. Dockerfiles used for building SageMaker PyTorch Containers are at https://github.com/aws/deep-learning-containers.
Apache License 2.0

Ignore zombie processes when detecting TorchServe status #166

Closed namannandan closed 2 months ago

namannandan commented 2 months ago

Description of changes: When checking to see if the TorchServe process is running, we iterate through the current list of running processes using psutil: https://github.com/aws/sagemaker-pytorch-inference-toolkit/blob/36a842e374766a088f21906ce17496e88a140a1b/src/sagemaker_pytorch_serving_container/torchserve.py#L183-L188

Calling the cmdline() psutil API on a zombie process raises the psutil.ZombieProcess exception. This unhandled exception causes TorchServe to be stopped, which is not the expected behavior in the DLC: https://github.com/aws/deep-learning-containers/tree/master/pytorch/inference

  File "/opt/conda/lib/python3.10/site-packages/sagemaker_pytorch_serving_container/torchserve.py", line 187, in _retrieve_ts_server_process
    if TS_NAMESPACE in process.cmdline():
  File "/opt/conda/lib/python3.10/site-packages/psutil/__init__.py", line 719, in cmdline
    raise ImportError(msg)
  File "/opt/conda/lib/python3.10/site-packages/psutil/_pslinux.py", line 1714, in wrapper
    ret['name'] = name
  File "/opt/conda/lib/python3.10/site-packages/psutil/_pslinux.py", line 1853, in cmdline
    stime = float(values['stime']) / CLOCK_TICKS
  File "/opt/conda/lib/python3.10/site-packages/psutil/_pslinux.py", line 1758, in _raise_if_zombie
    name = decode(name)
psutil.ZombieProcess: PID still exists but it's a zombie (pid=9)

We can ignore zombie processes when detecting the presence of a running TorchServe process. Reference: https://psutil.readthedocs.io/en/latest/#psutil.ZombieProcess
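
A minimal sketch of the kind of change described above, assuming the loop structure of the linked _retrieve_ts_server_process (the TS_NAMESPACE value and the return value are assumptions based on that source; the merged change may differ in detail):

    import psutil

    # Value assumed from the toolkit's torchserve.py; TorchServe's frontend
    # runs under this Java namespace.
    TS_NAMESPACE = "org.pytorch.serve.ModelServer"

    def _retrieve_ts_server_process():
        ts_server_processes = []
        for process in psutil.process_iter():
            try:
                # cmdline() raises psutil.ZombieProcess for defunct processes;
                # skip them instead of letting the exception propagate and
                # stop TorchServe.
                if TS_NAMESPACE in process.cmdline():
                    ts_server_processes.append(process)
            except psutil.ZombieProcess:
                continue
        return ts_server_processes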

Tests:

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

visinfo commented 2 months ago

@namannandan should we just check the process status rather than swallowing the exception? See reference https://github.com/aws/sagemaker-inference-toolkit/blob/master/src/sagemaker_inference/model_server.py#L276-L277

namannandan commented 2 months ago

> @namannandan should we just check the process status rather than swallowing the exception? See reference https://github.com/aws/sagemaker-inference-toolkit/blob/master/src/sagemaker_inference/model_server.py#L276-L277

Thanks @visinfo, that makes sense; updated the PR.
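
A hedged sketch of the status-check variant suggested above, again assuming the loop from the linked torchserve.py (the actual updated PR may differ):

    import psutil

    TS_NAMESPACE = "org.pytorch.serve.ModelServer"  # as in the previous sketch (assumed)

    def _retrieve_ts_server_process():
        ts_server_processes = []
        for process in psutil.process_iter():
            # status() can be called safely on a zombie, so filter defunct
            # processes out before calling cmdline(), which would raise
            # psutil.ZombieProcess.
            if process.status() == psutil.STATUS_ZOMBIE:
                continue
            if TS_NAMESPACE in process.cmdline():
                ts_server_processes.append(process)
        return ts_server_processes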

adrien-code-it commented 2 months ago

I'm currently facing this exact issue when trying to deploy a PyTorch model on AWS SageMaker using torch==2.2.0. I saw here that the fix was merged into aws:master two days ago; however, my latest deployment still fails when starting TorchServe with the error: psutil.ZombieProcess: PID still exists but it's a zombie.

When will this fix be available for deploying models? Regards

5agado commented 2 months ago

Like @adrien-code-it, I also tried with a new model and 763104351884.dkr.ecr.eu-central-1.amazonaws.com/pytorch-inference:2.1.0-gpu-py310-cu118-ubuntu20.04-sagemaker as well as 763104351884.dkr.ecr.eu-central-1.amazonaws.com/pytorch-inference:2.2.0-gpu-py310-cu118-ubuntu20.04-sagemaker, and I still get the error.

@namannandan, @visinfo is there something we need to do to deploy using the update? Or when will it be distributed to all instances?

adrien-code-it commented 2 months ago

@5agado I was able to deploy my model by adding a requirements.txt file alongside my inference.py file and specifying that pip should install the latest sagemaker-pytorch-inference-toolkit:
git+https://github.com/aws/sagemaker-pytorch-inference-toolkit.git

Although it's not a permanent solution (I would prefer pulling a fixed version, not the latest), it's working as of now. Moreover, when deploying, it still fails once, but then succeeds the second time.
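
(If a fixed version is preferred over the moving default branch, pip also accepts a Git requirement pinned to a specific tag or commit; <tag-or-commit-sha> below is a placeholder for whichever ref includes the fix:)

    git+https://github.com/aws/sagemaker-pytorch-inference-toolkit.git@<tag-or-commit-sha>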

5agado commented 2 months ago

@adrien-code-it are you deploying the model as an endpoint, or using it in batch transform? I tried the same with the latter, but it doesn't work for me (I think this is related to the "succeeds the second time" aspect you mention).

adrien-code-it commented 2 months ago

@5agado the fix in requirements.txt seems to only work when deploying the model as an endpoint (for inference, in my case) :(

For batch transform, unfortunately I didn't see any fix that works... Maybe @namannandan has a solution?