aws / sagemaker-pytorch-inference-toolkit

Toolkit for inference and serving with PyTorch on SageMaker. Dockerfiles used for building SageMaker PyTorch Containers are at https://github.com/aws/deep-learning-containers.
Apache License 2.0

Ignore zombie processes when detecting TorchServe status #166

Closed namannandan closed 2 months ago

namannandan commented 2 months ago

Description of changes: When checking to see if the TorchServe process is running, we iterate through the current list of running processes using psutil: https://github.com/aws/sagemaker-pytorch-inference-toolkit/blob/36a842e374766a088f21906ce17496e88a140a1b/src/sagemaker_pytorch_serving_container/torchserve.py#L183-L188

Calling the cmdline() psutil API on a zombie process raises the psutil.ZombieProcess exception. This unhandled exception causes TorchServe to be stopped, which is not the expected behavior in the DLC: https://github.com/aws/deep-learning-containers/tree/master/pytorch/inference

  File "/opt/conda/lib/python3.10/site-packages/sagemaker_pytorch_serving_container/torchserve.py", line 187, in _retrieve_ts_server_process
    if TS_NAMESPACE in process.cmdline():
  File "/opt/conda/lib/python3.10/site-packages/psutil/__init__.py", line 719, in cmdline
    raise ImportError(msg)
  File "/opt/conda/lib/python3.10/site-packages/psutil/_pslinux.py", line 1714, in wrapper
    ret['name'] = name
  File "/opt/conda/lib/python3.10/site-packages/psutil/_pslinux.py", line 1853, in cmdline
    stime = float(values['stime']) / CLOCK_TICKS
  File "/opt/conda/lib/python3.10/site-packages/psutil/_pslinux.py", line 1758, in _raise_if_zombie
    name = decode(name)
psutil.ZombieProcess: PID still exists but it's a zombie (pid=9)

We can ignore zombie processes when detecting the presence of a running TorchServe process. Reference: https://psutil.readthedocs.io/en/latest/#psutil.ZombieProcess
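
A minimal sketch of the kind of change described above, assuming the loop structure of the linked _retrieve_ts_server_process (the TS_NAMESPACE value and the return value are assumptions based on that source; the merged change may differ in detail):

    import psutil

    # Value assumed from the toolkit's torchserve.py; TorchServe's frontend
    # runs under this Java namespace.
    TS_NAMESPACE = "org.pytorch.serve.ModelServer"

    def _retrieve_ts_server_process():
        ts_server_processes = []
        for process in psutil.process_iter():
            try:
                # cmdline() raises psutil.ZombieProcess for defunct processes;
                # skip them instead of letting the exception propagate and
                # stop TorchServe.
                if TS_NAMESPACE in process.cmdline():
                    ts_server_processes.append(process)
            except psutil.ZombieProcess:
                continue
        return ts_server_processes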

Tests:

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

visinfo commented 2 months ago

@namannandan should we just check the process status rather than swallowing the exception? See reference https://github.com/aws/sagemaker-inference-toolkit/blob/master/src/sagemaker_inference/model_server.py#L276-L277

namannandan commented 2 months ago

> @namannandan should we just check the process status rather than swallowing the exception? See reference https://github.com/aws/sagemaker-inference-toolkit/blob/master/src/sagemaker_inference/model_server.py#L276-L277

Thanks @visinfo, that makes sense; updated the PR.
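
A hedged sketch of the status-check variant suggested above, again assuming the loop from the linked torchserve.py (the actual updated PR may differ):

    import psutil

    TS_NAMESPACE = "org.pytorch.serve.ModelServer"  # as in the previous sketch (assumed)

    def _retrieve_ts_server_process():
        ts_server_processes = []
        for process in psutil.process_iter():
            # status() can be called safely on a zombie, so filter defunct
            # processes out before calling cmdline(), which would raise
            # psutil.ZombieProcess.
            if process.status() == psutil.STATUS_ZOMBIE:
                continue
            if TS_NAMESPACE in process.cmdline():
                ts_server_processes.append(process)
        return ts_server_processes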

adrien-code-it commented 2 months ago

I'm currently facing this exact issue when trying to deploy a PyTorch model on AWS SageMaker using torch==2.2.0. I saw here that the fix was merged into aws:master two days ago; however, my latest deployment still fails when starting TorchServe with the error: psutil.ZombieProcess: PID still exists but it's a zombie.

When will this fix be available for deploying models? Regards

5agado commented 2 months ago

Like @adrien-code-it, I also tried with a new model and 763104351884.dkr.ecr.eu-central-1.amazonaws.com/pytorch-inference:2.1.0-gpu-py310-cu118-ubuntu20.04-sagemaker as well as 763104351884.dkr.ecr.eu-central-1.amazonaws.com/pytorch-inference:2.2.0-gpu-py310-cu118-ubuntu20.04-sagemaker, and I still get the error.

@namannandan, @visinfo is there something we need to do to deploy using the update? Or when will it be distributed to all instances?

adrien-code-it commented 2 months ago

@5agado I was able to deploy my model by adding a requirements.txt file alongside my inference.py file and specifying that pip should install the latest sagemaker-pytorch-inference-toolkit:
git+https://github.com/aws/sagemaker-pytorch-inference-toolkit.git

Although it's not a permanent solution (I would prefer pulling a fixed version, not the latest), it's working as of now. Moreover, when deploying, it still fails once, but then succeeds the second time.
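
(If a fixed version is preferred over the moving default branch, pip also accepts a Git requirement pinned to a specific tag or commit; <tag-or-commit-sha> below is a placeholder for whichever ref includes the fix:)

    git+https://github.com/aws/sagemaker-pytorch-inference-toolkit.git@<tag-or-commit-sha>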

5agado commented 2 months ago

@adrien-code-it are you deploying the model as an endpoint, or using it in batch transform? I tried the same with the latter, but it doesn't work for me (I think this is related to the "succeeds the second time" aspect you mention).

adrien-code-it commented 2 months ago

@5agado the fix in requirements.txt seems to only work when deploying the model as an endpoint (for inference, in my case) :(

For batch transform, unfortunately I didn't see any fix that works... Maybe @namannandan has a solution?