aws / sagemaker-inference-toolkit

Serve machine learning models within a 🐳 Docker container using 🧠 Amazon SageMaker.
Apache License 2.0
370 stars 82 forks source link

Fix zombie process exception #133

Closed sachanub closed 8 months ago

sachanub commented 8 months ago

Issue #, if available:

Description of changes:

Due to upgrade of psutil version from 5.9.5 to 5.9.6, customers started facing this error:

"Traceback (most recent call last):
  File ""/usr/local/bin/dockerd-entrypoint.py"", line 21, in <module>
    main()
  File ""/usr/local/bin/dockerd-entrypoint.py"", line 13, in main
    _start_model()
  File ""/usr/local/bin/dockerd-entrypoint.py"", line 9, in _start_model
    model_server.start_model_server(handler_service=""/home/model-server/model_files/model_handler.py:handle"")
  File ""/home/model-server/venv/lib/python3.8/site-packages/sagemaker_inference/model_server.py"", line 102, in start_model_server
    mms_process = _retry_retrieve_mms_server_process(env.startup_timeout)
  File ""/home/model-server/venv/lib/python3.8/site-packages/sagemaker_inference/model_server.py"", line 209, in _retry_retrieve_mms_server_process
    return retrieve_mms_server_process()
  File ""/home/model-server/venv/lib/python3.8/site-packages/retrying.py"", line 49, in wrapped_f
    return Retrying(*dargs, **dkw).call(f, *args, **kw)
  File ""/home/model-server/venv/lib/python3.8/site-packages/retrying.py"", line 212, in call
    raise attempt.get()
  File ""/home/model-server/venv/lib/python3.8/site-packages/retrying.py"", line 247, in get
    six.reraise(self.value[0], self.value[1], self.value[2])
  File ""/home/model-server/venv/lib/python3.8/site-packages/six.py"", line 719, in reraise
    raise value
  File ""/home/model-server/venv/lib/python3.8/site-packages/retrying.py"", line 200, in call
    attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
  File ""/home/model-server/venv/lib/python3.8/site-packages/sagemaker_inference/model_server.py"", line 216, in _retrieve_mms_server_process
    if MMS_NAMESPACE in process.cmdline():
  File ""/home/model-server/venv/lib/python3.8/site-packages/psutil/__init__.py"", line 702, in cmdline
    return self._proc.cmdline()
  File ""/home/model-server/venv/lib/python3.8/site-packages/psutil/_pslinux.py"", line 1650, in wrapper
    return fun(self, *args, **kwargs)
  File ""/home/model-server/venv/lib/python3.8/site-packages/psutil/_pslinux.py"", line 1788, in cmdline
    self._raise_if_zombie()
  File ""/home/model-server/venv/lib/python3.8/site-packages/psutil/_pslinux.py"", line 1693, in _raise_if_zombie
    raise ZombieProcess(self.pid, self._name, self._ppid)"

From version 5.9.6, psutil started correctly raising ZombieProcess on Process.exe(), Process.cmdline() and Process.memory_maps() instead of returning a "null" value. As a result, customers started facing the ZombieProcess exception due to this line in model_server.py: https://github.com/aws/sagemaker-inference-toolkit/blob/master/src/sagemaker_inference/model_server.py#L276

The proposed fix is as follows:

While iterating through the processes via psutil.process_iter(), before checking for the presence of MMS_NAMESPACE in process.cmdline(), we check the status of the process. If it has the zombie status, we skip checking the presence of MMS_NAMESPACE in process.cmdline() for that process, avoiding the ZombieProcess exception.

Testing done:

Launched a BYOC container (utilizes sagemaker_inference in the entrypoint to start MMS). Without the fix, the container stops after 5-10 minutes with the above mentioned exception. With the above mentioned exception, the container did not stop (I kept it running for a few hours).


UPDATE: Added 60 second sleep in the local integration tests after container is started to fully allow the model server to start up. This was needed to avoid failure of local integration tests since they were run before the model server could even finish starting.

Merge Checklist

Put an x in the boxes that apply. You can also fill these out after creating the PR. If you're unsure about any of them, don't hesitate to ask. We're here to help! This is simply a reminder of what we are going to look for before merging your pull request.

General

Tests

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

sagemaker-bot commented 8 months ago

AWS CodeBuild CI Report

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

sagemaker-bot commented 8 months ago

AWS CodeBuild CI Report

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

sagemaker-bot commented 8 months ago

AWS CodeBuild CI Report

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

sagemaker-bot commented 8 months ago

AWS CodeBuild CI Report

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

sagemaker-bot commented 8 months ago

AWS CodeBuild CI Report

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository