awslabs / multi-model-server

Multi Model Server is a tool for serving neural net models for inference
Apache License 2.0
994 stars, 231 forks

Ensure workers get killed on unregister call #942

Open maheshambule opened 4 years ago

maheshambule commented 4 years ago

Issue #, if available:

Orphan processes are created when multiple register and unregister calls are fired on the same model one after another. These orphan worker processes hog system memory.

Description of changes:

  1. Send SIGTERM to the main worker thread from the frontend by calling `destroy` instead of `destroyForcibly`.
  2. Handle SIGTERM in the worker: kill all child workers, then the current process.
  3. Add a SIGCHLD handler to reap zombie processes. Code taken from here: https://github.com/maaquib/multi-model-server/commit/6ed099a203f1bb330982f51e5ea29983bdd78bc2#diff-efa92912588641e6a3b20d8900316be2R167

Testing done:

To run CI tests on your changes, refer to README.md.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

dhanainme commented 3 years ago

@maheshambule, can you please do a small write-up of the testing that was done for this PR?

Please add some context: the behavior before the fix and the behavior after the fix, with steps and logs for both cases.

kastman commented 2 years ago

I'm fairly sure I'm seeing this behavior as well: memory used on invocation/handle doesn't seem to be garbage collected, and I'm quickly hitting 100% even on relatively large 3 x 18xlarge instances in the context of SageMaker multi-model endpoints. The changes from @maheshambule were approved but never merged. Is that because the PR needed documentation? @ayushsengupta1991 @dhanainme, I'm happy to write something if it would get the code accepted upstream.

kastman commented 2 years ago

(The CI failure isn't accessible anymore, or I'd go in and comment.) Hoping to get this bumped or helped along. Thanks.