pankajroark opened 1 year ago
The load function runs on a separate thread, and any processes created there die when the thread exits, which happens immediately after the load function finishes. Some models, such as those using vLLM, rely on running the model in a separate process. When these processes get killed after load, the model is no longer available, and predictions fail.

One solution could be to keep the model load thread around for the lifetime of the inference server process. The thread should still be able to exit to allow graceful termination; this can be done by having the thread wait on a queue to which a termination message is posted (see the sketch below).

If we could figure out a way to exit the thread without the child processes dying, that would be better, as the thread has no use once load is done.
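A minimal sketch of the queue-based approach, assuming a hypothetical `start_load_thread`/`shutdown` pair and a `model.load()` hook; none of these names come from the actual server code:

```python
import queue
import threading

# Hypothetical shutdown channel; the real server's load/predict hooks
# and shutdown path will differ.
_shutdown_q: "queue.Queue[None]" = queue.Queue()


def _load_and_wait(model) -> None:
    # Run the (possibly process-spawning) load on this thread.
    model.load()
    # Instead of returning (which would end the thread and, per this
    # issue, take the child processes spawned during load with it),
    # block until a termination message is posted.
    _shutdown_q.get()


def start_load_thread(model) -> threading.Thread:
    # Non-daemon so the thread lives for the server process lifetime.
    t = threading.Thread(target=_load_and_wait, args=(model,))
    t.start()
    return t


def shutdown(load_thread: threading.Thread) -> None:
    # Post the termination message so the load thread can exit and the
    # server can terminate gracefully.
    _shutdown_q.put(None)
    load_thread.join()
```

The key point is that `_load_and_wait` never returns on its own, so the thread (and anything tied to its lifetime) stays alive until `shutdown` is called.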