basetenlabs / truss

The simplest way to serve AI/ML models in production
https://truss.baseten.co
MIT License

Any processes started in the load function of Truss are not available for predict #464

Open pankajroark opened 1 year ago

pankajroark commented 1 year ago

The load function runs on a separate thread, and any processes created there die when the thread exits, which happens immediately after load finishes. Some models, such as those using vllm, rely on running the model in a separate process. When these processes are killed after load, the model is no longer available for prediction, and predictions fail.
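To illustrate the pattern that breaks, here is a minimal sketch of a model whose load starts a worker process and whose predict talks to it over a pipe, in the vllm style. The class and worker names are illustrative, not Truss's actual internals:

```python
import multiprocessing as mp

def _worker(conn):
    # Hypothetical model worker running in its own process,
    # as vllm-style setups do: serve requests until told to stop.
    while True:
        req = conn.recv()
        if req is None:
            break
        conn.send(f"prediction for {req}")

class Model:
    def load(self):
        # Truss runs load on a separate thread; the worker process
        # started here must outlive that thread for predict to work.
        self._conn, child = mp.Pipe()
        self._proc = mp.Process(target=_worker, args=(child,), daemon=True)
        self._proc.start()

    def predict(self, inputs):
        # Forward the request to the worker and wait for its reply.
        self._conn.send(inputs)
        return self._conn.recv()
```

If the worker is tied to the lifetime of the load thread, every later predict call finds a dead process on the other end of the pipe.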

pankajroark commented 1 year ago

One solution could be to keep the model-load thread around for the lifetime of the inference server process. The thread should still exit eventually to allow for graceful termination; this can be done by having the thread wait on a queue to which a termination message can be posted.

If we could figure out a way to exit the thread without the child processes dying, that would be better, as the thread has no use once load is done.
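The queue-based approach could look roughly like the sketch below. This is one possible shape, not Truss's implementation; the function and variable names are made up:

```python
import queue
import threading

def run_load_then_wait(load_fn, shutdown_q):
    # Run load, then block on the queue so the thread (and any
    # child processes tied to it) stays alive until the server
    # posts a termination message.
    load_fn()
    shutdown_q.get()  # returns only when termination is requested

loaded = threading.Event()
shutdown_q = queue.Queue()
t = threading.Thread(target=run_load_then_wait, args=(loaded.set, shutdown_q))
t.start()

loaded.wait()         # load has completed; the thread is still alive
# ... serve predictions here ...
shutdown_q.put(None)  # graceful termination: unblock and let the thread exit
t.join()
```

The blocking `shutdown_q.get()` is what keeps the thread parked after load finishes, and posting any message to the queue is enough to let it exit cleanly during server shutdown.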