Closed: peldszus closed this 5 months ago
@peldszus Thanks for putting this together. I think it would be useful to make single model mode the default?
> I think it would be useful to make single model mode the default?
It probably depends on your intended audience, but I agree.
For private users with no GPU, or just one, single model mode is probably the best option.
In production environments with higher throughput, perhaps with multiple GPUs on one node, there are different scaling and optimization routes to take, but those could involve single model mode as well.
@makaveli10 If this is fine with you, I can adjust the argument parser and the readme.
@makaveli10 Have a look now, I updated the option's default and the readme accordingly.
Looks good to me. I would also add an option to use a single model when not using a custom model, but I guess that is for a future release, because it is a bit more complicated: the server would have to maintain a dict of instantiated models keyed by model size and clear an entry once no client is using that model size any more.
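For illustration, a rough sketch of what that future option might look like, assuming a faster-whisper backend; `ModelRegistry`, `acquire`, and `release` are hypothetical names, not part of the project's code:

```python
# Hypothetical sketch (not the project's actual code): share one model per
# model size across clients and free it once no client uses that size.
import threading
from faster_whisper import WhisperModel

class ModelRegistry:
    def __init__(self):
        self._lock = threading.Lock()
        self._models = {}      # model_size -> WhisperModel
        self._refcounts = {}   # model_size -> number of clients using it

    def acquire(self, model_size: str) -> WhisperModel:
        """Return the shared model for this size, loading it on first use."""
        with self._lock:
            if model_size not in self._models:
                self._models[model_size] = WhisperModel(model_size)
            self._refcounts[model_size] = self._refcounts.get(model_size, 0) + 1
            return self._models[model_size]

    def release(self, model_size: str) -> None:
        """Drop a client's reference; delete the model once nobody uses it."""
        with self._lock:
            self._refcounts[model_size] -= 1
            if self._refcounts[model_size] == 0:
                del self._models[model_size]
                del self._refcounts[model_size]
```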
I added a mode in which all client connections use the same single model, instead of instantiating a new model for each connection. This only applies if a custom model has been specified at server start (i.e. a trt model or a custom fw model).
For this, a new option has been added, defaulting to false, so that the current behaviour is not changed.
This partially resolves #109, but only for custom models. It does not apply to the fw backend, which dynamically loads standard models based on the client request.
A thread lock is used to make model prediction thread-safe, but this also means that a connection has to wait if another connection is currently predicting.
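To make the description concrete, here is a minimal sketch of the shared-model-plus-lock pattern, assuming a faster-whisper model; `get_model`, `transcribe_audio`, and the `single_model` parameter are illustrative names, not the project's actual API:

```python
# Minimal sketch of the single-model pattern described above
# (names here are illustrative, not the project's actual code).
import threading
from faster_whisper import WhisperModel

shared_model = None
model_lock = threading.Lock()

def get_model(model_path: str, single_model: bool) -> WhisperModel:
    """Return one shared model (loaded once) or a fresh model per connection."""
    global shared_model
    if not single_model:
        return WhisperModel(model_path)          # previous behaviour: load per connection
    with model_lock:                             # guard lazy initialization against races
        if shared_model is None:
            shared_model = WhisperModel(model_path)  # loaded once, kept in VRAM once
        return shared_model

def transcribe_audio(model: WhisperModel, audio):
    """Serialize predictions so concurrent connections don't race on one model."""
    with model_lock:                             # other connections wait here
        segments, info = model.transcribe(audio)
        return list(segments), info
```

The lock is what causes the waiting mentioned above, while loading the model only once is what avoids the repeated startup cost described in the motivation.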
Motivation
I use a large-v3 tensorrt model. It would take about 5 seconds to load for every new client connection. With the single model option, this is reduced to under 1 second. Also, I only want to have the model in VRAM once.