Closed makaveli10 closed 5 months ago
I tested the docker file and ran a freshly compiled large-v3 engine. This works nicely! :+1:
Thanks for the fix regarding TensorRT 0.9.0, I actually needed this. It fixes the issue where TensorRT 0.7.1 allocated excessive VRAM (~3x the model size). This is no longer the case with 0.9.0.
One thing that I saw, but this is probably worth a separate PR: The VRAM usage is increasing by ~350mb for every new client connection. I hope this will not be the case if I use the single-model option (#223). I can try that tomorrow.
I could verify that with the single model option, there is no memory increase. So this seems to be an issue with model cleanup and re-instantiation only.
@peldszus thanks for confirming here, will look into how we instantiate a new model for every new user.
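The behavior described above (no VRAM growth with the single-model option, ~350 MB growth per connection without it) is consistent with a cached-instance vs. per-client-instantiation pattern. Below is a minimal, hypothetical sketch of that pattern; the names (`ModelFactory`, `loader`, `single_model`) are illustrative and not WhisperLive's actual API:

```python
import threading

class ModelFactory:
    """Hypothetical sketch: share one model instance across all client
    connections instead of re-instantiating per connection. Names are
    illustrative, not the project's real API."""

    def __init__(self, loader):
        self._loader = loader        # callable that builds/loads the model
        self._model = None
        self._lock = threading.Lock()

    def get(self, single_model=True):
        if not single_model:
            # Per-client instantiation: each call allocates fresh VRAM;
            # if cleanup of the old instance is incomplete, usage grows
            # with every new connection.
            return self._loader()
        with self._lock:
            if self._model is None:  # build once, then share
                self._model = self._loader()
            return self._model

# Usage: with single_model=True, every "client" receives the same object,
# so the loader runs exactly once.
calls = []
factory = ModelFactory(loader=lambda: calls.append(1) or object())
a = factory.get()
b = factory.get()
assert a is b
assert len(calls) == 1
```

With per-client instantiation, the factory would instead call the loader on every connection, which matches the observed growth if old instances are not fully released.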
`pip install tensorrt_llm` instead of building `tensorrt_llm` from scratch in the Dockerfile
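As a sketch of what this suggestion could look like, a Dockerfile could install a prebuilt wheel instead of compiling from source. The exact index URL and version pin are assumptions and should be checked against the TensorRT-LLM install docs for the target CUDA version:

```dockerfile
# Hypothetical fragment: install a prebuilt tensorrt_llm wheel rather than
# building from source. The extra index URL and pinned version are
# assumptions; verify them against the official install instructions.
RUN pip install --no-cache-dir tensorrt_llm --extra-index-url https://pypi.nvidia.com
```

This trades build-time flexibility for a much faster image build, at the cost of depending on a wheel that matches the base image's CUDA toolkit.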