add kv cache free gpu mem fraction support to engine builder flow

basetenlabs / truss

The simplest way to serve AI/ML models in production

https://truss.baseten.co

MIT License


add kv cache free gpu mem fraction support to engine builder flow #993

Status: Closed (pankajroark closed this 3 months ago)

pankajroark commented 3 months ago

At very high batch sizes, the default KV cache free GPU memory fraction of 0.9 is too high: it does not leave enough GPU memory for non-KV-cache uses. We need to be able to lower it from the engine builder flow.
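To illustrate why a lower value helps, here is a rough sketch of the arithmetic. It assumes the fraction is applied to the GPU memory that is still free once the engine is loaded, which is how I understand the TensorRT-LLM setting to behave. The helper name `split_free_gpu_memory` and the 40 GiB figure are made up for illustration and are not part of truss.

```python
from typing import Tuple

GIB = 1024 ** 3


def split_free_gpu_memory(free_bytes: int, kv_cache_fraction: float = 0.9) -> Tuple[int, int]:
    """Split free GPU memory into (KV cache bytes, bytes left for everything else).

    Hypothetical helper: assumes the fraction applies to memory that is free
    after the engine has been loaded.
    """
    if not 0.0 < kv_cache_fraction <= 1.0:
        raise ValueError("kv_cache_fraction must be in (0, 1]")
    kv_cache_bytes = int(free_bytes * kv_cache_fraction)
    return kv_cache_bytes, free_bytes - kv_cache_bytes


if __name__ == "__main__":
    free = 40 * GIB  # illustrative value, not a measurement
    for fraction in (0.9, 0.8):
        kv, rest = split_free_gpu_memory(free, fraction)
        print(f"fraction={fraction}: kv cache {kv / GIB:.1f} GiB, other uses {rest / GIB:.1f} GiB")
```

Under these assumptions, dropping the fraction from 0.9 to 0.8 doubles the memory left for non-KV-cache allocations, which is the headroom large batches need.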

Testing: Tested manually on dev.