basetenlabs / truss

The simplest way to serve AI/ML models in production
https://truss.baseten.co
MIT License

Add KV cache free GPU memory fraction support to engine builder flow #993

Closed. pankajroark closed this 3 months ago

pankajroark commented 3 months ago

At very high batch sizes, the default KV cache free GPU memory fraction of 0.9 is not sufficient because it doesn't leave enough GPU memory for non-KV-cache uses. We need to be able to lower it.
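To illustrate the trade-off, here is a rough sketch of the arithmetic, assuming the fraction is applied to the GPU memory that is still free after the model weights are loaded. The function name `split_free_gpu_memory` and the 40 GiB figure are hypothetical and only for illustration; they are not part of truss or the engine builder.

```python
def split_free_gpu_memory(free_gib: float, kv_cache_fraction: float) -> tuple[float, float]:
    """Split free GPU memory between the KV cache and everything else.

    Hypothetical helper: assumes the fraction applies to memory that is
    free after model weights are loaded.
    """
    if not 0.0 < kv_cache_fraction <= 1.0:
        raise ValueError("kv_cache_fraction must be in (0, 1]")
    kv_cache_gib = free_gib * kv_cache_fraction
    other_gib = free_gib - kv_cache_gib  # left for activations, workspace, etc.
    return kv_cache_gib, other_gib


# Illustrative numbers only: 40 GiB free after loading the model.
for fraction in (0.9, 0.8):
    kv_gib, other_gib = split_free_gpu_memory(40.0, fraction)
    print(f"fraction={fraction}: KV cache ~{kv_gib:.1f} GiB, other ~{other_gib:.1f} GiB")
```

Under these assumed numbers, lowering the fraction from 0.9 to 0.8 roughly doubles the memory left for non-KV-cache uses, which is the headroom that very large batches need.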

Testing: Tested manually on dev.