Normally, for production on pegasus, we should run assuming we have at least 18GB of RAM and most of the CPU cores. However, I'm proposing we create three different profiles we can switch between (e.g., we might run LOWGPU on unicorn, but HIGHQUALITY on pegasus). A rough config sketch follows the three profile descriptions:
HIGHQUALITY: Assumes we have a high-end machine all to ourselves; focused on the highest-quality results for a small number of queries at a time. This is what we should run for CSUN, for example, on the assumption that we may get considerable use, but probably not a huge number of simultaneous queries within any 5s window.
Target max time for a single query: 5s
MULTIUSER: Assumes we are getting multiple queries simultaneously. Reduces result quality as necessary to support n simultaneous queries without an OOM condition. n=5 is probably a good place to start (an average of 1 query/s). We may have to switch to this during CSUN if usage is higher than expected.
Target max time for a single query: 5s
LOWGPU: Minimizes or completely eliminates GPU use. Can be used for testing on a local machine during debugging, or on a low-end server. Times may get long, and quality may be significantly reduced.
Target max time for a single query: 10s
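To make the switching concrete, here is a minimal sketch of what the three profiles could look like as a config object. Everything below is an assumption on my part (the field names, the values other than the target times, the idea of a frozen dataclass); only the profile names and target max times come from the descriptions above.

```python
# Hypothetical sketch only: profile names and target times are from the
# proposal above, but every field name and the other values are assumptions.
from dataclasses import dataclass

@dataclass(frozen=True)
class Profile:
    name: str
    max_query_seconds: float     # target max time for a single query
    max_concurrent_queries: int  # how many simultaneous queries we size for (n)
    use_gpu: bool                # whether handlers may touch the GPU at all

PROFILES = {
    "HIGHQUALITY": Profile("HIGHQUALITY", max_query_seconds=5.0,
                           max_concurrent_queries=1, use_gpu=True),
    "MULTIUSER":   Profile("MULTIUSER",   max_query_seconds=5.0,
                           max_concurrent_queries=5, use_gpu=True),
    "LOWGPU":      Profile("LOWGPU",      max_query_seconds=10.0,
                           max_concurrent_queries=1, use_gpu=False),
}
```

The active profile could then be selected at startup by a flag or environment variable (a hypothetical IMAGE_SERVER_PROFILE, say), so switching between machines like unicorn and pegasus is just a deployment setting.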
Levers we can adjust (a sketch tying some of these to the profiles follows this list):
Model sizes
Choosing to run some models on CPU rather than GPU
Reducing graphic size before it even gets to the preprocessors
Reducing TTS quality
Reducing audio spatialization quality
Turning off some handlers (e.g., semseg)
???
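As an illustration of how the first few levers could be tied to the profiles, here is a sketch keyed by profile name. It is purely illustrative: the handler names other than semseg, the model-size labels, the device strings, and the pixel limits are placeholders I made up, not decisions.

```python
# Sketch of driving some levers from the active profile. All names and numbers
# here (MODEL_CHOICES, MAX_EDGE, the size labels) are hypothetical placeholders.
from PIL import Image

# Levers: model size and CPU-vs-GPU placement per handler, per profile.
# A value of None for the model size means the handler is turned off entirely.
MODEL_CHOICES = {
    "HIGHQUALITY": {"captioner": ("large", "cuda"), "semseg": ("base", "cuda")},
    "MULTIUSER":   {"captioner": ("base", "cuda"),  "semseg": ("small", "cuda")},
    "LOWGPU":      {"captioner": ("small", "cpu"),  "semseg": (None, "cpu")},
}

# Lever: shrink the graphic before it ever reaches the preprocessors.
MAX_EDGE = {"HIGHQUALITY": 2048, "MULTIUSER": 1024, "LOWGPU": 768}

def downscale_for_profile(img: Image.Image, profile: str) -> Image.Image:
    """Cap the longest edge of the incoming graphic; the limits are guesses."""
    limit = MAX_EDGE[profile]
    if max(img.size) > limit:
        img = img.copy()
        img.thumbnail((limit, limit))  # shrinks in place, keeps aspect ratio
    return img
```

TTS and spatialization quality would presumably follow the same pattern: a per-profile setting that the corresponding handler reads at query time.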
Future extensions:
Move to multiple servers and load balancing
Don't know how feasible it is, but dynamically moving between these profiles based on system load would allow max quality when the system is not fully loaded, and reduce quality automatically when many requests are coming in (see the rough sketch below)
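To gauge how feasible that last point is, here is a very rough sketch of a load-based switch. It assumes a process-wide count of in-flight queries (not something we track today, as far as I know) and uses psutil for free RAM; the thresholds are invented, and a real version would need hysteresis so it doesn't flap between profiles, plus probably a GPU-memory check.

```python
# Rough sketch of automatic profile selection; the thresholds and the in-flight
# counter are assumptions, not measurements from pegasus or unicorn.
import psutil

def choose_profile(in_flight_queries: int) -> str:
    """Pick a profile name from the current load. All thresholds are guesses."""
    free_gb = psutil.virtual_memory().available / 1e9
    if free_gb < 4:              # not enough headroom for the GPU-heavy models
        return "LOWGPU"
    if in_flight_queries >= 2:   # several queries landing in the same window
        return "MULTIUSER"
    return "HIGHQUALITY"
```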