AI-Hypercomputer / JetStream

JetStream is a throughput and memory optimized engine for LLM inference on XLA devices, starting with TPUs (and GPUs in future -- PRs welcome).
Apache License 2.0

Ensure server warmup before benchmark #91

Closed JoeZijunZhou closed 4 months ago

hosseinsarshar commented 4 months ago

I think 10 seconds is too long; 2-5 seconds worked perfectly for me.

liurupeng commented 4 months ago

After Vivian added the AOT support, could we use that to identify whether the replica has warmed up?

JoeZijunZhou commented 4 months ago

> After Vivian added the AOT support, could we use that to identify whether the replica has warmed up?

Yes, that would be the ideal signal to resolve this issue.

JoeZijunZhou commented 4 months ago

> Instead of sleeping x seconds, can you wait until all the warmup requests return all their tokens?

There is a case where the warmup requests finish before server warmup completes. Vivian is working on getting a server-warmup-complete signal from the engine; this is a temporary workaround.
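Once such a signal exists, the fixed sleep could be replaced by polling it with a timeout. A minimal sketch of that pattern, assuming a hypothetical zero-arg `is_warmed_up` callable standing in for the engine's warmup-complete signal (not an actual JetStream API):

```python
import time


def wait_for_warmup(is_warmed_up, timeout_s=30.0, poll_interval_s=0.5):
    """Block until is_warmed_up() returns True, instead of sleeping a fixed duration.

    is_warmed_up: hypothetical callable standing in for the engine's
    server-warmup-complete signal discussed above.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if is_warmed_up():
            return True
        time.sleep(poll_interval_s)
    raise TimeoutError(f"server did not warm up within {timeout_s}s")


# Illustrative usage with a fake signal that flips after ~1 second.
_ready_at = time.monotonic() + 1.0
wait_for_warmup(lambda: time.monotonic() >= _ready_at)
```

Compared to `time.sleep(10)`, this returns as soon as the server is actually ready (helping the 2-5 second case above) while still bounding the wait for slow warmups.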