Closed: vivianrwu closed this 1 month ago
Do we need to update unit tests?
Unit tests do not need to be updated, because the change is gated on the `engine.warm` condition.
Quick question on the description:
- We set the max pdbs when we start the server. Since this value is within the memory cap (based on a calculation with the devices used), it should not OOM, right?
Yes. I think storing the compiled graphs from AOT, and executing them, is what takes up the memory. We observe the OOM at the generate request.
- Why would a higher actual batch size cause very slow detokenization? Could you share some investigation or profiles?
Yes, you can reference https://github.com/google/JetStream/pull/92 for some of the investigation. I also shared the doc internally.
Verified that the detokenizing generate step time remains the same as JetStream's optimal behavior for all batch sizes.
Did you figure out the root cause of the performance issue and the OOM for AOT?
RCA has been attempted. The root cause of the OOM is potentially the added space needed to store the compiled graphs in executables, on top of saving the cache in the compilation cache directory. The performance issue has not been root-caused; it could be suboptimal AOT executables. I can share the investigation offline.
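The OOM hypothesis above can be sketched numerically: if warmup AOT-compiles one executable per (batch size, prefill length) bucket and retains all of them, the executables' device footprint accumulates on top of the model weights and KV cache. A minimal back-of-envelope sketch (all bucket lists and sizes below are illustrative assumptions, not values measured from JetStream):

```python
# Hypothetical back-of-envelope accounting for AOT warmup memory.
# All numbers below are illustrative assumptions, not measured values.

def aot_warmup_footprint_gb(batch_buckets, prefill_buckets, exec_gb_per_graph):
    """Device memory retained by keeping one compiled executable per bucket."""
    n_graphs = len(batch_buckets) * len(prefill_buckets)
    return n_graphs * exec_gb_per_graph

batch_buckets = [1, 2, 4, 8, 16, 32, 64]    # hypothetical bucketing
prefill_buckets = [128, 256, 512, 1024]     # hypothetical bucketing
per_graph_gb = 0.25                         # hypothetical executable size

total = aot_warmup_footprint_gb(batch_buckets, prefill_buckets, per_graph_gb)
print(f"{len(batch_buckets) * len(prefill_buckets)} retained executables "
      f"~= {total:.1f} GB on top of weights + KV cache")
```

Even modest per-executable sizes multiply quickly across buckets, which is consistent with the OOM appearing only once the full set of compiled graphs is held alongside the serving state.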
Use manual model warmup instead of the AOT-implemented model warmup: with AOT, we observe performance degradation at the higher batch sizes of the maxtext configuration, as mentioned in https://github.com/google/JetStream/pull/92.
It has been verified that the detokenizing generate step time remains the same as JetStream's optimal behavior for all batch sizes.
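The manual warmup described above can be sketched as a loop that issues dummy prefill/generate calls at each bucket, letting compilation happen lazily on first use and discarding the results, rather than AOT-compiling and retaining executables up front. The `Engine` interface here is a hypothetical stand-in, not JetStream's actual API:

```python
# Minimal sketch of manual model warmup: trigger compilation by running
# dummy requests at each bucket, instead of AOT-compiling executables.
# The Engine class below is a hypothetical stand-in, not JetStream's API.

class Engine:
    def __init__(self):
        self.compiled = set()  # buckets whose graphs have been traced

    def prefill(self, batch_size, prefill_len):
        # The first call at a new shape triggers (lazy) JIT compilation.
        self.compiled.add(("prefill", batch_size, prefill_len))

    def generate(self, batch_size):
        self.compiled.add(("generate", batch_size))

def manual_warmup(engine, batch_buckets, prefill_buckets):
    """Run one dummy step per bucket so later requests hit warm graphs."""
    for bs in batch_buckets:
        for plen in prefill_buckets:
            engine.prefill(bs, plen)
        engine.generate(bs)

engine = Engine()
manual_warmup(engine, batch_buckets=[1, 4, 16], prefill_buckets=[128, 512])
print(len(engine.compiled), "graphs warmed")  # 6 prefill + 3 generate = 9
```

The trade-off is that only the buckets exercised by the warmup loop are warm at serving time, but nothing beyond the framework's own compilation cache is retained, avoiding the extra executable storage attributed to the AOT path.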