@DanFu09
Thanks for adjusting the tone. There are some problems with this PR.
The batch size is wrong. We use different batch sizes for different systems. To compute the effective batch size of FlexGen, you need to multiply the GPU batch size by the number of GPU batches, so it is not simply "24, 72, and 20 for the 6.7B, 30B, and 175B models." A minimal sketch of this computation is included below.
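
For clarity, here is a minimal sketch of how the effective batch size is derived; the function and variable names are illustrative only, not FlexGen's actual config fields, and the example numbers are hypothetical:

```python
def effective_batch_size(gpu_batch_size: int, num_gpu_batches: int) -> int:
    # The GPU processes `gpu_batch_size` examples at a time and cycles
    # through `num_gpu_batches` micro-batches, so the effective batch
    # size is their product, not the GPU batch size alone.
    return gpu_batch_size * num_gpu_batches

# Hypothetical example: a GPU batch size of 24 with 3 GPU batches
# yields an effective batch size of 72, not 24.
print(effective_batch_size(24, 3))  # 72
```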
Offloading is the key feature of FlexGen; compression is not. Why did you move offloading to the last position?