FMInference / FlexLLMGen

Running large language models on a single GPU for throughput-oriented scenarios.
Apache License 2.0

Pass over README #28

Closed. DanFu09 closed this 1 year ago.

merrymercy commented 1 year ago

@DanFu09 Thanks for adjusting the tone. There are some problems with this PR.

  1. The batch size is wrong. We use different batch sizes for different systems. To compute the batch size of FlexGen, you need to multiply the GPU batch size by the number of GPU batches, so it is not simply "24, 72, and 20 for 6.7B, 30B, 175B models." (see the sketch after this list).
  2. Offloading is the key feature of FlexGen; compression is not. Why did you move Offloading to the last position?
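
A minimal sketch of the effective batch size calculation described in point 1. The variable names and values below are hypothetical placeholders for illustration, not FlexGen's actual configuration or published benchmark settings:

```python
# Hypothetical values: a per-GPU batch size and a number of GPU batches.
gpu_batch_size = 24      # micro-batch processed on the GPU at one time (hypothetical)
num_gpu_batches = 3      # number of such batches kept in flight (hypothetical)

# The effective batch size is the product of the two, so quoting only the
# GPU batch size understates the batch size the system actually runs with.
effective_batch_size = gpu_batch_size * num_gpu_batches
print(effective_batch_size)  # 72, not 24
```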