FMInference / FlexLLMGen

Running large language models on a single GPU for throughput-oriented scenarios.

Questions about the intermediate tensor buffers design #92

Open Dazz993 opened 1 year ago

Dazz993 commented 1 year ago

Hi Team! Really nice work!

While reading the code, I became a little confused about the design choices for the intermediate tensor buffers.

  1. Could you explain the purpose of cache_home, cache_read_buf, and cache_write_buf? I am wondering why multiple buffers are needed instead of a single one.
  2. I noticed that the KV cache has cache_home, cache_read_buf, and cache_write_buf, but the hidden states have only self.hidden. Could you explain the reason for this difference?
  3. Additionally, I am curious why there is no need for a separate CUDA stream for loading and storing the hidden states.

My basic understanding is that when loading the cache, tensors are copied from cache_home to cache_read_buf, and when storing, tensors are copied from cache_write_buf back to cache_home. But I don't really understand why they cannot simply be modified in place in a single buffer. A rough sketch of the pattern I have in mind is below.
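Just to make my mental model concrete, here is a minimal sketch of the load/compute/store pattern I think is happening, written against plain PyTorch streams. The buffer names come from the repo, but the shapes, the dedicated copy stream, and the stand-in compute step are my own assumptions, so please correct me if this picture is wrong:

```python
# Hypothetical sketch of the copy pattern I have in mind -- not the actual
# FlexLLMGen code. Buffer names follow the repo; shapes, the copy stream,
# and the stand-in "compute" step are assumptions for illustration only.
import torch

assert torch.cuda.is_available()

copy_stream = torch.cuda.Stream()  # separate stream so copies can overlap compute

# "Home" copy lives in pinned CPU memory; read/write buffers live on the GPU.
cache_home = torch.randn(2, 1024, 64, pin_memory=True)
cache_read_buf = torch.empty(cache_home.shape, device="cuda")
cache_write_buf = torch.empty(cache_home.shape, device="cuda")

# 1) Load (prefetch): copy cache_home -> cache_read_buf on the copy stream.
with torch.cuda.stream(copy_stream):
    cache_read_buf.copy_(cache_home, non_blocking=True)

# 2) Compute: the default stream waits for the prefetch, reads the read
#    buffer, and writes the new K/V into the separate write buffer.
torch.cuda.current_stream().wait_stream(copy_stream)
torch.mul(cache_read_buf, 2.0, out=cache_write_buf)  # stand-in for attention

# 3) Store (offload): copy cache_write_buf -> cache_home on the copy stream,
#    after the compute that produced it has been enqueued.
copy_stream.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(copy_stream):
    cache_home.copy_(cache_write_buf, non_blocking=True)
copy_stream.synchronize()
```

In this picture the separate read and write buffers (plus the extra stream) would let the copies for one step overlap with the compute of another step, but I am not sure whether that is the actual reason for the design.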

These confusions may stem from a deliberate design choice or an implementation necessity, or simply from my not understanding the code well enough. I'm looking forward to your answers; thanks in advance!