Hi Team! Really nice work!

I am a little confused about the design choices around the intermediate tensor buffers while reading the code.

Could you explain the purpose of `cache_home`, `cache_read_buf`, and `cache_write_buf`? I am wondering why multiple buffers are needed instead of a single one.

I also noticed that the KV cache has `cache_home`, `cache_read_buf`, and `cache_write_buf`, but the hidden states only have `self.hidden`. Could you explain the reason for this difference?

Additionally, I am curious why there is no need for a separate CUDA stream for loading and storing the hidden states.

My basic understanding: when loading the cache, the tensor is copied from `cache_home` to `cache_read_buf`, and when storing, the tensor is copied from `cache_write_buf` back to `cache_home`. But I don't really understand why the cache cannot simply be modified in a single buffer.
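Here is a rough sketch of the flow as I currently picture it. The names (`CacheSlot`, `load_cache`, `store_cache`, etc.) are made up for illustration; this is not the actual implementation, just how I read the pattern:

```python
import torch

class CacheSlot:
    """Hypothetical illustration of the home / read_buf / write_buf pattern."""

    def __init__(self, seq_len, hidden, home_device="cpu", compute_device="cpu"):
        # cache_home: the persistent copy of the KV cache on its "home" device
        # (e.g., CPU or disk when the cache is offloaded).
        self.cache_home = torch.zeros(seq_len, hidden, device=home_device)
        # Separate staging buffers used around compute.
        self.cache_read_buf = None   # filled by load, consumed by compute
        self.cache_write_buf = None  # filled by compute, drained by store
        self.compute_device = compute_device

    def load_cache(self):
        # Load: home -> read buffer on the compute device.
        self.cache_read_buf = self.cache_home.to(self.compute_device, copy=True)

    def compute(self, pos, new_kv):
        # Attention would read from cache_read_buf here; the newly generated
        # K/V for this step goes into the write buffer, not back into read_buf.
        self.cache_write_buf = (pos, new_kv)

    def store_cache(self):
        # Store: write buffer -> home.
        pos, new_kv = self.cache_write_buf
        self.cache_home[pos].copy_(new_kv.to(self.cache_home.device))
        self.cache_write_buf = None


# Example: one decoding step with the three buffers.
slot = CacheSlot(seq_len=8, hidden=4)
slot.load_cache()                          # home -> read_buf
slot.compute(pos=0, new_kv=torch.ones(4))  # compute fills write_buf
slot.store_cache()                         # write_buf -> home
```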
These confusions may be due to some deliberate design or necessity in the implementation, or they may simply come from my not understanding the code well enough. I'm very much looking forward to your answers; thanks in advance!