Dao-AILab / flash-attention

Fast and memory-efficient exact attention

question on the block table #1314

chakpongchung opened this issue 1 day ago

chakpongchung commented 1 day ago

https://github.com/Dao-AILab/flash-attention/blob/478ee666cccbd1b8f63648633003059a8dc6827d/tests/test_flash_attn.py#L2066

Could you elaborate on the block_table argument here? I am trying to find an example of how flash_attn_with_kvcache should be used. Specifically, why does the user need to care about the logical-block to physical-block mapping when passing the KV cache, given that this function updates the cache in place? How should we construct the block table from the KV shape? I assume the block table shape depends only on the KV shape.

tridao commented 23 hours ago

You can read the function docstring https://github.com/Dao-AILab/flash-attention/blob/478ee666cccbd1b8f63648633003059a8dc6827d/flash_attn/flash_attn_interface.py#L1492

Typically this is used with a cache manager (e.g. in vLLM) that decides when to allocate and free blocks and constructs the block table. Such a cache manager is not implemented here, since it depends on how you build the inference engine.
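
For reference, here is a minimal sketch (not from the repo) of how the shapes fit together for a paged KV cache, following the flash_attn_with_kvcache docstring. The specific sizes, the dummy tensor values, and the trivial contiguous block assignment are assumptions for illustration only; a real engine would obtain the physical block indices from its cache manager.

```python
# Illustrative sketch, not part of the library. Shapes follow the
# flash_attn_with_kvcache docstring for a paged KV cache; the toy
# "allocator" (contiguous block assignment) is an assumption.
import torch
from flash_attn import flash_attn_with_kvcache

batch_size, nheads, nheads_k, headdim = 2, 8, 8, 128
page_block_size = 256                              # size of one physical KV block (page)
max_seqlen = 1024                                  # capacity we reserve per sequence
max_blocks_per_seq = max_seqlen // page_block_size
num_blocks = batch_size * max_blocks_per_seq       # total physical blocks in the pool

# Physical KV pool: (num_blocks, page_block_size, nheads_k, headdim)
k_cache = torch.zeros(num_blocks, page_block_size, nheads_k, headdim,
                      dtype=torch.float16, device="cuda")
v_cache = torch.zeros_like(k_cache)

# Block table: (batch_size, max_blocks_per_seq), int32. Row b lists the
# physical block indices backing sequence b's logical blocks, in order.
# Here we simply hand out blocks contiguously; a cache manager would
# normally decide this mapping.
block_table = torch.arange(num_blocks, dtype=torch.int32, device="cuda").reshape(
    batch_size, max_blocks_per_seq)

# Number of tokens already stored in the cache for each sequence.
cache_seqlens = torch.tensor([100, 250], dtype=torch.int32, device="cuda")

# One decoding step: a new query and new K/V for one token per sequence.
q = torch.randn(batch_size, 1, nheads, headdim, dtype=torch.float16, device="cuda")
k_new = torch.randn(batch_size, 1, nheads_k, headdim, dtype=torch.float16, device="cuda")
v_new = torch.randn_like(k_new)

# The function writes k_new/v_new into the paged cache in place (at position
# cache_seqlens, routed through block_table) and attends over the cached
# prefix plus the new token.
out = flash_attn_with_kvcache(
    q, k_cache, v_cache,
    k=k_new, v=v_new,
    cache_seqlens=cache_seqlens,
    block_table=block_table,
    causal=True,
)
```

So per the docstring, the block table has shape (batch_size, max_num_blocks_per_seq): it depends on the batch size and on how many logical blocks each sequence may need, not only on the K/V tensor shape. The KV pool itself is just a flat set of num_blocks physical pages that the block table indexes into, which is why the mapping is left to an external cache manager.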