chakpongchung opened this issue 1 day ago
https://github.com/Dao-AILab/flash-attention/blob/478ee666cccbd1b8f63648633003059a8dc6827d/tests/test_flash_attn.py#L2066

Could you elaborate more on the block table argument here? I am trying to find an example showing how `flash_attn_with_kvcache` should be used. Specifically, why does the user need to care about the logical-block-to-physical-block mapping when passing the KV cache, given that this function updates the cache in place? How should we construct the block table from the KV cache shape? I assume the block table's shape depends only on the KV shape.

You can read the function docstring: https://github.com/Dao-AILab/flash-attention/blob/478ee666cccbd1b8f63648633003059a8dc6827d/flash_attn/flash_attn_interface.py#L1492

Typically this is used with a cache manager (e.g. in vLLM) that decides when to allocate and free blocks and constructs the block table. Such a cache manager is not implemented here, since it depends on how you build the inference engine.
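
To make this concrete, below is a minimal sketch (an illustration, not code from this repo or from vLLM) of the bookkeeping such a cache manager does before calling `flash_attn_with_kvcache` with a paged KV cache. It follows the layout described in the docstring: `k_cache`/`v_cache` become a pool of physical blocks of shape `(num_blocks, page_block_size, nheads_k, headdim)`, and `block_table` has shape `(batch_size, max_num_blocks_per_seq)` with dtype `torch.int32`. All sizes, the `free_blocks` list, and the variable names are made up for illustration:

```python
# Minimal sketch: constructing a block table for flash_attn_with_kvcache with a paged KV cache.
import torch
from flash_attn import flash_attn_with_kvcache

# Illustrative sizes.
batch_size, nheads, nheads_k, headdim = 2, 8, 8, 128
page_block_size = 256   # tokens per physical block; the docstring at this commit requires a multiple of 256
num_blocks = 16         # physical blocks in the shared pool
max_seqlen = 1024
max_blocks_per_seq = (max_seqlen + page_block_size - 1) // page_block_size

device, dtype = "cuda", torch.float16

# Paged KV cache: a pool of physical blocks, not one contiguous (batch, seqlen, ...) buffer.
k_cache = torch.zeros(num_blocks, page_block_size, nheads_k, headdim, device=device, dtype=dtype)
v_cache = torch.zeros_like(k_cache)

# Naive "cache manager": hand out free physical blocks in order.
# Real engines (e.g. vLLM) track free lists, reference counts, eviction, etc.
free_blocks = list(range(num_blocks))

# block_table: (batch_size, max_blocks_per_seq), int32; row b maps sequence b's
# logical pages 0, 1, 2, ... to physical block indices into k_cache/v_cache.
block_table = torch.zeros(batch_size, max_blocks_per_seq, dtype=torch.int32, device=device)

# Number of tokens already cached per sequence (in a real engine these would have been
# written by earlier prefill/decode calls; here the cache is just zeros).
cache_seqlens = torch.tensor([300, 17], dtype=torch.int32, device=device)
new_tokens = 1  # decode step: append one token per sequence

for b in range(batch_size):
    # Allocate enough pages to hold the existing tokens plus the ones about to be appended.
    pages_needed = (int(cache_seqlens[b]) + new_tokens + page_block_size - 1) // page_block_size
    for logical_page in range(pages_needed):
        block_table[b, logical_page] = free_blocks.pop(0)
    # Entries past pages_needed are never read, so leaving them at 0 is fine.

q = torch.randn(batch_size, new_tokens, nheads, headdim, device=device, dtype=dtype)
k_new = torch.randn(batch_size, new_tokens, nheads_k, headdim, device=device, dtype=dtype)
v_new = torch.randn_like(k_new)

# The kernel appends k_new/v_new in place at position cache_seqlens, locating the right
# physical block through block_table, then attends over the cached + new tokens.
out = flash_attn_with_kvcache(
    q, k_cache, v_cache,
    k=k_new, v=v_new,
    cache_seqlens=cache_seqlens,
    block_table=block_table,
    causal=True,
)
print(out.shape)  # (batch_size, new_tokens, nheads, headdim)
```

Note that, going by the docstring, the block table's shape is not determined by the KV cache shape alone: its first dimension is the batch size and its second is the maximum number of pages any sequence may need (roughly `max_seqlen / page_block_size`), while `num_blocks` only bounds how many entries the manager can hand out. The caller has to care about the logical-to-physical mapping because the physical blocks are allocated dynamically and shared across sequences, so the kernel can only find (and update in place) the right block for each logical position through the table passed in.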