Open LiuXiaoxuanPKU opened 3 weeks ago
What's "context length" here? Which variable?
It will affect the value of max_seqlen_k:
https://github.com/vllm-project/vllm/blob/66e832be41cd3f29bd2b37303ea5944efcb16204/tests/kernels/test_flash_attn.py#L258
max_seqlen_k is a variable on the CPU. After the kernel is captured, changing this value has no effect.
It's the same for other CPU-side variables such as softmax_scale: if the kernel is captured with softmax_scale = 1.0 and you later change softmax_scale to 2.0 and replay the graph, it still behaves as if softmax_scale = 1.0.
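The capture/replay semantics described here can be illustrated with a plain-Python analogy (no CUDA required, and not the actual flash-attn or CUDA graph API): at capture time, CPU scalars are read once and baked into the graph, while GPU buffers are referenced by address, so later writes to the buffers are visible on replay but scalar rebinds are not.

```python
# Plain-Python sketch of CUDA graph capture/replay semantics.
# The list stands in for a GPU tensor (read by reference at replay);
# the plain number stands in for a CPU scalar (read once at capture).

def capture(buffer, scale):
    # scale is a CPU-side scalar: its value is frozen into the graph here.
    frozen_scale = scale

    def replay():
        # buffer is dereferenced at replay time, so updated contents are seen.
        return [x * frozen_scale for x in buffer]

    return replay

buf = [1.0, 2.0]
graph = capture(buf, scale=1.0)

buf[0] = 10.0                    # mutating the buffer IS visible on replay
assert graph() == [10.0, 2.0]

scale = 2.0                      # rebinding the CPU scalar has NO effect
assert graph() == [10.0, 2.0]    # graph still behaves as if scale = 1.0
```

This mirrors why replaying with a new max_seqlen_k or softmax_scale behaves as if the capture-time value were still in effect.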
Thanks!
Then it's a bit weird. We did check the input shapes of all the other variables (q, k, v, cu_seqlens_q, cu_seqlens_k, and block_table). What's the best way to debug it?
You're trying to change a CPU variable after capturing the CUDA graph, which isn't supported by CUDA graphs. I haven't looked closely, but it sounds like the kernel is behaving as expected here. Can you describe what behavior you expect?
Thanks for the great repo! We are testing the correctness of flash_attn_varlen_func with CUDA graphs enabled. This is the test we use: https://github.com/vllm-project/vllm/blob/66e832be41cd3f29bd2b37303ea5944efcb16204/tests/kernels/test_flash_attn.py#L234. We found that the value of the context length affects correctness even when the shapes of all input parameters are the same. This matters because we don't know the context length at graph capture time. Any hints on solving the problem are highly appreciated.
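One pattern that follows from the CPU-scalar discussion above (sketched here in plain Python, not the flash-attn API, and offered as an assumption rather than a confirmed fix): since cu_seqlens_k is a GPU tensor, the actual per-sequence lengths can change between replays; the capture-time CPU scalar can then be fixed at an upper bound that covers every length the graph will ever see.

```python
# Plain-Python sketch (NOT the flash-attn API) of capturing with an
# upper-bound CPU scalar while the real lengths flow through a buffer.

def capture_varlen(lengths_buf, max_seqlen):
    # max_seqlen is a CPU scalar, frozen at capture: pick a worst-case bound.
    upper = max_seqlen

    def replay():
        # Real per-sequence lengths are read from the buffer at replay time,
        # so they may change between replays without recapturing.
        return [min(n, upper) for n in lengths_buf]

    return replay

lengths = [3, 5]
graph = capture_varlen(lengths, max_seqlen=8)  # 8 = assumed worst case

lengths[:] = [7, 2]       # update the "GPU-side" lengths between replays
assert graph() == [7, 2]  # correct, because 8 bounds every actual length
```

Whether capturing with a worst-case max_seqlen_k is safe for this specific kernel is something the maintainers would need to confirm.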