Open ds-kczerski opened 2 days ago
I'm guessing it's because we moved some of the checks and padding (e.g., checking whether headdim is a multiple of 8) from C++ to Python for compatibility with torch.compile. This adds a bit of Python overhead, so it's noticeable for small batches and short sequences (since the kernel itself is very fast there). You can try torch compiling it to reduce the overhead in this case.
What would be helpful is a profiler result (e.g. PyTorch profiler or Nsight Systems) showing the kernel time. If the kernel time stays the same, we can attribute the difference to Python overhead; if the kernel time differs significantly, we'll need to investigate.
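As a sketch of the first option, a torch.profiler run along these lines separates kernel time from CPU-side time (shapes are illustrative, and scaled_dot_product_attention stands in for flash_attn_func so the snippet runs without flash-attn installed):

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Illustrative small-batch / short-sequence shapes: (batch, heads, seqlen, headdim)
device = "cuda" if torch.cuda.is_available() else "cpu"
q = torch.randn(2, 8, 128, 64, device=device)
k = torch.randn(2, 8, 128, 64, device=device)
v = torch.randn(2, 8, 128, 64, device=device)

activities = [ProfilerActivity.CPU]
if device == "cuda":
    activities.append(ProfilerActivity.CUDA)

with profile(activities=activities) as prof:
    for _ in range(10):
        # stand-in for flash_attn_func(q, k, v)
        out = torch.nn.functional.scaled_dot_product_attention(q, k, v)
    if device == "cuda":
        torch.cuda.synchronize()

# Device (kernel) time and CPU-side time show up as separate columns;
# if the kernel column matches across versions, the gap is Python overhead.
print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=5))
```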
Hey, thanks for the quick reply!
I've been profiling with nsys on the A100 and can conclude that it's likely Python overhead, as the kernel times are identical for 2.6.3 and 2.7.0.post2. I'm checking forward/backward passes for the same dimensions mentioned earlier. Unfortunately, the Python overhead becomes quite significant, especially for smaller Q/K lengths and/or batch sizes.
> You can try torch compiling it to reduce the overhead in this case.
Yeah, we should introduce it as a baseline I guess. Will test it soon. ATM, this thread can be closed :) Thanks!
Hey, I have observed in my timing tests that version 2.6.3 is faster than some later commits (including 2.7.0.post2) for the input sizes below. For example, for small batch sizes (== 2) and relatively short sequences, 2.6.3 is up to 2x faster for me in the forward pass.
My setup: 4070 Laptop (CUDA 12) and A100 (CUDA 11), Torch 2.4. Both flash-attn versions were installed via pip directly from PyPI. Below are results measured with a custom Python script using proper CUDA synchronization.
Minimal instructions to replicate:
test_min_example.py
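The attached script isn't reproduced here; for context, a minimal timing loop with correct CUDA synchronization might look like the sketch below (shapes and iteration counts are illustrative, with scaled_dot_product_attention standing in for flash_attn_func):

```python
import time
import torch

def bench(fn, *args, iters=100, warmup=10):
    # Warm up to exclude one-time compilation/allocation costs from the measurement
    for _ in range(warmup):
        fn(*args)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        fn(*args)
    if torch.cuda.is_available():
        # Wait for all queued kernels to finish before stopping the clock;
        # without this, only the (async) launch time would be measured.
        torch.cuda.synchronize()
    return (time.perf_counter() - t0) / iters

device = "cuda" if torch.cuda.is_available() else "cpu"
# Small-batch / short-sequence regime where Python overhead dominates
q = torch.randn(2, 8, 128, 64, device=device)
k, v = torch.randn_like(q), torch.randn_like(q)
avg = bench(torch.nn.functional.scaled_dot_product_attention, q, k, v)
print(f"avg forward time: {avg * 1e6:.1f} us")
```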
Could you please help me understand the source of these timing differences? Going through the source code, the kernel code looks the same, the CUTLASS submodule pointer is the same, and the only changes are in the C++/Python API, relating to head, head_size_og, and padding. Also, my embedding sizes and head counts are divisible by 8.