Open ex3ndr opened 3 months ago
Changing to 16x16 head dimensions reduces the gap to 10x, but it is still very slow.
Please don't use time.time() to measure time; CUDA operations are asynchronous. You can use torch benchmark instead: https://pytorch.org/tutorials/recipes/recipes/benchmark.html
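For example, a minimal sketch of that recipe (the attention call and tensor shapes here are stand-ins, not the benchmark from this issue):

```python
import torch
import torch.utils.benchmark as benchmark

# placeholder inputs; batch, heads, seqlen, head dim are assumptions
q = torch.randn(8, 16, 1024, 64, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

t = benchmark.Timer(
    stmt="torch.nn.functional.scaled_dot_product_attention(q, k, v)",
    globals={"torch": torch, "q": q, "k": k, "v": v},
)
# Timer synchronizes CUDA around the measurement, so async kernel
# launches are timed correctly, unlike with time.time()
print(t.timeit(100))
```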
@tridao Thank you for catching that. After the fix it is still 4x slower than flash_attn_func:
xformers (mask):    0.00034342713200021533
xformers (no mask): 0.0013367030000081285
torch (mask):       0.0034441131959902123
torch (no mask):    0.0013596494959783741
flash (mask):       0.00034348745294846597
flash (no mask):    0.0013394619610044174
@ex3ndr You should add warmup runs, like this: https://github.com/triton-lang/triton/blob/fd0fa8305c8626dd77cf588336ccdceabe7d8230/python/triton/testing.py#L144
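Roughly what that helper does, as a sketch (the function and argument names below are made up for illustration, not Triton's actual API):

```python
import torch

def bench_with_warmup(fn, warmup=25, rep=100):
    # warm up first: trigger kernel compilation/autotuning and fill caches
    # before anything is measured
    for _ in range(warmup):
        fn()
    torch.cuda.synchronize()

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(rep):
        fn()
    end.record()
    torch.cuda.synchronize()

    # elapsed_time is in milliseconds; return the average per call
    return start.elapsed_time(end) / rep
```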
@zhangjun Thanks! But I am running this code in a notebook, and repeating the cell execution yields similar results.
Hello! I am benchmarking attention implementations and trying to use flash attention for my variable-length data, and for some reason the variable-length path is much, much slower than any other implementation. I am testing on a 4090. No matter how much I warm up or how many times I retry, the results are always about the same.
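For reference, the fixed-length and variable-length entry points pack their inputs differently; a minimal sketch of how the two are typically called (shapes and names here are assumptions, not the benchmark code below):

```python
import torch
from flash_attn import flash_attn_func, flash_attn_varlen_func

batch, seqlen, nheads, headdim = 8, 1024, 16, 64
q = torch.randn(batch, seqlen, nheads, headdim, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# fixed-length path: (batch, seqlen, nheads, headdim) tensors
out_fixed = flash_attn_func(q, k, v)

# variable-length path: flatten to (total_tokens, nheads, headdim)
# and pass cumulative sequence lengths (here all sequences are full length)
q_flat = q.reshape(batch * seqlen, nheads, headdim)
k_flat = k.reshape(batch * seqlen, nheads, headdim)
v_flat = v.reshape(batch * seqlen, nheads, headdim)
cu_seqlens = torch.arange(0, (batch + 1) * seqlen, seqlen,
                          device="cuda", dtype=torch.int32)
out_var = flash_attn_varlen_func(
    q_flat, k_flat, v_flat,
    cu_seqlens_q=cu_seqlens, cu_seqlens_k=cu_seqlens,
    max_seqlen_q=seqlen, max_seqlen_k=seqlen,
)
```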
Benchmark results:
This is the code: