jayz0123 closed this 11 months ago
A new environment variable, "FLASH_ATTENTION_INTERNAL_ENABLE_TIME_KERNEL", toggles printing of the kernel running time.
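For illustration, here is a minimal sketch of how the variable might be toggled from Python around a call. The helper name `kernel_timing` and the value "1" are assumptions, not part of this PR; the description does not say which values are accepted, or whether the variable is read at import time or on each kernel launch.

```python
import os
from contextlib import contextmanager

# Variable name taken from the PR description.
TIME_KERNEL_VAR = "FLASH_ATTENTION_INTERNAL_ENABLE_TIME_KERNEL"


@contextmanager
def kernel_timing(value="1"):
    """Temporarily set the timing variable (hypothetical helper).

    Assumption: "1" enables the output. Check the build you are using.
    """
    previous = os.environ.get(TIME_KERNEL_VAR)
    os.environ[TIME_KERNEL_VAR] = value
    try:
        yield
    finally:
        # Restore whatever was there before, so other code is unaffected.
        if previous is None:
            os.environ.pop(TIME_KERNEL_VAR, None)
        else:
            os.environ[TIME_KERNEL_VAR] = previous
```

Wrapping the attention call in `with kernel_timing(): ...` keeps the setting scoped to that call. If the extension reads the variable only once at load time, export it in the shell before starting Python instead.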
[Bug fix] In older versions of FA, we created the z and softmax_lse tensors at the maximum sequence length, with no padding, for grouped GEMM. However, the per-batch strides of these tensors differ, which caused CK to produce wrong results. This PR fixes that.
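To see why a per-batch stride mismatch corrupts results, here is a small sketch that is independent of the actual FA/CK code. It compares where each batch starts in a buffer padded to the maximum sequence length with where it starts in a tightly packed buffer. The function names and the (batch, heads, seqlen) layout are assumptions chosen for illustration only.

```python
from itertools import accumulate


def padded_offsets(seqlens, num_heads):
    """Start offset of each batch when every batch is padded to max(seqlens)."""
    max_len = max(seqlens)
    return [b * num_heads * max_len for b in range(len(seqlens))]


def packed_offsets(seqlens, num_heads):
    """Start offset of each batch when batches are stored back to back."""
    cu_seqlens = [0] + list(accumulate(seqlens))
    return [num_heads * start for start in cu_seqlens[:-1]]


seqlens = [3, 5, 2]
print(padded_offsets(seqlens, num_heads=2))  # [0, 10, 20]
print(packed_offsets(seqlens, num_heads=2))  # [0, 6, 16]
```

Every batch after the first starts at a different offset in the two layouts, so a kernel that assumes one layout while the buffer was allocated in the other reads or writes the wrong elements.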
Please remove *_hip.hpp
Current unit test results (PyTorch 2.0.0; ROCm 5.6): 3968 passed, 63 skipped