ROCm / flash-attention

Fast and memory-efficient exact attention
BSD 3-Clause "New" or "Revised" License

Reducing the compile time by splitting into several cpp files #7

Closed dejay-vu closed 1 year ago

dejay-vu commented 1 year ago

This is a tentative PR that still has issues on PyTorch 1.13.1, so it is still under development.
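The build-time win comes from splitting one large translation unit into several smaller ones, so the compiler can work on them in parallel. A minimal sketch of what such a `setup.py` might look like (the extension name and the individual file names are hypothetical illustrations, not the actual layout of this PR):

```python
# Hypothetical sketch: list several smaller source files instead of one
# monolithic .cpp so the build system can compile them in parallel.
from setuptools import setup
from torch.utils.cpp_extension import CUDAExtension, BuildExtension

setup(
    name="flash_attn",
    ext_modules=[
        CUDAExtension(
            name="flash_attn_cuda",
            sources=[
                "flash_api.cpp",    # Python bindings only (hypothetical name)
                "fwd_hdim64.cpp",   # one kernel instantiation per file
                "fwd_hdim128.cpp",  # (hypothetical split)
                "bwd_hdim64.cpp",
                "bwd_hdim128.cpp",
            ],
        )
    ],
    cmdclass={"build_ext": BuildExtension},
)
```

With PyTorch's `BuildExtension`, the number of parallel compiler jobs can be raised via the `MAX_JOBS` environment variable, so more, smaller files translate directly into shorter wall-clock builds.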

Tested the elapsed time of `python setup.py install` on ROCm 5.7:

Older version: 26m1.244s on PyTorch 1.13.1

This version: 4m11.111s on PyTorch 1.13.1, 3m39.470s on PyTorch 2.0.1
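The timings above work out to roughly a 6x build-time speedup on PyTorch 1.13.1; a quick check of the arithmetic:

```python
# Convert the reported "XmY.YYYs" build timings to seconds and compare
# them (figures taken from the PR description above).
def to_seconds(t):
    m, s = t.rstrip("s").split("m")
    return int(m) * 60 + float(s)

old = to_seconds("26m1.244s")   # previous single-file build
new = to_seconds("4m11.111s")   # split build on PyTorch 1.13.1
print(f"speedup: {old / new:.1f}x")  # roughly 6.2x
```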

Unit tests passed on ROCm 5.7 + PyTorch 1.13.1:

docker pull compute-artifactory.amd.com:5000/rocm-plus-docker/framework/compute-rocm-dkms-no-npi-hipclang:12505_ubuntu20.04_py3.8_pytorch_release-1.13_85fcc08

2113 passed, 2848 skipped in 119.70s

fsx950223 commented 1 year ago

Please merge in the latest updates and resolve the conflicts.