Reducing the compile time by splitting into several cpp files

~~This is a tentative PR which has issues on PyTorch 1.13.1 so it is still under development.~~

Tested the elapsed time of "python setup.py install" on ROCm5.7/PyTorch 1.13.1:

Older version: 26m1.244s

This version: 4m11.111s on PyTorch 1.13.1; 3m39.470s on PyTorch 2.0.1
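
To put the timings above in perspective, here is a small sketch that converts the `time`-style stamps to seconds and computes the speedup. The helper name `to_seconds` is made up for illustration, and the sketch assumes the older-version figure is the PyTorch 1.13.1 baseline, so the PyTorch 2.0.1 ratio compares across PyTorch versions and is only indicative.

```python
import re


def to_seconds(stamp: str) -> float:
    """Convert a `time`-style stamp such as '26m1.244s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", stamp)
    if match is None:
        raise ValueError(f"unrecognized time stamp: {stamp!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)


# Baseline: the older version's install time (assumed to be on PyTorch 1.13.1).
baseline = to_seconds("26m1.244s")

for label, stamp in [("PyTorch 1.13.1", "4m11.111s"), ("PyTorch 2.0.1", "3m39.470s")]:
    speedup = baseline / to_seconds(stamp)
    print(f"{label}: {speedup:.2f}x faster")
```

By this arithmetic the install is roughly 6.2x faster on PyTorch 1.13.1 and roughly 7.1x faster on PyTorch 2.0.1.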

Unit tests passed on ROCm5.7 + PyTorch 1.13.1:

docker pull compute-artifactory.amd.com:5000/rocm-plus-docker/framework/compute-rocm-dkms-no-npi-hipclang:12505_ubuntu20.04_py3.8_pytorch_release-1.13_85fcc08

2113 passed, 2848 skipped in 119.70s
