Attention backends are selected here: https://github.com/JonasGeiping/cramming/blob/196c9912d8c5b06a05e9a58edd1521e3d38f7c0c/cramming/architectures/attention.py#L18, and PyTorch SDPA is already an option, although the gains from FlashAttention are small at the default sequence length of 128. Happy to accept PRs for more backends.
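For reference, a minimal sketch of what an SDPA-based attention block can look like with `torch.nn.functional.scaled_dot_product_attention` (which dispatches to FlashAttention or memory-efficient kernels when they are available). This is not the cramming implementation linked above; the class name, constructor arguments, and layer layout here are illustrative assumptions only.

```python
# Minimal sketch (not the cramming code): a self-attention block built on
# PyTorch's fused scaled_dot_product_attention. PyTorch selects the fastest
# available backend (FlashAttention, memory-efficient, or math) at runtime.
from typing import Optional

import torch
import torch.nn as nn
import torch.nn.functional as F


class SDPASelfAttention(nn.Module):
    """Self-attention using the fused SDPA kernel (illustrative example)."""

    def __init__(self, hidden_size: int, num_heads: int, dropout: float = 0.0):
        super().__init__()
        assert hidden_size % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = hidden_size // num_heads
        self.qkv = nn.Linear(hidden_size, 3 * hidden_size)
        self.proj = nn.Linear(hidden_size, hidden_size)
        self.dropout = dropout

    def forward(self, x: torch.Tensor, attn_mask: Optional[torch.Tensor] = None) -> torch.Tensor:
        B, S, H = x.shape
        # Project to q, k, v and reshape to (B, num_heads, S, head_dim).
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = q.view(B, S, self.num_heads, self.head_dim).transpose(1, 2)
        k = k.view(B, S, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.view(B, S, self.num_heads, self.head_dim).transpose(1, 2)
        # Fused attention call; dropout is only applied during training.
        y = F.scaled_dot_product_attention(
            q, k, v, attn_mask=attn_mask,
            dropout_p=self.dropout if self.training else 0.0,
        )
        # Merge heads back into the hidden dimension and project out.
        y = y.transpose(1, 2).reshape(B, S, H)
        return self.proj(y)
```

At short sequence lengths like 128, attention is a relatively small fraction of total compute, which is why switching kernels yields only modest end-to-end speedups here.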
Hi, great project!
Are there any plans to implement/support FlashAttention 1, 2, or 3, or SDPA?
Cheers.