HPPinata closed this 2 weeks ago
This might still need a bit more work; hold off for now.
@tazlin This should now be fine for a merge. Performance is as expected, memory usage is slightly better than with the present implementation, and conda performance is roughly in line with the docker version. Stability is also improved; I'm seeing a process recovery rate of <1%.
Once support is merged upstream, this will get another minor rework, but that might be months off. The upstream change in question is a newer (and less janky) version of flash_attn.
A bit more testing is required around changing the PyTorch version to 2.5.0 without breaking older setups.
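One way to test that without breaking older setups would be to gate the new behavior on the installed torch version rather than hard-pinning it; the sketch below shows that idea, not what this PR actually does:

```python
# Hedged sketch: keep the existing code path on PyTorch < 2.5.0 so older
# environments continue to work while 2.5.0 is being validated.
from packaging import version

import torch

def torch_is_at_least(minimum: str = "2.5.0") -> bool:
    # torch.__version__ may carry a local tag like "2.5.0+rocm6.2";
    # packaging compares it correctly against the plain release number.
    return version.parse(torch.__version__) >= version.parse(minimum)

if torch_is_at_least("2.5.0"):
    pass  # enable the newer integration
else:
    pass  # keep the current behavior for older setups
```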
Potential improvements:
- `FLASH_ATTENTION_USE_TRITON_ROCM=FALSE` (currently the Triton backend isn't built on old cards no matter what, but the variable has to be set to `TRUE` to avoid errors on compatible GPUs; one way to set it per-GPU is sketched below)
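A minimal sketch of how that per-GPU switch could look, assuming `gcnArchName` is exposed by the ROCm PyTorch build and that the listed gfx architectures are the compatible ones; both the detection method and the architecture list are assumptions, not something specified here:

```python
# Hedged sketch: choose FLASH_ATTENTION_USE_TRITON_ROCM based on the detected
# GPU architecture. The variable must be set before flash_attn is imported.
import os

import torch

# Assumed list of Triton-capable architectures; adjust for your hardware.
TRITON_CAPABLE_ARCHES = ("gfx90a", "gfx942")

def configure_flash_attn_backend() -> None:
    use_triton = False
    # torch.version.hip is None on CUDA builds and set on ROCm builds.
    if torch.cuda.is_available() and torch.version.hip is not None:
        arch = torch.cuda.get_device_properties(0).gcnArchName
        use_triton = any(arch.startswith(a) for a in TRITON_CAPABLE_ARCHES)
    # TRUE on compatible GPUs (required to avoid errors), FALSE everywhere else.
    os.environ["FLASH_ATTENTION_USE_TRITON_ROCM"] = "TRUE" if use_triton else "FALSE"

configure_flash_attn_backend()
```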