Open Sovvv opened 1 month ago
Was wondering if it would be possible to get an option, either on the command line or in the launch GUI, that would allow RTX cards to use the alternate implementation of Flash Attention instead of the one relying on tensor cores?
Bringing this up because I have an RTX 2060 6GB that seems to show a performance degradation with the standard implementation. No clue whether the alternative would fix the problem, but I figured I should file a report in case there are others who might benefit from it.
I have an RTX 2060 as well, and I'm getting higher performance with Flash Attention, provided I'm using full GPU offloading and a context of 8192 tokens for 8B.
Perhaps it's reducing performance in partial offload for you? Partial offload FA is pretty slow right now.
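One way to check whether partial offload is the culprit is a quick A/B comparison: benchmark at full offload with Flash Attention on and off, then repeat at partial offload. A sketch below, assuming a koboldcpp-style launcher; the flag names (`--gpulayers`, `--contextsize`, `--flashattention`, `--benchmark`) are assumptions and may differ in your build, so check `--help`:

```shell
# Hypothetical koboldcpp-style invocations -- flag names are assumptions,
# verify against your build's --help output.

# Baseline: full offload (all layers on GPU), Flash Attention on
python koboldcpp.py model.gguf --gpulayers 99 --contextsize 8192 --flashattention --benchmark

# Same full-offload run with Flash Attention off, for comparison
python koboldcpp.py model.gguf --gpulayers 99 --contextsize 8192 --benchmark

# Partial offload (e.g. 20 layers) with FA on -- if only this run regresses,
# the slowdown is in the partial-offload FA path rather than FA itself
python koboldcpp.py model.gguf --gpulayers 20 --contextsize 8192 --flashattention --benchmark
```

If the regression only shows up in the partial-offload run, that would point at the partial-offload FA path rather than the tensor-core implementation.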