Are you using the latest commit? There's a recent update to enable causal for the backward. Can you profile to get the time for the attention kernel?
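If it helps, a minimal sketch of getting a kernel-level breakdown with torch.profiler (illustrative only; the shapes, dtype, and the FA2 import are placeholders, and you would swap in the FA3 interface for the comparison):

```python
# Illustrative sketch: time the attention fwd/bwd kernels with torch.profiler.
# Shapes and dtype are placeholders; any CUDA profiler (e.g. Nsight Systems) works too.
import torch
from torch.profiler import profile, ProfilerActivity
from flash_attn import flash_attn_func  # FA2; compare against the FA3 interface

# (batch, seqlen, nheads, headdim) in bf16 on the GPU
q = torch.randn(2, 4096, 32, 128, device="cuda", dtype=torch.bfloat16, requires_grad=True)
k = torch.randn_like(q, requires_grad=True)
v = torch.randn_like(q, requires_grad=True)

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    out = flash_attn_func(q, k, v, causal=True)
    out.sum().backward()
    torch.cuda.synchronize()

# Look for the flash attention forward/backward kernels in the breakdown.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=15))
```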
When I started to configure the environment, I also encountered problem #1091. After fixing that issue, I ran the tests following a successful configuration around August 1st. When was the new commit you mentioned submitted? Was I using the latest commit?
OK, I'll use this commit to test it again.
How about the performance? When I pretrain deepseek-v2 on an H100-80G, I hit the same issue (FA3 is slower than FA2).
Can you profile to get the time for the attention kernel?
Sorry, I've been busy with other things recently. We may wait until FA3 is officially released before using it.
Same issue when fine-tuning both llama3 and qwen2 models. FA3 takes more time and slightly more GPU memory (not sure) than FA2. I replaced the same function, flash_attn_varlen_func,
in transformers/modeling_flash_attention_utils.py, switching it from FA2 to FA3. Maybe that is not the right way :(
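For what it's worth, a minimal sketch of that kind of swap (assuming the FA3 hopper build is importable as `flash_attn_interface`, that its `flash_attn_varlen_func` takes no `dropout_p`, and that some versions return an `(out, softmax_lse)` tuple; the wrapper name here is made up):

```python
# Hypothetical monkeypatch: route transformers' FA2 varlen call to the FA3 kernel.
# Verify the import path and signature against the FA3 version you actually built.
import transformers.modeling_flash_attention_utils as fa_utils
from flash_attn_interface import flash_attn_varlen_func as fa3_varlen_func  # FA3 (hopper)

def fa3_varlen_wrapper(q, k, v, cu_seqlens_q, cu_seqlens_k,
                       max_seqlen_q, max_seqlen_k,
                       dropout_p=0.0, softmax_scale=None, causal=False, **kwargs):
    # The FA3 hopper kernel has no attention dropout; extra kwargs (e.g. sliding
    # window) are simply ignored in this sketch.
    assert dropout_p == 0.0, "FA3 kernel does not implement attention dropout"
    out = fa3_varlen_func(q, k, v, cu_seqlens_q, cu_seqlens_k,
                          max_seqlen_q, max_seqlen_k,
                          softmax_scale=softmax_scale, causal=causal)
    # Some FA3 versions return (out, softmax_lse); keep only the output tensor.
    return out[0] if isinstance(out, tuple) else out

fa_utils.flash_attn_varlen_func = fa3_varlen_wrapper  # swap FA2 -> FA3
```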
Hello, I applied FA3 when fine-tuning the qwen2 model on an H800 machine. Under the same conditions, the test was slower than FA2.
I used FlashAttnFunc.forward from the hopper/flash_attn_interface.py file to replace Qwen2Attention.forward, adding the necessary code to flash_attn_interface.py.
Then I turned off
attn_implementation="flash_attention_2"
in the fine-tuning code and imported the modified part. Setup: 1*H800-80G, 32 CPUs, 256 GB memory; qwen2-7b, 45k samples, 6.5k training sequence length.
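As a point of comparison, a rough hypothetical sketch of that kind of FA3 wrapper (not the exact code I used; it assumes the hopper build is importable as `flash_attn_interface`, that `flash_attn_func` wraps `FlashAttnFunc`, that inputs are `(batch, seqlen, nheads, headdim)` in fp16/bf16, and that some versions return `(out, softmax_lse)`):

```python
# Hypothetical illustration only: call the FA3 hopper kernel from a Qwen2-style
# attention layer. Check the import path and return type of your FA3 build.
import torch
from flash_attn_interface import flash_attn_func as fa3_flash_attn_func

def qwen2_fa3_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                        softmax_scale=None) -> torch.Tensor:
    # Causal masking, since Qwen2 is a decoder-only model.
    out = fa3_flash_attn_func(q, k, v, softmax_scale=softmax_scale, causal=True)
    # Some FA3 versions return (out, softmax_lse); keep only the attention output.
    return out[0] if isinstance(out, tuple) else out
```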
With FA3, the speed is about 34 s/it,
but with FA2 it is about 24 s/it,
and not much difference in memory usage was observed.
May I ask if I did something wrong? Thank you.