Dao-AILab / flash-attention

Fast and memory-efficient exact attention

layernorm/rmsnorm is slow #1092

Open · pillow37 opened this issue 1 month ago

pillow37 commented 1 month ago

Hi, I use layernorm and rmsnorm in my training pipeline on an A100 and observed via the PyTorch profiler that these functions are quite slow. For example, I measured just the rmsnorm via time.time():

The profiler also indicated that there was work done on the CPU, which was somewhat confusing to me.
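For context, the kind of plain time.time() loop I mean looks like the sketch below (the rms_norm_fn import, shapes, and dtype are placeholders for illustration, not the actual pipeline code):

```python
import time
import torch

# Placeholder import: adjust to whichever layernorm/rmsnorm entry point
# from this repo is actually used in the pipeline.
from flash_attn.ops.triton.layer_norm import rms_norm_fn

x = torch.randn(8, 4096, 4096, device="cuda", dtype=torch.bfloat16)
weight = torch.ones(4096, device="cuda", dtype=torch.bfloat16)

t0 = time.time()
for _ in range(100):
    out = rms_norm_fn(x, weight, None, eps=1e-6)
t1 = time.time()
# Note: CUDA kernels launch asynchronously, so without a
# torch.cuda.synchronize() before reading the clock this does not reliably
# reflect GPU execution time, and the first call may also include one-time
# Triton compilation/autotuning.
print((t1 - t0) / 100)
```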

Do you know what the issue could be?

tridao commented 1 month ago

Please don't use time.time(); see https://pytorch.org/tutorials/recipes/recipes/benchmark.html
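A minimal sketch of timing the kernel with torch.utils.benchmark, as the linked recipe describes (the rms_norm_fn import, shapes, and dtype here are assumptions for illustration):

```python
import torch
import torch.utils.benchmark as benchmark

# Assumed entry point; adjust the import to the norm implementation in use.
from flash_attn.ops.triton.layer_norm import rms_norm_fn

x = torch.randn(8, 4096, 4096, device="cuda", dtype=torch.bfloat16)
weight = torch.ones(4096, device="cuda", dtype=torch.bfloat16)

# torch.utils.benchmark warms up the statement and synchronizes CUDA, so
# the reported time reflects GPU execution rather than launch overhead.
timer = benchmark.Timer(
    stmt="rms_norm_fn(x, weight, None, eps=1e-6)",
    globals={"rms_norm_fn": rms_norm_fn, "x": x, "weight": weight},
)
print(timer.blocked_autorange(min_run_time=1.0))
```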