2021270902001sc opened this issue 11 months ago
The results in the paper are with torch 1.x and no FlashAttention (unless explicitly mentioned). The released code has plenty of performance optimizations and is significantly faster than what is reported in the paper.
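For anyone trying to approximate the paper's "no FlashAttention" setting on torch 2.x, one relevant knob is the attention backend: torch 2.x can silently dispatch `scaled_dot_product_attention` to a FlashAttention kernel. A minimal sketch of forcing the plain math backend (this is a generic PyTorch mechanism, not something specific to this repo; `torch.backends.cuda.sdp_kernel` is the torch 2.0/2.1 API, later deprecated in favor of `torch.nn.attention.sdpa_kernel`):

```python
import torch
import torch.nn.functional as F

# Arbitrary illustrative shapes: (batch, heads, seq_len, head_dim).
q = k = v = torch.randn(4, 8, 128, 64, device="cuda", dtype=torch.float16)

# Disable the fused FlashAttention / memory-efficient kernels so only the
# unfused "math" implementation (closer to torch 1.x behavior) can run.
with torch.backends.cuda.sdp_kernel(enable_flash=False,
                                    enable_mem_efficient=False,
                                    enable_math=True):
    out = F.scaled_dot_product_attention(q, k, v)
```

Note this only controls the attention kernel; the other code-level optimizations mentioned above would still apply.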
Do your results include an ablation over the PyTorch version (i.e., operator-level changes)? I have noticed that Transformer models run significantly faster on torch 2.0 than on torch 1.x.
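A rough way to isolate the PyTorch-version effect would be to run the same timing script under torch 1.x and torch 2.x and compare. A minimal sketch (the shapes and layer sizes are arbitrary illustrative choices, not the paper's configuration):

```python
import time
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"

# A single standard Transformer layer; batch_first requires torch >= 1.9.
layer = nn.TransformerEncoderLayer(d_model=512, nhead=8,
                                   batch_first=True).to(device).eval()
x = torch.randn(32, 256, 512, device=device)  # (batch, seq_len, d_model)

with torch.no_grad():
    for _ in range(10):  # warm-up iterations
        layer(x)
    if device == "cuda":
        torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(100):
        layer(x)
    if device == "cuda":
        torch.cuda.synchronize()
    ms = (time.perf_counter() - t0) / 100 * 1e3
    print(f"torch {torch.__version__}: {ms:.2f} ms/iter")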