Repository hosting code used to reproduce results in "Actions Speak Louder than Words: Trillion-Parameter Sequential Transducers for Generative Recommendations" (https://arxiv.org/abs/2402.17152).
Compared to the same structure (QKV attention) that I implemented in TensorFlow, the Triton version runs 10 to 20 times slower. With the help of Nsight Systems, I found that cudaMemcpySync takes up much of the time while Triton is executing. Would you happen to have any ideas about that?
I feed data like this:
batch: 8
seq_len: 8192, where every sequence has the same length
emb_size = attn_size = linear_size
I also tried scaling the data size by factors of 2.
Running on an NVIDIA A30.
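For what it's worth, one way to rule out measurement artifacts before blaming the kernel: a minimal, device-agnostic timing sketch (hypothetical helper, not from the repo). When timing GPU code, pass `torch.cuda.synchronize` as `sync` so asynchronous launches and any host/device copies are fully counted inside the measured window; otherwise the copy time can appear elsewhere in the Nsight Systems trace.

```python
import time

def bench(fn, warmup=10, iters=50, sync=lambda: None):
    """Return average wall time per call of fn, in milliseconds.

    sync: a no-op by default; pass torch.cuda.synchronize when
    benchmarking GPU kernels so async work is flushed before and
    after the timed region.
    """
    # Warm up: JIT compilation (Triton autotuning) and caches
    # should not be attributed to steady-state kernel time.
    for _ in range(warmup):
        fn()
    sync()
    t0 = time.perf_counter()
    for _ in range(iters):
        fn()
    sync()
    return (time.perf_counter() - t0) / iters * 1e3
```

A related thing worth checking in the trace: whether the inputs are allocated on the device once and reused, or rebuilt on the host every call, since the latter forces a host-to-device copy per invocation and would match the cudaMemcpy time you're seeing.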