Repository hosting code used to reproduce results in "Actions Speak Louder than Words: Trillion-Parameter Sequential Transducers for Generative Recommendations" (https://arxiv.org/abs/2402.17152).
Compared to the same structure (QKV attention) that I implemented in TensorFlow, the Triton version runs 10 to 20 times slower. With the help of Nsight Systems, I found that cudaMemcpySync takes up much of the time while Triton is executing. Would you happen to have any ideas about that?
I feed data like this:
batch: 8
seq_len: 8192, where every sequence has the same length
emb_size = attn_size = linear_size
I also tried scaling the data size by factors of 2.
Running on an NVIDIA A30.
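For what it's worth, one way to rule out measurement artifacts before blaming the kernel: a minimal, device-agnostic timing sketch (hypothetical helper, not from the repo). When timing GPU code, pass `torch.cuda.synchronize` as `sync` so asynchronous launches and any host/device copies are fully counted inside the measured window; otherwise the copy time can appear elsewhere in the Nsight Systems trace.

```python
import time

def bench(fn, warmup=10, iters=50, sync=lambda: None):
    """Return average wall time per call of fn, in milliseconds.

    sync: a no-op by default; pass torch.cuda.synchronize when
    benchmarking GPU kernels so async work is flushed before and
    after the timed region.
    """
    # Warm up: JIT compilation (Triton autotuning) and caches
    # should not be attributed to steady-state kernel time.
    for _ in range(warmup):
        fn()
    sync()
    t0 = time.perf_counter()
    for _ in range(iters):
        fn()
    sync()
    return (time.perf_counter() - t0) / iters * 1e3
```

A related thing worth checking in the trace: whether the inputs are allocated on the device once and reused, or rebuilt on the host every call, since the latter forces a host-to-device copy per invocation and would match the cudaMemcpy time you're seeing.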