JacoCheung closed this issue 2 months ago
Hi,
For denominator selection, we've found custom denominators (e.g., what you suggested is one possible version) to help in some cases / with respect to certain losses. In our Triton code, attn_scale provides a way to experiment in this direction. We do not have that implemented in PyTorch, as it's not used for the majority of our use cases, but it should be easy to modify the PyTorch code to add it if you've found it helpful.
For padding, we indeed do not use gr_output_length + 1 in this version of the codebase. Given that we target long sequences, the +1 normalization factor should not change experiment results much.
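For illustration, here is a minimal PyTorch sketch of what a per-sequence denominator could look like; the function and argument names (`hstu_attention_scores`, `past_lengths`, `max_len`) are hypothetical and not the actual repository code:

```python
import torch
import torch.nn.functional as F

def hstu_attention_scores(q, k, past_lengths=None, max_len=None):
    """Illustrative sketch only, not the repository's implementation.

    q, k:          [B, T, D] query/key tensors.
    past_lengths:  [B] number of valid (non-padded) ids per sequence; if given,
                   it is used as a per-row denominator (the custom-denominator idea).
    max_len:       fixed normalizer (historical length + gr_output_length + 1),
                   used when past_lengths is None.
    """
    qk = torch.einsum("btd,bsd->bts", q, k)  # [B, T, S] raw scores
    if past_lengths is not None:
        # Per-row denominator: each sequence is scaled by its own valid length.
        scale = 1.0 / past_lengths.clamp(min=1).to(qk.dtype).view(-1, 1, 1)
    else:
        # Fixed denominator, as in the current PyTorch path described above.
        scale = 1.0 / max_len
    return F.silu(qk) * scale
```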
Hi team. In HSTU, attention scores are calculated via SiLU + scaling,
$$ \text{SiLU}(q \cdot k) / n $$
where $n$ = historical length + gr_output_length + 1. However, the $q, k$ here only contain valid historical ids (the padded positions and the gr_output_length + 1 slots are masked out). So I wonder whether the denominator should be rectified into a per-sequence tensor, i.e. `past_length`, so that each sequence is normalized by its own valid length.
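For concreteness, a toy sketch of the concern (all names here are hypothetical, not the actual HSTU implementation):

```python
import torch
import torch.nn.functional as F

B, D = 1, 4
past_length = 3                                  # valid historical ids
gr_output_length = 2
max_len = past_length + gr_output_length + 1     # n = 6

q = torch.randn(B, max_len, D)
k = torch.randn(B, max_len, D)
valid = torch.arange(max_len) < past_length      # padded positions are masked

qk = torch.einsum("btd,bsd->bts", q, k)          # [B, T, S] raw scores
mask = (valid[:, None] & valid[None, :]).unsqueeze(0)

# Current behaviour: masked SiLU scores divided by the fixed n = 6,
# even though only past_length = 3 keys actually contribute.
scores_fixed = F.silu(qk) * mask / max_len

# Suggested rectification: divide by the per-sequence valid length instead
# (here a scalar since B = 1; in general it would be a [B]-shaped tensor).
scores_per_row = F.silu(qk) * mask / past_length
```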
I also found that even though the input sequences are padded from the historical length to historical length + gr_output_length + 1, the padded ids are all masked out throughout processing. So what's the point of appending gr_output_length + 1 empty ids to the historical ids?