facebookresearch / generative-recommenders

Repository hosting code used to reproduce results in "Actions Speak Louder than Words: Trillion-Parameter Sequential Transducers for Generative Recommendations" (https://arxiv.org/abs/2402.17152).

[Question] attention score scaling and gr_output_length #79

Closed JacoCheung closed 2 months ago

JacoCheung commented 2 months ago

Hi team. In HSTU, attention scores are calculated via SiLU + scaling:

$$ \text{SiLU}(q \cdot k) / n $$

where n = historical length + gr_output_length + 1. However, the $q, k$ here only contain valid historical ids (the padded positions and the gr_output_length + 1 slots are masked out). So I wondered whether the denominator should instead be a per-sample tensor, i.e. past_lengths (see the sketch below).

I also found that even though the input sequences are padded from historical length to historical length + gr_output_length + 1, the padded ids are masked out throughout the whole forward pass. So what is the point of appending the empty gr_output_length + 1 ids to the historical ids?
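
For concreteness, here is a rough PyTorch sketch of what I mean; the shapes and names (e.g. past_lengths) are just illustrative, not the actual tensors or API in the repo:

```python
import torch
import torch.nn.functional as F

# Illustrative shapes only.
B, N, D = 2, 8, 16                      # batch, padded length (history + gr_output_length + 1), head dim
q = torch.randn(B, N, D)
k = torch.randn(B, N, D)
v = torch.randn(B, N, D)
past_lengths = torch.tensor([5, 3])     # number of valid historical ids per sample (hypothetical)

# Validity mask: only the first past_length positions of each row are real ids.
valid = torch.arange(N)[None, :] < past_lengths[:, None]      # [B, N]
attn_mask = (valid[:, :, None] & valid[:, None, :]).float()   # [B, N, N]

scores = torch.einsum("bnd,bmd->bnm", q, k)

# Scaling as described above: SiLU(q.k) divided by the fixed padded length n = N.
attn_fixed = F.silu(scores) / N * attn_mask

# Suggested rectification: divide by each sample's valid length instead.
attn_rect = F.silu(scores) / past_lengths.clamp(min=1)[:, None, None] * attn_mask

out_fixed = torch.einsum("bnm,bmd->bnd", attn_fixed, v)
out_rect = torch.einsum("bnm,bmd->bnd", attn_rect, v)
```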

jiaqizhai commented 2 months ago

Hi,

For denominator selection, we've found custom denominators (e.g., what you suggested is one possible version) to help in some cases / with respect to certain losses. In our Triton code, attn_scale provides a way to experiment in this direction. We have not implemented that in PyTorch, as it is not used for the majority of our use cases, but it should be easy to modify the PyTorch code to add it if you find it helpful.
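
As a rough illustration only (the function name and signature below are made up for this sketch, not the actual PyTorch module in the repo), an attn_scale-style hook could look like:

```python
from typing import Optional

import torch
import torch.nn.functional as F


def hstu_pointwise_attn(
    q: torch.Tensor,                            # [B, N, D]
    k: torch.Tensor,                            # [B, N, D]
    v: torch.Tensor,                            # [B, N, D]
    attn_mask: torch.Tensor,                    # [B, N, N], 0/1 validity mask
    attn_scale: Optional[torch.Tensor] = None,  # broadcastable to [B, N, N]
) -> torch.Tensor:
    """Sketch: if attn_scale is None, use the fixed 1/N factor; otherwise
    apply the caller-supplied scale (e.g. 1/past_length per row)."""
    n = q.shape[1]
    scores = F.silu(torch.einsum("bnd,bmd->bnm", q, k))
    if attn_scale is None:
        scores = scores / n
    else:
        scores = scores * attn_scale
    scores = scores * attn_mask
    return torch.einsum("bnm,bmd->bnd", scores, v)
```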

For padding, we indeed do not use the gr_output_length + 1 positions in this version of the codebase. Given that we target long sequences, the +11 normalization factor should not change experiment results much.