JacoCheung closed this issue 2 months ago
Hi,
For denominator selection, we've found custom denominators (e.g., what you suggested is one possible version) to help in some cases / with respect to certain losses. In our Triton code, attn_scale provides a way to experiment in this direction. We do not have that implemented in PyTorch, as it's not used for the majority of our use cases, but it should be easy to modify the PyTorch code to add it if you've found it helpful.
For padding, we indeed do not use gr_output_length + 1 in this version of the codebase. Given that we target long sequences, the +1 normalization factor should not change experiment results much.
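For illustration, here is a minimal PyTorch sketch of what a per-sequence denominator could look like; the function and argument names (`hstu_attention_scores`, `past_lengths`, `max_len`) are hypothetical and not the actual repository code:

```python
import torch
import torch.nn.functional as F

def hstu_attention_scores(q, k, past_lengths=None, max_len=None):
    """Illustrative sketch only, not the repository's implementation.

    q, k:          [B, T, D] query/key tensors.
    past_lengths:  [B] number of valid (non-padded) ids per sequence; if given,
                   it is used as a per-row denominator (the custom-denominator idea).
    max_len:       fixed normalizer (historical length + gr_output_length + 1),
                   used when past_lengths is None.
    """
    qk = torch.einsum("btd,bsd->bts", q, k)  # [B, T, S] raw scores
    if past_lengths is not None:
        # Per-row denominator: each sequence is scaled by its own valid length.
        scale = 1.0 / past_lengths.clamp(min=1).to(qk.dtype).view(-1, 1, 1)
    else:
        # Fixed denominator, as in the current PyTorch path described above.
        scale = 1.0 / max_len
    return F.silu(qk) * scale
```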
Hi team. In HSTU, attention scores are calculated via SiLU + scaling,
$$ \text{SiLU}(q \cdot k) / n $$
where $n$ = historical length + gr_output_length + 1. However, the $q, k$ here only contain valid historical ids (the padded positions and the gr_output_length + 1 slots are masked out). So I wonder whether the denominator should be rectified into a per-sequence tensor, i.e. `past_length`, so that each sequence is normalized by its own valid length.
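For concreteness, a toy sketch of the concern (all names here are hypothetical, not the actual HSTU implementation):

```python
import torch
import torch.nn.functional as F

B, D = 1, 4
past_length = 3                                  # valid historical ids
gr_output_length = 2
max_len = past_length + gr_output_length + 1     # n = 6

q = torch.randn(B, max_len, D)
k = torch.randn(B, max_len, D)
valid = torch.arange(max_len) < past_length      # padded positions are masked

qk = torch.einsum("btd,bsd->bts", q, k)          # [B, T, S] raw scores
mask = (valid[:, None] & valid[None, :]).unsqueeze(0)

# Current behaviour: masked SiLU scores divided by the fixed n = 6,
# even though only past_length = 3 keys actually contribute.
scores_fixed = F.silu(qk) * mask / max_len

# Suggested rectification: divide by the per-sequence valid length instead
# (here a scalar since B = 1; in general it would be a [B]-shaped tensor).
scores_per_row = F.silu(qk) * mask / past_length
```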
I also found that even though the input sequences are padded from the historical length to historical length + gr_output_length + 1, the padded ids are all masked out throughout processing. So what's the point of appending gr_output_length + 1 empty ids to the historical ids?