Closed siyuanseever closed 4 months ago
You may want to check the section "How to choose the group_size and neighbor_window". It covers how to select the two hyperparameters, group size and window size. Different models may follow different empirical rules, but in any case window_size = training_size/2 is too large.
ReRoPE is a special case of SelfExtend with group_size = +∞ (or any large enough value), rather than with window_size = training_size - 1. You may also refer to "How to choose the group_size and neighbor_window" for more results. Some of the settings in that section are close to ReRoPE, and SelfExtend performs better in those comparisons. Our paper also discusses the relationship between SelfExtend and existing methods such as T5, iRPE and ReRoPE. You can take a look at it for more details.
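To make the special-case claim concrete, here is a minimal sketch of the relative-position mapping only (not the actual attention code from either repository). It assumes the SelfExtend mapping as I understand it from the paper: normal relative positions inside the neighbor window, and floor-divided positions plus a shift outside it. It also assumes ReRoPE clips the relative position at its window. The function names `self_extend_rel_pos` and `rerope_rel_pos` are hypothetical, chosen for illustration.

```python
def self_extend_rel_pos(q_pos: int, k_pos: int, group_size: int, neighbor_window: int) -> int:
    """Relative position used for a (query, key) pair under the assumed SelfExtend mapping."""
    distance = q_pos - k_pos
    if distance < neighbor_window:
        # Inside the neighbor window: ordinary relative position.
        return distance
    # Outside the window: grouped positions, shifted so they continue from the window edge.
    shift = neighbor_window - neighbor_window // group_size
    return q_pos // group_size - k_pos // group_size + shift


def rerope_rel_pos(q_pos: int, k_pos: int, window: int) -> int:
    """Relative position under the assumed ReRoPE mapping: clip at the window."""
    return min(q_pos - k_pos, window)


def mappings_agree(seq_len: int, window: int, group_size: int) -> bool:
    """Check every causal (query, key) pair in a sequence of length seq_len."""
    return all(
        self_extend_rel_pos(q, k, group_size, window) == rerope_rel_pos(q, k, window)
        for q in range(seq_len)
        for k in range(q + 1)
    )


if __name__ == "__main__":
    # Any group_size at least as large as the sequence length behaves like +infinity here.
    print(mappings_agree(seq_len=64, window=16, group_size=10**9))
    # A small group_size gives a genuinely different mapping.
    print(mappings_agree(seq_len=64, window=16, group_size=4))
```

With a very large group_size, every floor-divided position becomes 0, so all distances beyond the neighbor window collapse to the window value, which is the clipping behavior assumed for ReRoPE. With a finite group_size, distant tokens still receive distinct (grouped) positions.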
Differences with ReRoPE