Closed sohviluukkonen closed 1 month ago
Hi, that was defined empirically. Practically we increased L every time we observed a significative change in the slope of the loss. We didn't follow a precise schedule for the input length.
Thank you for your quick answer!
Hi! In your paper shortly mention the procedure of increasing the sequence length once the loss plateaus:
Unfortunately, I could not find either in the paper or the code, how do you define a plateau? Could you give me some information on this?
Thanks!