Closed loubnabnl closed 1 year ago

Hi, I have a question about the training steps you use to compare the standard model (trained just with shuffling) to the target model (trained with RHO-LOSS selection). If I understood this codebase correctly, you train the standard model with a batch size of 320, while in the target-model training each gradient step corresponds to the 32 selected samples.

So in N training steps, the standard model will have seen 10x more tokens than the target model? Why not compare models after a fixed number of seen tokens? In that case the standard model should be trained with a batch size of 32 too.
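For reference, the 10x figure in the question comes from simple counting, under the assumption that the baseline trains on its full batch each step (a quick sketch; the reply below corrects this assumption):

```python
N = 1000                            # any fixed number of training steps
standard_seen = N * 320             # if the baseline used all 320 points per step
target_seen = N * 32                # RHO-LOSS trains on the 32 selected points
print(standard_seen / target_seen)  # 10.0
```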
Hi, the standard/uniform baseline is also trained with a batch size of n_b = 32. The large batch has n_B = 320 points, but we always train on only 10% of it, even when that 10% is selected uniformly at random. So in N training steps, the uniform model and the RHO-LOSS model have seen equally many points.
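To make the point-count argument concrete, here is a minimal sketch of one training step under both conditions. This is a paraphrase of the setup described above, not the repo's actual code; `irreducible_losses` is assumed to hold the precomputed irreducible holdout loss of each point in the large batch:

```python
import torch
import torch.nn.functional as F

N_B, N_b = 320, 32  # large (pre-sample) batch size and selected batch size

def training_step(model, opt, x, y, irreducible_losses, uniform=False):
    """One step: pre-sample n_B points, take a gradient step on n_b of them."""
    if uniform:
        # Uniform baseline: 32 points chosen at random from the 320.
        idx = torch.randperm(x.size(0))[:N_b]
    else:
        # RHO-LOSS: rank by reducible holdout loss and keep the top 10%.
        with torch.no_grad():
            train_losses = F.cross_entropy(model(x), y, reduction="none")
        idx = torch.topk(train_losses - irreducible_losses, N_b).indices
    opt.zero_grad()
    loss = F.cross_entropy(model(x[idx]), y[idx])
    loss.backward()  # either way, the gradient uses exactly n_b = 32 points
    opt.step()
    return loss.item()
```

Either branch takes its gradient step on exactly n_b = 32 points, so after N steps both models have consumed 32N training points; only the selection rule differs.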
Ok, I see, thanks! I assumed it was 320 based on the batch size in this config: https://github.com/OATML/RHO-Loss/blob/4c88851742ce5397153f4fef80abd4682958ac56/configs/standard_training.yaml#L19.
Do you have an idea of how the method generalizes to a higher selection percentage? (In the paper you tested a maximum of 20%.) In LM pre-training, for example, the standard batch size is usually large, e.g. 256/512, so it would be computationally expensive to pre-sample 2560/5120 examples each time.
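A quick sketch of the arithmetic behind those numbers (the helper name is illustrative): with a fixed selection rate, the pre-sample size is the training batch size divided by the selection fraction.

```python
def presample_size(train_batch_size: int, selection_fraction: float) -> int:
    """Large-batch size n_B needed so that keeping `selection_fraction`
    of it yields `train_batch_size` points per gradient step."""
    return round(train_batch_size / selection_fraction)

print(presample_size(256, 0.10))  # 2560
print(presample_size(512, 0.10))  # 5120
```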
We haven't played with the selection percentage much, but generally I'd expect lower speedups. That said, 10% is quite low, and in some cases probably too low, leading to instability, so a higher percentage makes sense to try.
@loubnabnl
@SoerenMind is right that we didn't play around too much with the selection percentage, but we did run a small ablation; you can find it in Appendix F of the paper (also attaching a picture of it below). As you can see, the impact varies and likely depends on, among other things, the dataset size, dataset composition, and batch sizes.