JonasGeiping / cramming

Cramming the training of a (BERT-type) language model into limited compute.
MIT License

Configs for GPT? #41

Closed rexdu2003 closed 9 months ago

rexdu2003 commented 9 months ago

Thanks for your great work! I want to compare BERT with GPT under the same model-size setting, so I wonder if there are any configs for training a GPT-like model. Is it enough to just remove the mask token from the input and change the attention mask and prediction target accordingly?

JonasGeiping commented 9 months ago

Hi, this codebase is not really set up for autoregressive modeling, but yes: making the attention causal (there are remnants of a causal attention mask in the code, though I would check whether it works as expected) and changing the objective would be sufficient, along with changing the dataloader.
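
For reference, here is a minimal sketch (not part of the cramming codebase, just an illustration of the changes described above) of the two model-side pieces: a causal attention mask and a shifted next-token prediction loss. The function names are hypothetical; the mask is an additive one, meant to be added to the pre-softmax attention scores.

```python
import torch
import torch.nn.functional as F


def causal_attention_mask(seq_len: int, device=None) -> torch.Tensor:
    # Additive mask: 0 where attention is allowed, -inf above the diagonal,
    # so position i can only attend to positions <= i.
    mask = torch.full((seq_len, seq_len), float("-inf"), device=device)
    return torch.triu(mask, diagonal=1)


def causal_lm_loss(logits: torch.Tensor, input_ids: torch.Tensor) -> torch.Tensor:
    # Next-token prediction: predict token t+1 from positions <= t,
    # instead of reconstructing [MASK]ed positions as in MLM.
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = input_ids[:, 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
    )
```

On the data side, the dataloader would then simply yield contiguous chunks of token ids without any [MASK] corruption, since every position is a prediction target.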

rexdu2003 commented 9 months ago

Thanks! I will give it a try.