Thanks for your great work! I want to compare BERT with GPT at the same model size, so I wonder whether there are any configs for training a GPT-like model. Is it enough to remove the mask token from the input and change the attention mask and prediction target accordingly?
Hi, this code-base is not really set up for autoregressive modeling, but yes, changing the attention to be causal and changing the objective would be sufficient, along with changing the dataloader. (By the way, there are remnants of a causal attention mask implemented, but I would check whether it works as expected.)
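For reference, a minimal sketch of the first two changes in PyTorch (function names here are illustrative, not from this repo; the dataloader side would additionally drop the [MASK] corruption and feed raw token sequences):

```python
import torch
import torch.nn.functional as F

def causal_attention(q, k, v):
    # q, k, v: (batch, heads, seq_len, head_dim)
    seq_len = q.size(-2)
    scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5
    # Causal mask: position i may only attend to positions <= i,
    # so mask out the strictly upper-triangular part of the scores.
    mask = torch.triu(
        torch.ones(seq_len, seq_len, dtype=torch.bool, device=q.device),
        diagonal=1,
    )
    scores = scores.masked_fill(mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

def autoregressive_loss(logits, input_ids):
    # logits: (batch, seq_len, vocab). Instead of predicting masked tokens,
    # predict token t+1 from positions <= t: shift logits and labels by one.
    shift_logits = logits[:, :-1, :]
    shift_labels = input_ids[:, 1:]
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
    )
```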
Thanks! I will give it a try.
rexdu2003 closed this issue 9 months ago.