Open jzhang38 opened 1 year ago
Hi,
Thank you for your question! We include both models in Figure 3. The red curve, which is rather close to our DiffusionBERT, stands for an AR model trained from scratch, and the green one for finetuned GPT2. In general, DiffusionBERT still falls behind pretrained AR models in terms of generation quality.
Best, Zhengfu
Dear authors,
Thanks for open-sourcing your wonderful work.
You mention GPT in Figure 3 when comparing the Pareto front across different models ("AR models of the same size"). May I ask if this is a pre-trained GPT (e.g. GPT2-small) finetuned on the LM1B dataset, or a model with the GPT architecture trained from scratch on the LM1B training set?