
What if DiffusionLM is initialized with BERT? #40

Open Hzfinfdu opened 1 year ago

Hzfinfdu commented 1 year ago

Hi, Lisa.

Thank you for your wonderful paper and for sharing the code. I noticed in the code that one can initialize the transformer encoder with BERT, and I'm wondering what such initialization brings. Does it help Diffusion-LM converge faster or achieve better generation results? And could initializing with BERT have any negative effect on Diffusion-LM? Thanks!
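
For context, this is roughly the kind of initialization I mean (a minimal sketch using HuggingFace's `transformers`; the exact mechanism in this repo may differ):

```python
from transformers import BertModel

# Load pre-trained BERT and reuse its encoder stack as the denoising
# transformer. Note the embedding layer is not reused here, since
# Diffusion-LM learns its own embeddings.
bert = BertModel.from_pretrained("bert-base-uncased")
encoder = bert.encoder  # 12 transformer layers, hidden size 768
```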

XiangLi1999 commented 1 year ago

Hi,

Thanks for the question! Empirically, initializing with pre-trained BERT parameters doesn't really help. I believe this is because we learn our own embeddings, which differ from BERT's embeddings, so a pre-trained model first has to unlearn the old embeddings and then learn the new ones, which can be a significant modeling burden.

A follow-up question might be: why not use BERT's embeddings for diffusion? We ran an ablation on the impact of embedding dimension, and for Diffusion-LM larger is not better. In particular, BERT's embeddings are 768-dimensional, which is too large for diffusion.
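
To make the mismatch concrete, here is a rough sketch (the vocabulary size and embedding dimension below are illustrative, not the exact values from our configs):

```python
import torch.nn as nn
from transformers import BertModel

# Diffusion-LM learns a fresh, low-dimensional embedding table jointly
# with the diffusion model (sizes here are illustrative).
vocab_size, emb_dim = 821, 16
diffusion_emb = nn.Embedding(vocab_size, emb_dim)
print(diffusion_emb.weight.shape)  # torch.Size([821, 16])

# BERT's word embeddings are tied to its 768-dim hidden size, so they
# cannot be dropped in directly, and a BERT-initialized encoder expects
# 768-dim inputs whose geometry it would first have to unlearn.
bert = BertModel.from_pretrained("bert-base-uncased")
print(bert.embeddings.word_embeddings.weight.shape)  # torch.Size([30522, 768])
```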

Hope this helps.

Best, Lisa

Hzfinfdu commented 1 year ago

Hi,

Thanks for your reply! It helps a lot.