CompVis / latent-diffusion

High-Resolution Image Synthesis with Latent Diffusion Models

Why use only the pre-trained BERT tokenizer but not the entire pre-trained BERT model (including the pre-trained encoder)? #115

Open KevinGoodman opened 2 years ago

KevinGoodman commented 2 years ago

I am not sure why the implementation only uses the tokenizer from Hugging Face but not the pre-trained encoder. I mean, why retrain the BERT-like transformer? Are the text embeddings from the original BERT model not good enough? And why not fine-tune instead of training from scratch?
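For reference, here is a minimal sketch of the alternative being asked about: using the full pre-trained BERT model from Hugging Face (tokenizer plus a frozen encoder) to produce per-token text embeddings for conditioning, rather than only the tokenizer. This is an illustration with the standard `transformers` API, not the repo's `BERTEmbedder`; the `bert-base-uncased` checkpoint and the max length of 77 are assumptions.

```python
# Sketch: conditioning on a frozen pre-trained BERT encoder instead of
# training a BERT-like transformer from scratch. Illustration only, not
# the code in this repo; "bert-base-uncased" is an assumed checkpoint.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoder = BertModel.from_pretrained("bert-base-uncased")
encoder.eval()
for p in encoder.parameters():
    p.requires_grad = False  # keep the pre-trained encoder frozen


@torch.no_grad()
def encode_text(captions, max_length=77):
    # Tokenize and run the pre-trained encoder; the per-token hidden
    # states (B, L, 768) would then be fed to the U-Net's cross-attention.
    batch = tokenizer(captions, padding="max_length", truncation=True,
                      max_length=max_length, return_tensors="pt")
    out = encoder(input_ids=batch["input_ids"],
                  attention_mask=batch["attention_mask"])
    return out.last_hidden_state


context = encode_text(["a photograph of an astronaut riding a horse"])
print(context.shape)  # torch.Size([1, 77, 768])
```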

ziqihuangg commented 1 year ago

Hi, I have a similar question. I noticed that in BERTEmbedder, the embedding for each text token is trainable (requires_grad=True). Is there a particular reason to make the text embeddings trainable? Why not use some powerful pre-trained word embeddings? Thank you!
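To make the contrast concrete, here is a short sketch of the alternative this comment describes: initializing a token-embedding table from BERT's pre-trained word embeddings and freezing it, as opposed to a trainable (requires_grad=True) embedding. Again, this is only an illustration with the Hugging Face API under the assumed `bert-base-uncased` checkpoint, not how `BERTEmbedder` is implemented in this repo.

```python
# Sketch: frozen pre-trained word embeddings instead of a trainable table.
# Illustration only; not the repo's BERTEmbedder.
import torch.nn as nn
from transformers import BertModel

bert = BertModel.from_pretrained("bert-base-uncased")
pretrained = bert.embeddings.word_embeddings.weight.data  # (vocab_size, 768)

token_emb = nn.Embedding.from_pretrained(pretrained, freeze=True)
print(token_emb.weight.requires_grad)  # False -> embeddings are not trained
```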