JohnGiorgi / DeCLUTR

The corresponding code from our paper "DeCLUTR: Deep Contrastive Learning for Unsupervised Textual Representations". Do not hesitate to open an issue if you run into any trouble!
https://aclanthology.org/2021.acl-long.72/
Apache License 2.0

Does the model borrow the initial weights from other models? #148

Closed · masoudh175 closed this issue 4 years ago

masoudh175 commented 4 years ago

Hi,

Thanks for the well written paper! I have two questions:

  1. In the paper it is mentioned that "To make the computational requirements feasible, we do not train from scratch, but rather we continue training a model that has been pre-trained with the MLM objective. Specifically, we use both RoBERTa-base [16] and DistilRoBERTa [49] (a distilled version of RoBERTa-base) in our experiments." Does this mean that the "Training your own model" notebook uses the weights from pre-trained models and then fine-tunes them on my dataset?

  2. If the answer to the first question is yes, then I assume the embeddings for vocabulary items that do not exist in the pre-trained models are trained from scratch. Correct?

Thanks!

JohnGiorgi commented 4 years ago

Does this mean that the "Training your own model" notebook uses the weights from pre-trained models and then fine-tunes them on my dataset?

Yes. Our paper, repo, and example notebooks all begin from DistilRoBERTa or RoBERTa pre-trained checkpoints (for DeCLUTR-small and DeCLUTR-base, respectively) and extend their training with our contrastive objective. "Fine-tune" probably isn't the right term here, though, as it usually denotes supervised training on a small, task-specific dataset.
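For concreteness, here is a minimal sketch of what "starting from a pre-trained checkpoint" means, using the Hugging Face `transformers` API. This is only illustrative, not the actual DeCLUTR training code; the checkpoint names are the ones mentioned in the paper.

```python
# Illustrative sketch only: load the MLM-pre-trained checkpoint that
# DeCLUTR-small starts from (DeCLUTR-base starts from "roberta-base").
from transformers import AutoModel, AutoTokenizer

pretrained_name = "distilroberta-base"

tokenizer = AutoTokenizer.from_pretrained(pretrained_name)
encoder = AutoModel.from_pretrained(pretrained_name)

# These weights are the pre-trained ones; DeCLUTR continues training them
# with its contrastive objective rather than re-initializing them randomly.
print(sum(p.numel() for p in encoder.parameters()), "pre-trained parameters")
```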

If the answer to the first question is yes, then I assume the embeddings for vocabularies that do not exist in the pre-trained models, are trained from scratch. Correct?

No. Both DistilRoBERTa and RoBERTa (and therefore, DeCLUTR-small and DeCLUTR-base) use byte-pair encoding tokenization. Please see this blog for more information. Basically, words that are not in the vocabulary are broken down into smaller subunits that are in the vocabulary (all the way to individual characters, if needed) and then embedded.
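To make this concrete, here is a quick illustration using the Hugging Face `transformers` tokenizer; the word is made up purely so that it is out-of-vocabulary, and the exact split shown in the comment is only an example.

```python
# Sketch of byte-pair-encoding tokenization of an out-of-vocabulary word.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilroberta-base")

# The unknown word is split into subword pieces that *are* in the vocabulary,
# so every piece still has a pre-trained embedding.
print(tokenizer.tokenize("perplexometer"))
# e.g. ['per', 'plex', 'ometer']  (the exact split depends on the vocabulary)
```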

Please let me know if I have answered all your questions!

masoudh175 commented 4 years ago

Yes, I got my answer. Thanks!