SOOJEONGKIMM / Paper_log

papers to-read list + issue

RoBERTa: A Robustly Optimized BERT Pretraining Approach #9

Closed SOOJEONGKIMM closed 1 year ago

SOOJEONGKIMM commented 1 year ago

RoBERTa: A Robustly Optimized BERT Pretraining Approach

https://arxiv.org/pdf/1907.11692.pdf

SOOJEONGKIMM commented 1 year ago

RoBERTa: a replication study of BERT showing that BERT was significantly undertrained. => careful tuning of hyperparameters and training data size. (1) training longer with bigger batches over more data, (2) removing the NSP loss, (3) training on longer sequences, (4) dynamically changing the masking pattern.

SOOJEONGKIMM commented 1 year ago

Result: matches or exceeds the performance of every model published after BERT. => SOTA on GLUE, RACE, and SQuAD. => highlights the importance of previously overlooked design choices!

SOOJEONGKIMM commented 1 year ago

Static vs Dynamic Masking: original BERT uses static masking, performed once during data preprocessing. To avoid using the same mask for each training instance in every epoch, the training data is duplicated 10 times so that each sequence is masked in 10 different ways over the 40 epochs of training. Dynamic masking instead generates a new masking pattern every time a sequence is fed to the model.
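A minimal sketch of dynamic masking (my own illustration in PyTorch; the token ids and the 80/10/10 split follow the standard BERT masking recipe, not the paper's exact code):

```python
import torch

MASK_ID = 4          # assumed <mask> token id
VOCAB_SIZE = 50265   # assumed byte-level BPE vocabulary size

def dynamic_mask(input_ids, mlm_prob=0.15):
    """Sample a fresh masking pattern every time a batch is built."""
    input_ids = input_ids.clone()
    labels = input_ids.clone()
    # Choose ~15% of positions to predict.
    selected = torch.bernoulli(torch.full(input_ids.shape, mlm_prob)).bool()
    labels[~selected] = -100                                  # loss only on selected positions
    # 80% of selected positions are replaced by <mask>.
    masked = torch.bernoulli(torch.full(input_ids.shape, 0.8)).bool() & selected
    input_ids[masked] = MASK_ID
    # 10% are replaced by a random token; the remaining 10% keep the original token.
    random = torch.bernoulli(torch.full(input_ids.shape, 0.5)).bool() & selected & ~masked
    input_ids[random] = torch.randint(VOCAB_SIZE, input_ids.shape)[random]
    return input_ids, labels
```

Calling this in the data loader (rather than in preprocessing) means every epoch sees a different masking of the same sequence.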

SOOJEONGKIMM commented 1 year ago

Model Input Format and Next Sentence Prediction:
SENTENCE-PAIR+NSP: each input is a pair of natural sentences, usually much shorter than 512 tokens, so the batch size is increased to keep the total number of tokens comparable.
FULL-SENTENCES: each input is packed with full sentences sampled contiguously from one or more documents, with a total length of at most 512 tokens; an extra separator token is added between documents; trained without the NSP loss.
DOC-SENTENCES: similar to FULL-SENTENCES, but inputs may not cross document boundaries; inputs sampled near the end of a document can be shorter than 512 tokens, so the batch size is dynamically increased. (See the packing sketch below.)
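A minimal sketch of FULL-SENTENCES packing (my own illustration; `SEP_ID`, the function name, and the input layout are placeholders, not the paper's code):

```python
SEP_ID = 2       # assumed separator token id (</s> in RoBERTa's vocabulary)
MAX_LEN = 512    # token budget per input

def pack_full_sentences(documents):
    """documents: list of documents, each a list of token-id lists (one per sentence).
    Sentences longer than MAX_LEN are not split in this sketch."""
    packed, buffer = [], []
    for doc in documents:
        for sent in doc:
            if buffer and len(buffer) + len(sent) > MAX_LEN:
                packed.append(buffer)       # emit an input of at most 512 tokens
                buffer = []
            buffer.extend(sent)
        if len(buffer) < MAX_LEN:
            buffer.append(SEP_ID)           # extra separator when crossing a document boundary
    if buffer:
        packed.append(buffer)
    return packed
```

DOC-SENTENCES would be the same loop, except the buffer is flushed at every document boundary instead of inserting a separator.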

SOOJEONGKIMM commented 1 year ago

Training with large batches: very large mini-batches (with an appropriately increased learning rate) improve masked-LM perplexity and end-task accuracy at equivalent computational cost, and are easier to parallelize.
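A sketch of reaching a large effective batch via gradient accumulation, which the paper mentions as a way to do large-batch training without large-scale parallel hardware (the code and the HuggingFace-style `model(...)` call are my own assumptions):

```python
ACCUM_STEPS = 32   # effective batch = per-step batch size * ACCUM_STEPS

def accumulate_and_step(model, optimizer, micro_batches):
    """micro_batches: an iterable of ACCUM_STEPS (input_ids, labels) tensor pairs."""
    optimizer.zero_grad()
    for input_ids, labels in micro_batches:
        # Average the loss so the update matches one big batch.
        loss = model(input_ids=input_ids, labels=labels).loss / ACCUM_STEPS
        loss.backward()        # gradients accumulate in .grad across micro-batches
    optimizer.step()           # one optimizer update for the whole large batch
```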

SOOJEONGKIMM commented 1 year ago

Text Encoding: byte-level Byte-Pair Encoding (BPE) uses bytes instead of unicode characters as the base subword units, so any input can be encoded without additional preprocessing or tokenization rules. Byte-level BPE achieves slightly worse end-task performance on some tasks, but the advantages of a universal encoding scheme outweigh the minor degradation.
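A short sketch using the HuggingFace `tokenizers` library, which ships a byte-level BPE implementation (the library, the file path, and the 50K vocabulary size here are illustrative assumptions; the paper follows GPT-2's byte-level BPE with roughly 50K subword units):

```python
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(files=["corpus.txt"], vocab_size=50000)  # "corpus.txt" is a placeholder path

# Any unicode string can be encoded with no unknown tokens and no language-specific pre-tokenization.
enc = tokenizer.encode("RoBERTa handles emoji 🙂 and accented words naïvely.")
print(enc.tokens)
```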

SOOJEONGKIMM commented 1 year ago

RoBERTa: Robustly optimized BERT approach, combining: dynamic masking, the FULL-SENTENCES input format without the NSP loss, large mini-batches, and a larger byte-level BPE vocabulary (see the config sketch at the end of this comment).

Two important factors (overlooked in previous work): (1) the data used for pretraining, and (2) the number of training passes through the data. e.g., XLNet is pretrained on nearly 10 times more data than BERT, with a batch size 8 times larger for half as many optimization steps, thus seeing 4 times as many sequences in pretraining.
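As a rough summary of these design choices in code, a sketch with the HuggingFace `transformers` API (the library's RoBERTa defaults already reflect the paper; training-time choices such as dynamic masking and the 8K batch size live in the data pipeline and trainer, not in the config):

```python
from transformers import RobertaConfig, RobertaForMaskedLM, RobertaTokenizerFast

config = RobertaConfig(
    vocab_size=50265,              # larger byte-level BPE vocabulary (~50K)
    max_position_embeddings=514,   # room for 512-token FULL-SENTENCES inputs
)
# Masked-LM head only: no NSP objective.
model = RobertaForMaskedLM(config)
tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")  # byte-level BPE
```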

SOOJEONGKIMM commented 1 year ago

RoBERTa Conclusions: train the model longer with bigger batches over more data; remove the NSP objective; train on longer sequences; dynamically change the masking pattern applied to the training data.

SOTA results on GLUE, RACE, and SQuAD.

These results illustrate the importance of previously overlooked design decisions and show that BERT's masked language model pretraining objective remains competitive with recently proposed alternatives.