Closed rakeshchada closed 3 years ago
❓ Questions and Help

What is your question?

Hi, on what corpus is the publicly available BART-base model pre-trained? It is not explicitly mentioned in the BART paper, but the phrase below from the paper suggests to me that the pre-training corpus used for BART-base is the same as the one used for pre-training BERT-base (Wikipedia + BooksCorpus). Is that correct?

"For reference, we compare our implementations with published numbers from BERT, which was also trained for 1M steps on a combination of books and Wikipedia data."

I'd appreciate it if you could clarify this. Tagging @sshleifer.

I think section 3.2 of the RoBERTa paper (https://arxiv.org/pdf/1907.11692.pdf) has the pertinent info. I pasted it below and added some markdown:

> BERT-style pretraining crucially relies on large quantities of text. Baevski et al. (2019) demonstrate that increasing data size can result in improved end-task performance. Several efforts have trained on datasets larger and more diverse than the original BERT (Radford et al., 2019; Yang et al., 2019; Zellers et al., 2019). Unfortunately, not all of the additional datasets can be publicly released. For our study, we focus on gathering as much data as possible for experimentation, allowing us to match the overall quality and quantity of data as appropriate for each comparison. We consider five English-language corpora of varying sizes and domains, totaling over 160GB of uncompressed text. We use the following text corpora:

Hi @sshleifer, thanks for sharing. However, I think the info you've provided is for the RoBERTa model. My question is about the corpus used to pre-train the "bart.base" model that is publicly released here.

bart-base was also trained on this dataset.