facebookresearch / fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.

Which pre-training corpus is used for BART-base? #3550

Closed: rakeshchada closed this issue 3 years ago

rakeshchada commented 3 years ago

❓ Questions and Help

What is your question?

Hi, what corpus was the publicly available BART-base model pre-trained on? It isn't explicitly stated in the BART paper, but the phrase below from the paper suggests to me that the pre-training corpus used for BART-base is the same as the one used for pre-training BERT-base (Wikipedia + BooksCorpus). Is that correct?

"For reference, we compare our implementations with published numbers from BERT, which was also trained for 1M steps on a combination of books and Wikipedia data."

I'd appreciate it if you could clarify this. Tagging @sshleifer.

sshleifer commented 3 years ago

I think Section 3.2 of the RoBERTa paper has the pertinent info: https://arxiv.org/pdf/1907.11692.pdf.

I pasted it below and added some markdown:

3.2 Data

BERT-style pretraining crucially relies on large quantities of text. Baevski et al. (2019) demonstrate that increasing data size can result in improved end-task performance. Several efforts have trained on datasets larger and more diverse than the original BERT (Radford et al., 2019; Yang et al., 2019; Zellers et al., 2019). Unfortunately, not all of the additional datasets can be publicly released. For our study, we focus on gathering as much data as possible for experimentation, allowing us to match the overall quality and quantity of data as appropriate for each comparison. We consider five English-language corpora of varying sizes and domains, totaling over 160GB of uncompressed text. We use the following text corpora:

[The paper goes on to list the corpora: BookCorpus plus English Wikipedia (16GB, the data used to train the original BERT), CC-News (76GB), OpenWebText (38GB), and Stories (31GB).]

rakeshchada commented 3 years ago

Hi @sshleifer, thanks for sharing. However, I think the info you've provided is for the RoBERTa model. My question is about the corpus used to pre-train the "bart.base" model that is publicly released here.
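
For context, the checkpoint being asked about is the one distributed with fairseq's BART examples. A minimal sketch of loading it, assuming PyTorch and fairseq are installed and using the `bart.base` hub name listed in fairseq's BART docs (downloads the checkpoint on first use):

```python
import torch

# Load the released bart.base checkpoint via fairseq's torch.hub entry point.
bart = torch.hub.load('pytorch/fairseq', 'bart.base')
bart.eval()  # disable dropout for inference

tokens = bart.encode('Hello world!')      # BPE-encode and add special tokens
features = bart.extract_features(tokens)  # last-layer hidden states
print(features.shape)                     # e.g. torch.Size([1, 5, 768]) for bart.base
```
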

sshleifer commented 3 years ago

bart.base was also trained on this dataset (the same ~160GB corpus described in Section 3.2 of the RoBERTa paper).