-
### GPT-3 data mix
* Datasets are not sampled in proportion to their size
* Datasets we view as higher-quality are sampled more frequently (a minimal sketch of such weighted sampling follows this list)
* WebText2, Books1, Wikipedia datasets are sampl…
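A minimal sketch of what this re-weighting can look like, assuming each corpus is just a list of documents; the dataset names mirror the GPT-3 mix, but the sizes and weights below are placeholders rather than the paper's actual values:

```python
import random

# Hypothetical corpora standing in for CommonCrawl, WebText2, Books1, Wikipedia.
# Sizes and contents are placeholders, not the real GPT-3 data.
corpora = {
    "common_crawl": [f"cc doc {i}" for i in range(1000)],
    "webtext2": [f"wt2 doc {i}" for i in range(100)],
    "books1": [f"books1 doc {i}" for i in range(50)],
    "wikipedia": [f"wiki doc {i}" for i in range(20)],
}

# Sampling weights chosen by perceived quality, NOT in proportion to size:
# small, high-quality corpora are over-sampled (and may repeat during training),
# while the huge crawl is effectively down-sampled.
weights = {
    "common_crawl": 0.60,
    "webtext2": 0.22,
    "books1": 0.10,
    "wikipedia": 0.08,
}

def sample_document(rng: random.Random) -> str:
    """Pick a corpus according to its weight, then a document uniformly within it."""
    names = list(weights)
    name = rng.choices(names, weights=[weights[n] for n in names], k=1)[0]
    return rng.choice(corpora[name])

rng = random.Random(0)
print([sample_document(rng) for _ in range(5)])
```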
-
Can you please provide a little tutorial about how to input the image and get the story?
Since you have used `bookcorpus`, could you also provide some guidance about how to train the model on di…
-
Hi team,
I'd like to know how to process BookCorpus for pre-training.
I am confused about how to process this data.
Should I treat one book as a document including all sentences, or one chapter as a docu…
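One common convention (I am not sure it is what this repo expects) is the input format of BERT's `create_pretraining_data.py`: one sentence per line, a blank line between documents, with each whole book treated as one document. A minimal sketch under that assumption, with a deliberately naive sentence splitter and a hypothetical `books/` directory of crawled `.txt` files:

```python
import re
from pathlib import Path

def naive_sentences(text: str):
    """Very rough sentence splitter; a real pipeline would use NLTK or spaCy."""
    for sent in re.split(r"(?<=[.!?])\s+", text):
        sent = sent.strip()
        if sent:
            yield sent

def write_pretraining_text(books, out_path):
    """Write one sentence per line, with a blank line between documents (here: whole books)."""
    with open(out_path, "w", encoding="utf-8") as f:
        for book_text in books:
            for sent in naive_sentences(book_text):
                f.write(sent + "\n")
            f.write("\n")  # document boundary

# Hypothetical layout: each crawled book saved as books/<title>.txt
books = (p.read_text(encoding="utf-8") for p in sorted(Path("books").glob("*.txt")))
write_pretraining_text(books, "bookcorpus_pretraining.txt")
```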
-
Hello,
I was wondering if you could share the books corpus, as the crawling takes pretty long. I was using this:
https://github.com/soskek/bookcorpus
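If the crawl itself is the only blocker, one possible workaround is pulling a pre-assembled BookCorpus variant from the Hugging Face Hub instead; this assumes such a substitute is acceptable for your use case, since its contents and licensing are not identical to a fresh Smashwords crawl:

```python
from datasets import load_dataset

# Pulls a pre-assembled BookCorpus variant from the Hugging Face Hub instead of
# crawling; contents may differ from what soskek/bookcorpus produces today.
bookcorpus = load_dataset("bookcorpus", split="train")

print(bookcorpus)             # row count and features
print(bookcorpus[0]["text"])  # first example
```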
-
I pretrained both BERT uncased and BERT cased models using the same hyperparameters (for the uncased model) on Wikipedia and BookCorpus, but the BERT cased models perform worse than the Google check…
-
Hi. I am trying to convert corpora from HF to their IPA form with the following snippet, but I am getting really slow speeds, only a couple of examples per second. Do you know how it can be sped up? …
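Without seeing the snippet, the usual speed-ups for this pattern are batched mapping and multiple worker processes via `Dataset.map`. A rough sketch, where `to_ipa` is a stand-in for whatever grapheme-to-IPA conversion the snippet actually uses:

```python
from datasets import load_dataset

def to_ipa(text: str) -> str:
    """Stand-in for the real grapheme-to-IPA conversion from the snippet."""
    return text  # replace with the actual G2P call

def convert_batch(batch):
    # With batched=True, `batch` holds lists of column values, so the Python
    # overhead of map() is paid once per batch instead of once per example.
    return {"ipa": [to_ipa(t) for t in batch["text"]]}

ds = load_dataset("bookcorpus", split="train")
ds = ds.map(
    convert_batch,
    batched=True,
    batch_size=1_000,  # tune to your memory budget
    num_proc=8,        # parallel worker processes; match your CPU cores
)
print(ds[0]["ipa"])
```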
-
```bash
# dataset/download_books.sh
# Fetches pre-processed BookCorpus data files hosted for GPT-NeoX pre-training.
wget https://the-eye.eu/public/AI/pile_neox/data/BookCorpusDataset_text_document.bin
wget https://the-eye.eu/public/AI/pile_neox/data/BookCorpusDataset_text_docu…
```
-
I am trying to follow the example here
https://www.deepspeed.ai/tutorials/bert-pretraining/
The section on getting the datasets says 'Note: Downloading and pre-processing instructions are coming…
-
```
Traceback (most recent call last):
  File "E:\RetroMAE-master\RetroMAE-master\examples\pretrain\preprocess.py", line 158, in
    wiki = create_wiki_data(args.tokenizer_name, args.max_seq_length, ar…
```
-
I know BERT has achieved SOTA in many NLP tasks, such as SQuAD and SWAG.
But note that the data (both training and test) of SQuAD is from **Wikipedia**, and that of SWAG is from the **BookCorpus**, a…