-
The BookCorpus download page at http://www.cs.toronto.edu/~mbweb/ returns a 403.
Can you provide a mirror, or at the very least give a few lines as an example of how the dataset needs to be formatted…
-
## Description
BookCorpus now has a reliable, stable download link: https://the-eye.eu/public/AI/pile_preliminary_components/books1.tar.gz. There are also more links at https://the-eye…
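For reference, a minimal sketch of fetching and unpacking that mirror. The archive layout (one plain-text file per book) is an assumption here, not something this snippet verifies:

```python
import glob
import tarfile
import urllib.request

# URL from the post above; the archive is several GB, so this takes a while.
URL = "https://the-eye.eu/public/AI/pile_preliminary_components/books1.tar.gz"
urllib.request.urlretrieve(URL, "books1.tar.gz")

# Assumption: the archive unpacks to one plain-text file per book.
with tarfile.open("books1.tar.gz", "r:gz") as tar:
    tar.extractall("bookcorpus")

books = glob.glob("bookcorpus/**/*.txt", recursive=True)
print(f"{len(books)} text files, e.g. {books[:3]}")
```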
-
Hi guys,
I have been trying to run the Bing experiment, but it seems I can't for now.
```
"datasets": {
--
| "wiki_pretrain_dataset": "/data/bert/bnorick_format/128/wiki_pretrain",
| "bc_pr…
-
Better to have something like 200 GB of disk for that and at least 4 CPUs (expected time ~4 days; the more CPUs, the faster).
Since we have the original results, we can compare our results against them and have some sanity…
-
Refer to README.md, search for BookCorpus, and click on the link. It will redirect you [here](https://yknzhu.wixsite.com/mbweb), and you will notice that BookCorpus is no longer available from the original au…
-
### GPT-3 data mix
* Datasets are not sampled in proportion to their size
* Datasets we view as higher-quality are sampled more frequently (see the sketch after this list)
* The WebText2, Books1, and Wikipedia datasets are sampled…
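As a rough illustration of the idea (the weights below are placeholders, not the exact GPT-3 figures): each training example first picks a source dataset by a fixed mixture weight rather than by dataset size, so small high-quality sources are seen for more epochs.

```python
import random

# Illustrative mixture weights only; smaller but higher-quality sources
# get a larger share than their raw size would give them.
MIX = {
    "common_crawl": 0.60,
    "webtext2": 0.22,
    "books1": 0.08,
    "books2": 0.08,
    "wikipedia": 0.02,
}

def sample_source(rng=random):
    """Pick which dataset the next training example is drawn from."""
    names = list(MIX)
    weights = [MIX[n] for n in names]
    return rng.choices(names, weights=weights, k=1)[0]

# With these weights, Wikipedia examples show up far more often per byte of
# data than Common Crawl examples, i.e. the small sets are repeated more.
counts = {n: 0 for n in MIX}
for _ in range(100_000):
    counts[sample_source()] += 1
print(counts)
```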
-
The dataset I am using is BookCorpus, which has 18,000 books.
The system I am training on has 64 GB of RAM.
When I am trying to generate the pretraining data using create_pretraining_data.py, it is g…
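Since create_pretraining_data.py builds all instances in memory before writing, one common workaround is to split the corpus into shards and run the script once per shard. A rough sketch, assuming the flag names from the google-research/bert version of the script and hypothetical paths:

```python
import glob
import subprocess

# Hypothetical paths: a pre-split corpus (one shard per file) and a BERT vocab.
shards = sorted(glob.glob("bookcorpus/shards/shard_*.txt"))

for i, shard in enumerate(shards):
    # One invocation per shard keeps peak memory bounded by the shard size
    # instead of the whole 18k-book corpus.
    subprocess.run(
        [
            "python", "create_pretraining_data.py",
            f"--input_file={shard}",
            f"--output_file=tf_examples/shard_{i:04d}.tfrecord",
            "--vocab_file=uncased_L-12_H-768_A-12/vocab.txt",
            "--do_lower_case=True",
            "--max_seq_length=128",
            "--max_predictions_per_seq=20",
            "--masked_lm_prob=0.15",
            "--dupe_factor=5",
        ],
        check=True,
    )
```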
-
## ❓ Questions and Help
**Description**
Hi, I am training a quick-thoughts task, dot(sent1 ~ sent2) and dot(sent2 ~ sent2), but my dataset is 12 GB and it throws a memory error once it crosses 575 GB in memor…
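One way around this is to stream the pairs from disk instead of loading everything up front. A minimal sketch using a PyTorch IterableDataset, with a hypothetical tab-separated file format and path:

```python
from torch.utils.data import DataLoader, IterableDataset

class SentencePairStream(IterableDataset):
    """Yield tab-separated sentence pairs one line at a time, so the 12 GB
    file never has to be resident in memory all at once."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                sent1, sent2 = line.rstrip("\n").split("\t")
                yield sent1, sent2

# Hypothetical path and format; peak memory stays close to one batch
# rather than the whole dataset.
loader = DataLoader(SentencePairStream("pairs.tsv"), batch_size=256)
```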
-
I find that `BERT` uses `BookCorpus (800M words)` and `Wikipedia (2500M words)` but `GPT` only uses `BookCorpus`; even though `BERT` has a more complex model structure, which may affect its representation ability, t…
-
You can download it here: https://twitter.com/theshawwn/status/1301852133319294976?s=21
It contains 18k plain-text files. The results are very high quality. I spent about a week fixing the epub2txt…