Closed: paulhendricks closed this issue 4 years ago
You can just ignore the BookCorpus files that are missing. They don't exist on the web anymore.
Hi @swethmandava, the script run_pretraining_lamb.sh throws a bunch of errors because it still references these datasets. Is there another script for pre-training BERT with the datasets that are still available?
Could you open another bug with details of your errors? @vilmara
I have found a way to work with only the English Wikipedia dataset, which is still available, and to ignore the BookCorpus dataset.
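For anyone landing here later, the idea of ignoring the missing files can be sketched roughly as follows. This is only a minimal illustration, not the repository's actual data-preparation code: the helper name `existing_inputs` and any file names passed to it are hypothetical. It keeps only the candidate input files that are actually present on disk, so a later step never sees the BookCorpus paths.

```python
from pathlib import Path
from typing import Iterable, List


def existing_inputs(candidates: Iterable[str]) -> List[Path]:
    """Return only the candidate paths that exist and are non-empty files.

    Hypothetical helper: the repository's scripts may select their
    inputs differently.
    """
    kept: List[Path] = []
    for name in candidates:
        path = Path(name)
        # Keep the file only if it is present and has content;
        # anything else (e.g. a BookCorpus file that never downloaded) is skipped.
        if path.is_file() and path.stat().st_size > 0:
            kept.append(path)
        else:
            print(f"skipping missing or empty input: {path}")
    return kept
```

Passing a list that mixes Wikipedia and BookCorpus paths would return just the ones that exist, which could then be handed to whatever step builds the pre-training shards.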
Related to Model/Framework(s) PyTorch/LanguageModeling/BERT
Describe the bug BookCorpus is no longer available from Smashwords.
To Reproduce
The following works perfectly.
However, errors start here:
Expected behavior
BookCorpus should download. This looks similar to:
It looks like https://www.smashwords.com/ has stepped up its anti-crawling measures. In fact, after I attempted the download, my IP address was blocked from their website. Users should be warned about this before we ask them to download the BookCorpus dataset, lest they get banned without knowing the risk.
Environment Please provide at least:
Git commit: c76880b0fb211671b83bec47576305d424617009
Container version (e.g. pytorch:19.05-py3):
GPUs in the system: (e.g. 8x Tesla V100-SXM2-16GB): 4x Tesla V100, DGX Station
CUDA driver version (e.g. 418.67): 418.126.02