Closed: Esaada closed this issue 4 years ago
Hi Esaada,
Unfortunately, Item 1 is outside of our control. The BookCorpus data is not hosted by NVIDIA, and the owners appear to be limiting downloads (likely by IP address/range). The download script is functioning correctly, and it does not stall on or retry individual books, since that would make downloading even a majority of the books less likely to succeed (the scripts do skip files that are already present on subsequent attempts). Depending on your downstream task/purpose, you can use only Wikipedia and still get reasonable results.
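(For illustration only: the skip-if-present behavior amounts to something like the sketch below. The URL list and file names here are hypothetical, not the actual downloader code.)

```python
import os
import urllib.request

# Hypothetical link list -- the real downloader builds this from the
# BookCorpus URL list, which is what is being rate-limited upstream.
books = [
    ("https://example.com/book_0001.txt", "download/book_0001.txt"),
    ("https://example.com/book_0002.txt", "download/book_0002.txt"),
]

os.makedirs("download", exist_ok=True)

for url, path in books:
    if os.path.exists(path):
        # Already fetched on a previous run: skip it instead of re-downloading.
        continue
    try:
        urllib.request.urlretrieve(url, path)
    except Exception as e:
        # A failed link is skipped rather than retried, so a blocked host
        # does not stall the rest of the download.
        print(f"skipping {url}: {e}")
```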
If you need book data, you can potentially pull from alternate sources, such as Project Gutenberg. Be sure to check that your use cases are permitted by the licensing terms (see their Terms of Use and their policy on automated downloads).
Item 2 is expected behavior and is due to an option controlling the size of the output chunks. The option (-b 100MB) is set in this line of code. You can remove the argument to fall back to the default of 1MB.
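(As a rough illustration, assuming the formatting step shells out to WikiExtractor: the chunk size is just its -b/--bytes argument. The paths below are placeholders, and the exact size string in the repo's line of code may differ slightly.)

```python
import subprocess

# Placeholder paths -- substitute your actual dump location and output directory.
wiki_dump = "wikicorpus_en.xml.bz2"
output_dir = "extracted/wikicorpus_en"

# A large chunk size (e.g. -b 100M) yields a few big files (hence only AA/AB);
# omitting -b falls back to WikiExtractor's 1MB default and produces many
# smaller files. The total extracted text is the same either way.
subprocess.run(
    ["python3", "WikiExtractor.py", wiki_dump, "-o", output_dir, "-b", "100M"],
    check=True,
)
```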
Item 3 is likely due to the memory capacity of your machine. Fitting the datasets in memory and operating on them can require a few hundred GB of RAM. Please let us know how much memory you have available, and we can provide some ideas to work around the sharding step.
Which BERT implementation are you using, TF or PyT?
Thanks, understood. About #2: so it means that no data is missing, right? About #3: I'm using AWS. I have a machine with 90 GB, so it makes sense; I have a 1.5 TB EBS volume, can I use it somehow? Oh wait, you mean RAM?! That means I need more than a few cores... What I'm using right now is a g4dn.xlarge, with 4 cores and 16 GiB. Is there anything I can do? I'm using TF.
Thanks again.
Item 2: Correct, no data should be missing (i.e., expect fewer extracted files with larger chunks)
Item 3: I hacked together a basic sharding scheme that should require almost no memory (link). The sharding scheme in the official example is beneficial when going to large scales (e.g., ~1500 GPUs) and is geared mainly towards the PyT implementation. This will likely get replaced in the future. The simplified version provided in the link should be fine for smaller scales (probably even up to 32-64 DGX2 nodes). You can comment out the sharding step in the official example script and call create_pretraining_data.py on the shards resulting from the simplified script.
There are some comments in the script provided to describe expected inputs, and there is a sample file to test the script on to show the output format. The linked script is shared under the same license as the NVIDIA Deep Learning Examples.
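(For a rough idea of the approach: a memory-light pass can simply stream the one-article-per-line file and deal lines out round-robin to N shard files. The sketch below is not the linked script itself; the file names and shard count are made up, and the real script documents its exact input/output format in its comments.)

```python
import argparse

parser = argparse.ArgumentParser(description="Stream an article-per-line corpus into N shards.")
parser.add_argument("--input_file", default="books_wiki_en_corpus_one_article_per_line.txt")
parser.add_argument("--num_shards", type=int, default=256)
parser.add_argument("--output_prefix", default="shard")
args = parser.parse_args()

# Open all shard files up front and deal articles out round-robin,
# so only one line is ever held in memory at a time.
shards = [open(f"{args.output_prefix}_{i:04d}.txt", "w", encoding="utf-8")
          for i in range(args.num_shards)]

with open(args.input_file, "r", encoding="utf-8") as fin:
    for n, line in enumerate(fin):
        if line.strip():  # skip blank separator lines, if any
            shards[n % args.num_shards].write(line)

for f in shards:
    f.close()
```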
Please let me know if that works for you.
Thanks, sounds interesting. One small question: the original script gets two files as input:
/workspace/bert/data/formatted_one_article_per_line/bookscorpus_one_book_per_line.txt
/workspace/bert/data/formatted_one_article_per_line/wikicorpus_en_one_article_per_line.txt
In the cheap script I see that there is one input file. Should I enter them separately?
You can concatenate those text files. This single text file can be run through the sharding script.
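(For example, a streaming concatenation that never loads either file fully into memory; the output file name below is just a suggestion.)

```python
import shutil

inputs = [
    "/workspace/bert/data/formatted_one_article_per_line/bookscorpus_one_book_per_line.txt",
    "/workspace/bert/data/formatted_one_article_per_line/wikicorpus_en_one_article_per_line.txt",
]
# Hypothetical output name -- anything the sharding script can point at works.
output = "/workspace/bert/data/formatted_one_article_per_line/books_wiki_en_corpus_one_article_per_line.txt"

with open(output, "wb") as fout:
    for path in inputs:
        with open(path, "rb") as fin:
            # Copy in chunks; if a file does not end with a newline you may
            # want to write one here so two articles are not merged.
            shutil.copyfileobj(fin, fout)
```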
Still getting "Killed" with no other warning or error. Perhaps it's a memory issue; I have 15 GB available. Is that still a RAM issue?
Hi, I'm trying to pretrain BERT Large, and I'm trying to download and preprocess the data. I have multiple issues:
1. Downloading - I'm getting a very low number of valid links in BookCorpus; after downloading I got only 250 txt files. I know this is a known problem, but 250?
2. TextFormatting - The Wikipedia dataset text formatting extracts only the AA and AB directories under the data/extracted/wikicorpus_en/ directory. The downloaded file size was 73G; is this normal?
3. Sharding - This step just dies. I'm running this script and getting:
data/create_datasets_from_start.sh: line 38: 60 Killed python3 ${BERT_PREP_WORKING_DIR}/bertPrep.py --action sharding --dataset books_wiki_en_corpus
where my input files are:
input file: /workspace/bert/data/formatted_one_article_per_line/bookscorpus_one_book_per_line.txt
input file: /workspace/bert/data/formatted_one_article_per_line/wikicorpus_en_one_article_per_line.txt
Both exist. Since the error is not informative at all, I have no idea how to take a step forward in solving this issue. Thanks a lot.