NVIDIA / DeepLearningExamples

State-of-the-Art Deep Learning scripts organized by models - easy to train and deploy with reproducible accuracy and performance on enterprise-grade infrastructure.

[ELECTRA/TF2] Option To Allow scripts/run_pretraining.sh To Use "wiki_only" #1319

Open psharpe99 opened 1 year ago

psharpe99 commented 1 year ago

Related to ELECTRA/TF2

Is your feature request related to a problem? Please describe.

The README shows that the datasets can be created from wiki-only:

/workspace/electra/data/create_datasets_from_start.sh wiki_books

but when you then continue to pretrain using the README instruction

bash scripts/run_pretraining.sh

it complains about the file/directory not existing. Looking at the run_pretraining.sh script, it has

DATASET_P1="tfrecord_lower_case_1_seq_len_128_random_seed_12345/books_wiki_en_corpus/train/pretrain_data" # change this for other datasets
DATASET_P2="tfrecord_lower_case_1_seq_len_512_random_seed_12345/books_wiki_en_corpus/train/pretrain_data" # change this for other datasets

which are preset to the books_wiki directory, with the comment that these need to be (manually) "changed" for other datasets (e.g. wiki-only).

Changing these manually to the 'wikicorpus_en' directory allowed the pretraining to succeed, but the script ideally shouldn't need editing.

Describe the solution you'd like

It should be a simple change to add a command-line option for "wiki-only" to the run_pretraining.sh script, so that it selects the matching dataset directory without being edited.
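As a rough illustration of what such an option could look like, here is a minimal sketch, not a tested patch against the repository. The helper name dataset_dir_for and the DATASET_NAME variable are hypothetical, as are the accepted values wiki_books and wiki_only (borrowed from the dataset-creation step). The two directory names are the ones mentioned above. How the value actually reaches the script (a new positional argument, a flag, or an environment variable) would depend on the script's existing argument handling, which is not shown here, so an environment variable is assumed.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map a dataset name to the corpus directory that
# appears in the DATASET_P1 / DATASET_P2 paths.
dataset_dir_for() {
  case "$1" in
    wiki_books) echo "books_wiki_en_corpus" ;;
    wiki_only)  echo "wikicorpus_en" ;;
    *)
      echo "unknown dataset: $1 (expected wiki_books or wiki_only)" >&2
      return 1
      ;;
  esac
}

# Assumption: the dataset name arrives via an environment variable.
# Defaulting to wiki_books keeps the current behaviour for existing users.
DATASET_NAME="${DATASET_NAME:-wiki_books}"
CORPUS_DIR="$(dataset_dir_for "$DATASET_NAME")" || exit 1

# Same paths as in the script, with only the corpus directory parameterised.
DATASET_P1="tfrecord_lower_case_1_seq_len_128_random_seed_12345/${CORPUS_DIR}/train/pretrain_data"
DATASET_P2="tfrecord_lower_case_1_seq_len_512_random_seed_12345/${CORPUS_DIR}/train/pretrain_data"
```

With something like this in place, a wiki-only run might be started as DATASET_NAME=wiki_only bash scripts/run_pretraining.sh, while the plain invocation would keep using the books_wiki paths.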

Describe alternatives you've considered

Alternatively, the README should document that this script file needs to be edited when running from wiki data only.

Additional context

None