NVIDIA / DeepLearningExamples

State-of-the-Art Deep Learning scripts organized by models - easy to train and deploy with reproducible accuracy and performance on enterprise-grade infrastructure.

[Bert/Pytorch] pretraining FileNotFoundError #844

Closed: jzhang82119 closed this issue 3 years ago

jzhang82119 commented 3 years ago

I think I finished step 5 (/workspace/bert/data/create_datasets_from_start.sh) of the quick start guide. It took a whole day or so.

Now I am trying to run bash scripts/run_pretraining.sh benchmark.

Below is the output.

Container nvidia build = 13419386
Warning! /workspace/bert/data/hdf5_lower_case_1_seq_len_128_max_pred_20_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5_shard_1472_test_split_10/books_wiki_en_corpus/training/ directory missing. Training cannot start
/workspace/bert/data/hdf5_lower_case_1_seq_len_128_max_pred_20_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5_shard_1472_test_split_10/books_wiki_en_corpus/training/
Logs written to /workspace/bert/results/bert_lamb_pretraining.pyt_bert_pretraining_phase1_fp16_gbs65536.210227234326.log

Defaults for this optimization level are:
  enabled               : True
  opt_level             : O2
  cast_model_type       : torch.float16
  patch_torch_functions : False
  keep_batchnorm_fp32   : True
  master_weights        : True
  loss_scale            : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
  enabled               : True
  opt_level             : O2
  cast_model_type       : torch.float16
  patch_torch_functions : False
  keep_batchnorm_fp32   : True
  master_weights        : True
  loss_scale            : dynamic

Traceback (most recent call last):
  File "/workspace/bert/run_pretraining.py", line 678, in <module>
    args, final_loss, train_time_raw, global_step = main()
  File "/workspace/bert/run_pretraining.py", line 531, in main
    files = [os.path.join(args.input_dir, f) for f in os.listdir(args.input_dir) if
FileNotFoundError: [Errno 2] No such file or directory: '/workspace/bert/data/hdf5_lower_case_1_seq_len_128_max_pred_20_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5_shard_1472_test_split_10/books_wiki_en_corpus/training/'

DLL 2021-02-27 23:43:44.783208 - PARAMETER SEED : 12439
DLL 2021-02-27 23:43:44.783634 - PARAMETER train_start : True
DLL 2021-02-27 23:43:44.783717 - PARAMETER batch_size_per_gpu : 64
DLL 2021-02-27 23:43:44.783760 - PARAMETER learning_rate : 0.006

[the same FileNotFoundError traceback is printed by each of the other distributed worker processes]

Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/conda/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.6/site-packages/torch/distributed/launch.py", line 263, in <module>
    main()
  File "/opt/conda/lib/python3.6/site-packages/torch/distributed/launch.py", line 259, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['/opt/conda/bin/python3', '-u', '/workspace/bert/run_pretraining.py', '--local_rank=7', '--input_dir=/workspace/bert/data/hdf5_lower_case_1_seq_len_128_max_pred_20_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5_shard_1472_test_split_10/books_wiki_en_corpus/training/', '--output_dir=/workspace/bert/results/checkpoints', '--config_file=bert_config.json', '--bert_model=bert-large-uncased', '--train_batch_size=8192', '--max_seq_length=128', '--max_predictions_per_seq=20', '--max_steps=7038', '--warmup_proportion=0.2843', '--num_steps_per_checkpoint=200', '--learning_rate=6e-3', '--seed=12439', '--fp16', '--gradient_accumulation_steps=128', '--allreduce_post_accumulation', '--allreduce_post_accumulation_fp16', '--do_train', '--json-summary', '/workspace/bert/results/dllogger.json']' returned non-zero exit status 1.


Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.


finished pretraining
/workspace/bert/data/hdf5_lower_case_1_seq_len_512_max_pred_80_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5_shard_1472_test_split_10/books_wiki_en_corpus/training/
Logs written to /workspace/bert/results/bert_lamb_pretraining.pyt_bert_pretraining_phase2_fp16_gbs32768.210227234346.log

I have the following folders under the data directory. (Note that the folder name hdf5_lower_case_1_seq_len_128_max_pred_20_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5 differs from hdf5_lower_case_1_seq_len_128_max_pred_20_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5_shard_1472_test_split_10 in the script output.)

/workspace/bert# ls data
BooksDownloader.py  BookscorpusTextFormatting.py  Downloader.py  GLUEDownloader.py
GooglePretrainedWeightDownloader.py  NVIDIAPretrainedWeightDownloader.py  SquadDownloader.py
TextSharding.py  WikiDownloader.py  WikicorpusTextFormatting.py  __init__.py  __pycache__
bertPrep.py  create_datasets_from_start.sh  download  extracted  formatted_one_article_per_line
hdf5_lower_case_1_seq_len_128_max_pred_20_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5
hdf5_lower_case_1_seq_len_512_max_pred_80_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5
sharded_training_shards_256_test_shards_256_fraction_0.1  squad
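The immediate failure is os.listdir() on an --input_dir that does not exist. A quick pre-flight check along these lines (a sketch, not part of the repo; the path is copied from the error message above) confirms whether the shards are where run_pretraining.sh expects them:

    # Sanity check: confirm the phase-1 HDF5 training shards exist at the
    # path run_pretraining.sh will pass as --input_dir.
    EXPECTED=/workspace/bert/data/hdf5_lower_case_1_seq_len_128_max_pred_20_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5_shard_1472_test_split_10/books_wiki_en_corpus/training/
    if [ -d "$EXPECTED" ]; then
        echo "found $(ls "$EXPECTED" | wc -l) files in $EXPECTED"
    else
        echo "missing: $EXPECTED" >&2
        ls /workspace/bert/data    # compare with what data prep actually produced
    fi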

sharathts commented 3 years ago

Thank you for pointing this out (it is a name mismatch in run_pretraining.sh). This has been fixed in https://github.com/NVIDIA/DeepLearningExamples/pull/845
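For anyone hitting this before the fix lands: the mismatch is only in the dataset directory name the launch script constructs versus the one bertPrep.py actually writes. A minimal sketch of the alignment (the variable names below are illustrative, not necessarily the ones used in run_pretraining.sh):

    # Illustration only: variable names are hypothetical. The point is that
    # the directory the launcher builds must match what data prep wrote:
    #   looked for: hdf5_..._dupe_factor_5_shard_1472_test_split_10/...
    #   on disk:    hdf5_..._dupe_factor_5/...
    DATASET_DIR=hdf5_lower_case_1_seq_len_128_max_pred_20_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5
    DATA_DIR_PHASE1=/workspace/bert/data/${DATASET_DIR}/books_wiki_en_corpus/training/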

jzhang82119 commented 3 years ago

The subfolder also needs to be modified: wikicorpus_en is the lowest-level folder; there is no books_wiki_en_corpus/training/.

This is what I have: hdf5_lower_case_1_seq_len_128_max_pred_20_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5/wikicorpus_en/
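A depth-limited listing makes it easy to check which corpus subfolder data prep actually created, for example:

    # List the immediate subdirectories of the phase-1 HDF5 output to see
    # whether data prep wrote wikicorpus_en or books_wiki_en_corpus.
    find /workspace/bert/data/hdf5_lower_case_1_seq_len_128_max_pred_20_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5 \
        -maxdepth 1 -type d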

sharathts commented 3 years ago

Thank you. I have updated #845 to reflect the above fix as well.

sharathts commented 3 years ago

@jzhang82119 Also note that hdf5_lower_case_1_seq_len_128_max_pred_20_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5/books_wiki_en_corpus is the default path, assuming both Wikipedia and BookCorpus have been downloaded.

It looks like you downloaded Wikipedia only. In that case, you can change books_wiki_en_corpus in the path name to wikicorpus_en.
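One way to apply that substitution in place (a sketch; it assumes the literal string books_wiki_en_corpus appears in scripts/run_pretraining.sh):

    # Rewrite the corpus directory name for a Wikipedia-only setup;
    # -i.bak keeps a backup copy of the original script.
    sed -i.bak 's|books_wiki_en_corpus|wikicorpus_en|g' scripts/run_pretraining.sh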