NVIDIA / DeepLearningExamples

State-of-the-Art Deep Learning scripts organized by models - easy to train and deploy with reproducible accuracy and performance on enterprise-grade infrastructure.

[DLRM/TensorFlow2] The --dummy_model option For Training Does Not Seem To Work #1317

Open psharpe99 opened 1 year ago

psharpe99 commented 1 year ago

Related to DLRM/TensorFlow2

Describe the bug
I generated synthetic data as per the README instructions and tried to perform a training run. This failed when creating a DLRM training pipeline, for which a separate issue has been created. Looking at the main.py code, it supports a `--dummy_model` command-line option:

```python
flags.DEFINE_bool("dummy_model", default=False, help="Use a dummy model for benchmarking and debugging")
```

which is used to create a dummy DLRM object:

```python
if FLAGS.dummy_model:
    dlrm = DummyDlrm(FLAGS=FLAGS, dataset_metadata=dataset_metadata)
else:
    dlrm = Dlrm(FLAGS=FLAGS, dataset_metadata=dataset_metadata)
```

which is then used to create a train_pipeline.

I thought this might get me past the zero-sized train_pipeline produced by the non-dummy DLRM object: this is just a proof-of-process experiment with the DL examples, so I'm not too worried about the model output, accuracy, etc. However, the dummy DLRM object does not seem to include all of the required internal fields: in particular, the training run complains that `local_table_ids` is not an attribute of the dummy object.
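To show what I mean, the untested sketch below is roughly the kind of local patch I was considering to get past the missing attribute. The contents of `local_table_ids` (and the `categorical_cardinalities` attribute name) are guesses on my part, not taken from the repository:

```python
# Untested local workaround sketch (not a proposed fix): give the dummy model
# the attribute that create_input_pipelines() later expects. I am guessing that
# local_table_ids is the list of embedding-table indices owned by this worker,
# and that dataset_metadata exposes categorical_cardinalities; both guesses
# may be wrong.
if FLAGS.dummy_model:
    dlrm = DummyDlrm(FLAGS=FLAGS, dataset_metadata=dataset_metadata)
    if not hasattr(dlrm, "local_table_ids"):
        # Single-rank run, so assume every embedding table is "local".
        dlrm.local_table_ids = list(range(len(dataset_metadata.categorical_cardinalities)))
else:
    dlrm = Dlrm(FLAGS=FLAGS, dataset_metadata=dataset_metadata)
```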

To Reproduce

```
root@7cf568793223:/dlrm# horovodrun -np 1 -H localhost:1 --mpi-args=--oversubscribe numactl --interleave=all -- python -u main.py --dataset_path /data/converted --amp --xla --dummy_model
2023-06-26 10:57:38.882194: I tensorflow/core/util/util.cc:169] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0.
[1,0]:2023-06-26 10:57:41.183238: I tensorflow/core/util/util.cc:169] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0.
[1,0]:DLL 2023-06-26 10:57:43.301103 - PARAMETER logtostderr : False alsologtostderr : False log_dir : v : 0 verbosity : 0 logger_levels : {} stderrthreshold : fatal showprefixforinfo : True run_with_pdb : False pdb_post_mortem : False pdb : False run_with_profiling : False profile_file : None use_cprofile_for_profiling : True only_check_args : False op_conversion_fallback_to_while_loop : True runtime_oom_exit : True hbm_oom_exit : True test_srcdir : test_tmpdir : /tmp/absl_testing test_random_seed : 301 test_randomize_ordering_seed : xml_output_file : mode : train learning_rate : 24.0 batch_size : 65536 run_eagerly : False dummy_model : True top_mlp_dims : [1024, 1024, 512, 256, 1] bottom_mlp_dims : [512, 256, 128] optimizer : sgd save_checkpoint_path : None restore_checkpoint_path : None saved_model_output_path : None save_input_signature : False saved_model_input_path : None cpu : False amp : True fp16 : False xla : True loss_scale : 1024 auc_thresholds : 8000 epochs : 1 max_steps : -1 embedding_trainable : True dot_interaction : custom_cuda embedding_dim : 128 evals_per_epoch : 1 print_freq : 100.0 warmup_steps : 8000 decay_start_step : 48000 decay_steps : 24000 profiler_start_step : None profiled_rank : 1 inter_op_parallelism : None intra_op_parallelism : None dist_strategy : memory_balanced use_merlin_de_embeddings : False column_slice_threshold : 10000000000 log_path : dlrm_tf_log.json dataset_path : /data/converted feature_spec : feature_spec.yaml dataset_type : tf_raw synthetic_dataset_use_feature_spec : False synthetic_dataset_train_batches : 64008 synthetic_dataset_valid_batches : 1350 synthetic_dataset_cardinalities : [1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000] synthetic_dataset_num_numerical_features : 13 ? : False help : False helpshort : False helpfull : False helpxml : False
[1,0]:Command line flags:
[1,0]:{ "logtostderr": false, "alsologtostderr": false, "log_dir": "", "v": 0, "verbosity": 0, "logger_levels": {}, "stderrthreshold": "fatal", "showprefixforinfo": true, "run_with_pdb": false, "pdb_post_mortem": false, "pdb": false, "run_with_profiling": false, "profile_file": null, "use_cprofile_for_profiling": true, "only_check_args": false, "op_conversion_fallback_to_while_loop": true, "runtime_oom_exit": true, "hbm_oom_exit": true, "test_srcdir": "", "test_tmpdir": "/tmp/absl_testing", "test_random_seed": 301, "test_randomize_ordering_seed": "", "xml_output_file": "", "mode": "train", "learning_rate": 24.0, "batch_size": 65536, "run_eagerly": false, "dummy_model": true, "top_mlp_dims": [1024, 1024, 512, 256, 1], "bottom_mlp_dims": [512, 256, 128], "optimizer": "sgd", "save_checkpoint_path": null, "restore_checkpoint_path": null, "saved_model_output_path": null, "save_input_signature": false, "saved_model_input_path": null, "cpu": false, "amp": true, "fp16": false, "xla": true, "loss_scale": 1024, "auc_thresholds": 8000, "epochs": 1, "max_steps": -1, "embedding_trainable": true, "dot_interaction": "custom_cuda", "embedding_dim": 128, "evals_per_epoch": 1, "print_freq": 100.0, "warmup_steps": 8000, "decay_start_step": 48000, "decay_steps": 24000, "profiler_start_step": null, "profiled_rank": 1, "inter_op_parallelism": null, "intra_op_parallelism": null, "dist_strategy": "memory_balanced", "use_merlin_de_embeddings": false, "column_slice_threshold": 10000000000, "log_path": "dlrm_tf_log.json", "dataset_path": "/data/converted", "feature_spec": "feature_spec.yaml", "dataset_type": "tf_raw", "synthetic_dataset_use_feature_spec": false, "synthetic_dataset_train_batches": 64008, "synthetic_dataset_valid_batches": 1350, "synthetic_dataset_cardinalities": [1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000], "synthetic_dataset_num_numerical_features": 13, "?": false, "help": false, "helpshort": false, "helpfull": false, "helpxml": false }
[1,0]:INFO:tensorflow:Mixed precision compatibility check (mixed_float16): OK
[1,0]:Your GPUs will likely run quickly with dtype policy mixed_float16 as they all have compute capability of at least 7.0
[1,0]:I0626 10:57:43.303895 139923971999552 device_compatibility_check.py:123] Mixed precision compatibility check (mixed_float16): OK
[1,0]:Your GPUs will likely run quickly with dtype policy mixed_float16 as they all have compute capability of at least 7.0
[1,0]:2023-06-26 10:57:43.316639: I tensorflow/core/platform/cpu_feature_guard.cc:194] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: SSE3 SSE4.1 SSE4.2 AVX
[1,0]:To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
[1,0]:2023-06-26 10:57:44.161367: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 37981 MB memory: -> device: 0, name: NVIDIA A100-PCIE-40GB, pci bus id: 0000:31:00.0, compute capability: 8.0
[1,0]:Traceback (most recent call last):
[1,0]:  File "main.py", line 365, in <module>
[1,0]:    app.run(main)
[1,0]:  File "/usr/local/lib/python3.8/dist-packages/absl/app.py", line 312, in run
[1,0]:    _run_main(main, args)
[1,0]:  File "/usr/local/lib/python3.8/dist-packages/absl/app.py", line 258, in _run_main
[1,0]:    sys.exit(main(argv))
[1,0]:  File "main.py", line 254, in main
[1,0]:    train_pipeline, validation_pipeline = create_input_pipelines(FLAGS, dlrm.local_table_ids)
[1,0]:AttributeError: 'DummyDlrm' object has no attribute 'local_table_ids'

Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted.

mpirun detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:

  Process name: [[38209,1],0]
  Exit code: 1
```
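Incidentally, for anyone who wants to dig further, a quick grep inside the container should show where `DummyDlrm` is defined and where `local_table_ids` is consumed; I'm noting it here for reference rather than claiming any particular result:

```
root@7cf568793223:/dlrm# grep -rn "class DummyDlrm" .
root@7cf568793223:/dlrm# grep -rn "local_table_ids" .
```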

Expected behavior
Since this appears to be a supported option in main.py:

```python
flags.DEFINE_bool("dummy_model", default=False, help="Use a dummy model for benchmarking and debugging")
```

I expected it to allow a training run over my generated synthetic data.

Environment

```
root@7cf568793223:/dlrm# cat /etc/os-release
NAME="Ubuntu"
VERSION="20.04.4 LTS (Focal Fossa)"

root@7cf568793223:/dlrm# python --version
Python 3.8.10

root@7cf568793223:/dlrm# nvidia-smi
Mon Jun 26 11:10:45 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.116.04   Driver Version: 525.116.04   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-PCI...  Off  | 00000000:31:00.0 Off |                    0 |
| N/A   34C    P0    34W / 250W |      0MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-PCIE...  Off  | 00000000:4B:00.0 Off |                    0 |
| N/A   37C    P0    26W / 250W |      0MiB / 16384MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
```

tgrel commented 1 year ago

Hi @psharpe99,

The dummy model functionality is no longer supported in the current release. To be fair, it would not have fixed the pipeline issue you're referring to in the other issue. However, the redesigned synthetic dataset functionality should unblock you. You can use it by following the Quick Start Guide.
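Roughly, such a run looks like the command you already used, just pointed at the synthetic dataset type instead of a converted on-disk dataset; I'm sketching this from memory, so please treat the Quick Start Guide as the authoritative reference for the exact invocation:

```
horovodrun -np 1 -H localhost:1 --mpi-args=--oversubscribe numactl --interleave=all -- \
    python -u main.py --dataset_type synthetic --amp --xla
```

The `synthetic_dataset_*` flags visible in your log control the shape of the generated data, so no `--dataset_path` should be needed in that mode.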