HabanaAI / Gaudi-tutorials

Tutorials for running models on First-gen Gaudi and Gaudi2 for Training and Inference. The source files for the tutorials on https://developer.habana.ai/
https://developer.habana.ai/tutorials/

Habana Bert Training on Aws Gaudi1 #1

Open rajeshitshoulders opened 1 year ago

rajeshitshoulders commented 1 year ago

Hi, I need to run the MLPerf 2.0 Intel-Habana BERT training on an AWS Gaudi1 processor with the image Deep Learning AMI Habana PyTorch 1.12.0 SynapseAI 1.6.0 (Ubuntu 20.04) 20220928. I followed the README below.

Readme: https://github.com/mlcommons/training_results_v2.0/tree/main/Intel-HabanaLabs/benchmarks
Dataset: https://github.com/mlcommons/training/tree/master/language_model/tensorflow/bert#download-and-preprocess-datasets
AMI: AWS Deep Learning AMI Habana PyTorch 1.12.0 SynapseAI 1.6.0 (Ubuntu 20.04) 20220928
VM: Gaudi1, 742 GB RAM, 96 cores

I'm having trouble converting the dataset into TFRecords, and the packing script pack_pretraining_data_tfrec never succeeds because it runs out of memory.

To convert the unzipped dataset (results_text.zip) into TFRecords, I ran pretraining/create_pretraining_data.py with the --input_file=/root/datasets/results4/part-0000* option, and it took almost all of the available 742 GB of memory plus swap.

So I converted each part file into a TFRecord one at a time, using a for loop.
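The per-file conversion described above can be sketched roughly as follows. This is a minimal sketch, not the exact commands used in this thread: the `--output_file` flag, the `.tfrecord` suffix, the output directory, and the `build_commands` helper are assumptions for illustration, and any other flags the script requires would go in `extra_args`.

```python
import glob
import subprocess
from pathlib import Path


def build_commands(input_glob, output_dir,
                   script="pretraining/create_pretraining_data.py",
                   extra_args=()):
    """Build one conversion command per part file.

    Running the script once per file keeps only a single part in memory,
    instead of handing it a glob that matches every part at once.
    """
    commands = []
    for part in sorted(glob.glob(input_glob)):
        # Assumed naming scheme: <output_dir>/<part name>.tfrecord
        out = Path(output_dir) / (Path(part).name + ".tfrecord")
        commands.append([
            "python3", script,
            f"--input_file={part}",
            f"--output_file={out}",  # assumed flag name, check the script's --help
            *extra_args,
        ])
    return commands


def run_all(commands):
    """Run the commands strictly one after another, stopping on the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Only prints the commands; call run_all(...) to actually execute them.
    for cmd in build_commands("/root/datasets/results4/part-0000*", "/root/datasets/tfrecords"):
        print(" ".join(cmd))
```

Printing the commands first makes it easy to check the file list before starting a long conversion.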

To use the packed method in training, I ran the pack_pretraining_data_tfrec script on the TFRecords with --max-files set to 10 (default 100). The script appears to load all TFRecords into memory sequentially before it starts packing and creating the strategy files, so it fills up all 742 GB of available memory and the packing fails.

So I tried the unpacked method instead. For that I converted the TFRecords into binary files using the record_to_binary script from the Graphcore v1.0 submission (https://github.com/mlcommons/training_results_v1.0/tree/master/Graphcore/benchmarks/bert/implementations/popart/bert_data). When I run the training process, I get corrupt data.

Questions:
1/ Is it the right procedure to convert the dataset part files into TFRecords one at a time?
2/ Is it the right procedure to convert the TFRecords part-000* into binary files? Can the resulting part-*** files be used for the unpacked method?
3/ How do I limit max-files to 10 or 25 files in packing?

Please advise if there is any alternative method to pack the BERT wiki dataset for MLPerf v2.0 BERT training on Gaudi.

greg-serochi commented 1 year ago

Hi @rajeshitshoulders, thanks for posting this question. We'll investigate and respond soon.

In the future it's best to post these types of questions to our user forum https://forum.habana.ai.

rajeshitshoulders commented 1 year ago

Hi Greg, any update on this?

greg-serochi commented 1 year ago

Still investigating. Sorry for the delay.

greg-serochi commented 1 year ago

Hi @rajeshitshoulders, please follow the instructions from Habana's site here: https://github.com/HabanaAI/Model-References/tree/master/MLPERF2.1/Habana/benchmarks#training-data-for-pytorch-bert

We use packed data for the dataset for best performance. A DL1 instance is not needed for this data preprocessing, so you can use a CPU-based instance with 1 TB+ of memory and larger disk storage to process the data.

More background here: https://developer.habana.ai/tutorials/tensorflow/data-packing-process-for-mlperf-bert/
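To illustrate what packing means here, the sketch below combines several short sequences into one slot of up to 512 tokens so that less of each slot is padding. It uses a plain greedy first-fit-decreasing heuristic chosen only for illustration; it is not the algorithm implemented by pack_pretraining_data_tfrec, which may use a different strategy, and `pack_lengths` is a hypothetical helper.

```python
def pack_lengths(lengths, max_len=512):
    """Group sequence lengths into packs whose total length is at most max_len.

    Greedy first-fit-decreasing: visit sequences from longest to shortest and
    put each one into the first pack that still has enough room.
    """
    packs = []  # each pack is a list of sequence lengths
    free = []   # remaining token budget of each pack
    for n in sorted(lengths, reverse=True):
        if n > max_len:
            raise ValueError(f"sequence of length {n} exceeds max_len={max_len}")
        for i, room in enumerate(free):
            if n <= room:
                packs[i].append(n)
                free[i] -= n
                break
        else:
            # No existing pack has room, so open a new one.
            packs.append([n])
            free.append(max_len - n)
    return packs


if __name__ == "__main__":
    packs = pack_lengths([512, 300, 200, 128, 100, 60])
    print(packs)  # [[512], [300, 200], [128, 100, 60]]
```

In this toy input, six sequences fit into three 512-token slots, which is why packed data needs fewer training steps per pass over the dataset.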

rajeshitshoulders commented 1 year ago

Thank you Greg, I'm working on the dataset preparation as instructed.

rajeshitshoulders commented 1 year ago

Hi Greg, I was able to prepare and pack the dataset as instructed, and ran the MLPerf 2.0 training code on DL1 instances. We hit the error below during evaluation, but our developer was able to fix it so that evaluation completes.

[1,1]:INFO:tensorflow:Inference Time : 8.13254s
[1,1]:INFO:tensorflow:Finished evaluation at 2023-03-10-00:09:46
[1,1]:INFO:tensorflow:Saving dict for global step 285: global_step = 285, loss = 8.159287, masked_lm_accuracy = 0.05105472, masked_lm_loss = 7.4563823, next_sentence_accuracy = 0.59999996, next_sentence_loss = 0.68749994
[1,1]:INFO:tensorflow:Saving 'checkpoint_path' summary for global step 285: /mnt/dramfs/bert_gaudi4_2023-03-09_235729/ip-10-0-20-243/model.ckpt-285
[1,1]:Traceback (most recent call last):
[1,1]: File "/root/MLPERF/Intel-HabanaLabs/benchmarks/bert/implementations/HLS-1H-N32/../TensorFlow/nlp/bert/run_pretraining.py", line 1423, in
[1,1]: tf.compat.v1.app.run()
[1,1]: File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/platform/app.py", line 36, in run
[1,1]: _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
[1,1]: File "/usr/local/lib/python3.8/dist-packages/absl/app.py", line 312, in run
[1,1]: _run_main(main, args)
[1,1]: File "/usr/local/lib/python3.8/dist-packages/absl/app.py", line 258, in _run_main
[1,1]: sys.exit(main(argv))
[1,1]: File "/root/MLPERF/Intel-HabanaLabs/benchmarks/bert/implementations/HLS-1H-N32/../TensorFlow/nlp/bert/run_pretraining.py", line 1259, in main
[1,1]: mlperf_mlloger.event(key=mlperf_mllog.constants.EVAL_ACCURACY,value=eval_results["masked_lm_accuracy"],time_ms=mlperf_checkpoint_timestamp_dict[ckpt_ind + 1],metadata={'epoch_num': (ckpt_ind + 1)*FLAGS.samples_between_eval,'epoch_count': ckpt_ind + 1})
[1,1]:KeyError: 5
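The KeyError: 5 in the traceback above means that mlperf_checkpoint_timestamp_dict has no entry for key 5, i.e. for ckpt_ind = 4. The thread does not say how this was fixed, so the following is only a sketch of one defensive way to diagnose and guard such a lookup. Both helper names and the fallback value are hypothetical, and it assumes checkpoints are indexed from 0 as the `ckpt_ind + 1` expression suggests.

```python
def missing_eval_keys(timestamps, num_checkpoints):
    """List the `ckpt_ind + 1` keys that would be looked up but are absent.

    `timestamps` plays the role of mlperf_checkpoint_timestamp_dict;
    `num_checkpoints` is the number of checkpoints to be evaluated.
    """
    return [i + 1 for i in range(num_checkpoints) if (i + 1) not in timestamps]


def timestamp_for(timestamps, ckpt_ind, fallback_ms):
    """Return the recorded timestamp, or `fallback_ms` when none was recorded."""
    return timestamps.get(ckpt_ind + 1, fallback_ms)


if __name__ == "__main__":
    # Hypothetical example: timestamps were recorded for only four checkpoints.
    recorded = {1: 1000, 2: 2000, 3: 3000, 4: 4000}
    print(missing_eval_keys(recorded, 6))  # [5, 6]
    print(timestamp_for(recorded, 4, fallback_ms=0))  # 0
```

A fallback only hides the symptom; if keys are missing, it is worth checking why fewer timestamps were recorded than checkpoints were saved.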

Now the issue is that we get an accuracy of only 0.10 after 9120 training steps with an LR of 0.0045. We are unable to find the cause of this low accuracy.

Please find the log below. Could you please help here? Please let me know if you need the full log file and how to share it with you.

[1,0]:checkpoint file path=/mnt/dramfs/bert_gaudi4_2023-03-11_033402/ip-10-0-20-243/model.ckpt-9006
[1,0]:INFO:tensorflow:Calling model_fn.
[1,0]:INFO:tensorflow: Features
[1,0]:INFO:tensorflow: name = input_ids, shape = (None, 512)
[1,0]:INFO:tensorflow: name = input_mask, shape = (None, 512)
[1,0]:INFO:tensorflow: name = masked_lm_ids, shape = (None, 76)
[1,0]:INFO:tensorflow: name = masked_lm_positions, shape = (None, 76)
[1,0]:INFO:tensorflow: name = masked_lm_weights, shape = (None, 76)
[1,0]:INFO:tensorflow: name = next_sentence_labels, shape = (None, 1)
[1,0]:INFO:tensorflow: name = segment_ids, shape = (None, 512)
[1,0]:INFO:tensorflow:AsyncCheckpointSaverHook will be used for checkpoint saving
[1,0]:INFO:tensorflow:Create AsyncCheckpointSaverHook saving to path
[1,0]:/mnt/dramfs/bert_gaudi4_2023-03-11_033402/ip-10-0-20-243/model.ckpt
[1,0]:INFO:tensorflow:Done calling model_fn.
[1,0]:INFO:tensorflow:Starting evaluation at 2023-03-11T04:37:41
[1,0]:INFO:tensorflow:Graph was finalized.
[1,0]:INFO:tensorflow:Restoring parameters from /mnt/dramfs/bert_gaudi4_2023-03-11_033402/ip-10-0-20-243/model.ckpt-9006
[1,0]:2023-03-11 04:37:43.142657: E /home/jenkins/workspace/cdsoftwarebuilder/create-tensorflow-module---bpt-d/tensorflow-training/habana_device/graph_passes/control_computation_placement.cpp:152] Checking accuracy/broadcast_weights/assert_broadcastable/is_scalarsize=3 has_cpu_inputs=0
[1,0]:2023-03-11 04:37:43.142701: E /home/jenkins/workspace/cdsoftwarebuilder/create-tensorflow-module---bpt-d/tensorflow-training/habana_device/graph_passes/control_computation_placement.cpp:161] - rejected due to no-cpu-inputs
[1,0]:2023-03-11 04:37:43.142708: E /home/jenkins/workspace/cdsoftwarebuilder/create-tensorflow-module---bpt-d/tensorflow-training/habana_device/graph_passes/control_computation_placement.cpp:152] Checking accuracy/broadcast_weights/assert_broadcastable/is_valid_shape/Identitysize=1 has_cpu_inputs=0
[1,0]:2023-03-11 04:37:43.142712: E /home/jenkins/workspace/cdsoftwarebuilder/create-tensorflow-module---bpt-d/tensorflow-training/habana_device/graph_passes/control_computation_placement.cpp:161] - rejected due to no-cpu-inputs
[1,0]:2023-03-11 04:37:43.142717: E /home/jenkins/workspace/cdsoftwarebuilder/create-tensorflow-module---bpt-d/tensorflow-training/habana_device/graph_passes/control_computation_placement.cpp:152] Checking mean/broadcast_weights/assert_broadcastable/is_scalarsize=3 has_cpu_inputs=0
[1,0]:2023-03-11 04:37:43.142721: E /home/jenkins/workspace/cdsoftwarebuilder/create-tensorflow-module---bpt-d/tensorflow-training/habana_device/graph_passes/control_computation_placement.cpp:161] - rejected due to no-cpu-inputs
[1,0]:2023-03-11 04:37:43.142726: E /home/jenkins/workspace/cdsoftwarebuilder/create-tensorflow-module---bpt-d/tensorflow-training/habana_device/graph_passes/control_computation_placement.cpp:152] Checking mean/broadcast_weights/assert_broadcastable/is_valid_shape/Identitysize=1 has_cpu_inputs=0
[1,0]:2023-03-11 04:37:43.142730: E /home/jenkins/workspace/cdsoftwarebuilder/create-tensorflow-module---bpt-d/tensorflow-training/habana_device/graph_passes/control_computation_placement.cpp:161] - rejected due to no-cpu-inputs
[1,0]:2023-03-11 04:37:43.142735: E /home/jenkins/workspace/cdsoftwarebuilder/create-tensorflow-module---bpt-d/tensorflow-training/habana_device/graph_passes/control_computation_placement.cpp:152] Checking accuracy/broadcast_weights/assert_broadcastable/is_valid_shape/else/_1/has_valid_nonscalar_shape/is_same_ranksize=3 has_cpu_inputs=0
[1,0]:2023-03-11 04:37:43.142740: E /home/jenkins/workspace/cdsoftwarebuilder/create-tensorflow-module---bpt-d/tensorflow-training/habana_device/graph_passes/control_computation_placement.cpp:161] - rejected due to no-cpu-inputs
[1,0]:2023-03-11 04:37:43.142744: E /home/jenkins/workspace/cdsoftwarebuilder/create-tensorflow-module---bpt-d/tensorflow-training/habana_device/graph_passes/control_computation_placement.cpp:152] Checking mean/broadcast_weights/assert_broadcastable/is_valid_shape/else/_24/has_valid_nonscalar_shape/is_same_ranksize=3 has_cpu_inputs=0
[1,0]:2023-03-11 04:37:43.142748: E /home/jenkins/workspace/cdsoftwarebuilder/create-tensorflow-module---bpt-d/tensorflow-training/habana_device/graph_passes/control_computation_placement.cpp:161] - rejected due to no-cpu-inputs
[1,0]:INFO:tensorflow:Running local_init_op.
[1,0]:INFO:tensorflow:Done running local_init_op.
[1,0]:INFO:tensorflow:Evaluation [1/1]
[1,0]:INFO:tensorflow:Inference Time : 8.20981s
[1,0]:INFO:tensorflow:Finished evaluation at 2023-03-11-04:37:49
[1,0]:INFO:tensorflow:Saving dict for global step 9006: global_step = 9006, loss = 7.454265, masked_lm_accuracy = 0.10301063, masked_lm_loss = 6.8899846, next_sentence_accuracy = 0.67499995, next_sentence_loss = 0.571875
[1,0]:INFO:tensorflow:Saving 'checkpoint_path' summary for global step 9006: /mnt/dramfs/bert_gaudi4_2023-03-11_033402/ip-10-0-20-243/model.ckpt-9006
[1,0]:INFO:tensorflow:****
[1,0]:INFO:tensorflow:{'loss': 7.454265, 'masked_lm_accuracy': 0.10301063, 'masked_lm_loss': 6.8899846, 'next_sentence_accuracy': 0.67499995, 'next_sentence_loss': 0.571875, 'global_step': 9006}
[1,0]::::MLLOG {"namespace": "worker0", "time_ms": 1678508113577, "event_type": "POINT_IN_TIME", "key": "eval_accuracy", "value": 0.1030106320977211, "metadata": {"file": "/root/datasets/training_results_v2.0/Intel-HabanaLabs/benchmarks/bert/implementations/HLS-1H-N32/../TensorFlow/nlp/bert/run_pretraining.py", "lineno": 1262, "epoch_num": 147744, "epoch_count": 18}}
[1,0]:per rank mlm accuracy=0.1030106320977211
[1,0]:checkpoint file path=/mnt/dramfs/bert_gaudi4_2023-03-11_033402/ip-10-0-20-243/model.ckpt-9063
[1,0]:INFO:tensorflow:Calling model_fn.
[1,0]:INFO:tensorflow: Features
[1,0]:INFO:tensorflow: name = input_ids, shape = (None, 512)
[1,0]:INFO:tensorflow: name = input_mask, shape = (None, 512)
[1,0]:INFO:tensorflow: name = masked_lm_ids, shape = (None, 76)
[1,0]:INFO:tensorflow: name = masked_lm_positions, shape = (None, 76)
[1,0]:INFO:tensorflow: name = masked_lm_weights, shape = (None, 76)
[1,0]:INFO:tensorflow: name = next_sentence_labels, shape = (None, 1)
[1,0]:INFO:tensorflow: name = segment_ids, shape = (None, 512)
[1,0]:INFO:tensorflow:AsyncCheckpointSaverHook will be used for checkpoint saving
[1,0]:INFO:tensorflow:Create AsyncCheckpointSaverHook saving to path
[1,0]:/mnt/dramfs/bert_gaudi4_2023-03-11_033402/ip-10-0-20-243/model.ckpt
[1,0]:INFO:tensorflow:Done calling model_fn.
[1,0]:INFO:tensorflow:Starting evaluation at 2023-03-11T04:37:55
[1,0]:INFO:tensorflow:Graph was finalized.
[1,0]:INFO:tensorflow:Restoring parameters from /mnt/dramfs/bert_gaudi4_2023-03-11_033402/ip-10-0-20-243/model.ckpt-9063
[1,0]:2023-03-11 04:37:57.404176: E /home/jenkins/workspace/cdsoftwarebuilder/create-tensorflow-module---bpt-d/tensorflow-training/habana_device/graph_passes/control_computation_placement.cpp:152] Checking accuracy/broadcast_weights/assert_broadcastable/is_scalarsize=3 has_cpu_inputs=0
[1,0]:2023-03-11 04:37:57.404226: E /home/jenkins/workspace/cdsoftwarebuilder/create-tensorflow-module---bpt-d/tensorflow-training/habana_device/graph_passes/control_computation_placement.cpp:161] - rejected due to no-cpu-inputs
[1,0]:2023-03-11 04:37:57.404232: E /home/jenkins/workspace/cdsoftwarebuilder/create-tensorflow-module---bpt-d/tensorflow-training/habana_device/graph_passes/control_computation_placement.cpp:152] Checking accuracy/broadcast_weights/assert_broadcastable/is_valid_shape/Identitysize=1 has_cpu_inputs=0
[1,0]:2023-03-11 04:37:57.404237: E /home/jenkins/workspace/cdsoftwarebuilder/create-tensorflow-module---bpt-d/tensorflow-training/habana_device/graph_passes/control_computation_placement.cpp:161] - rejected due to no-cpu-inputs
[1,0]:2023-03-11 04:37:57.404242: E /home/jenkins/workspace/cdsoftwarebuilder/create-tensorflow-module---bpt-d/tensorflow-training/habana_device/graph_passes/control_computation_placement.cpp:152] Checking mean/broadcast_weights/assert_broadcastable/is_scalarsize=3 has_cpu_inputs=0
[1,0]:2023-03-11 04:37:57.404246: E /home/jenkins/workspace/cdsoftwarebuilder/create-tensorflow-module---bpt-d/tensorflow-training/habana_device/graph_passes/control_computation_placement.cpp:161] - rejected due to no-cpu-inputs
[1,0]:2023-03-11 04:37:57.404250: E /home/jenkins/workspace/cdsoftwarebuilder/create-tensorflow-module---bpt-d/tensorflow-training/habana_device/graph_passes/control_computation_placement.cpp:152] Checking mean/broadcast_weights/assert_broadcastable/is_valid_shape/Identitysize=1 has_cpu_inputs=0
[1,0]:2023-03-11 04:37:57.404255: E /home/jenkins/workspace/cdsoftwarebuilder/create-tensorflow-module---bpt-d/tensorflow-training/habana_device/graph_passes/control_computation_placement.cpp:161] - rejected due to no-cpu-inputs
[1,0]:2023-03-11 04:37:57.404259: E /home/jenkins/workspace/cdsoftwarebuilder/create-tensorflow-module---bpt-d/tensorflow-training/habana_device/graph_passes/control_computation_placement.cpp:152] Checking accuracy/broadcast_weights/assert_broadcastable/is_valid_shape/else/_1/has_valid_nonscalar_shape/is_same_ranksize=3 has_cpu_inputs=0
[1,0]:2023-03-11 04:37:57.404263: E /home/jenkins/workspace/cdsoftwarebuilder/create-tensorflow-module---bpt-d/tensorflow-training/habana_device/graph_passes/control_computation_placement.cpp:161] - rejected due to no-cpu-inputs
[1,0]:2023-03-11 04:37:57.404267: E /home/jenkins/workspace/cdsoftwarebuilder/create-tensorflow-module---bpt-d/tensorflow-training/habana_device/graph_passes/control_computation_placement.cpp:152] Checking mean/broadcast_weights/assert_broadcastable/is_valid_shape/else/_24/has_valid_nonscalar_shape/is_same_ranksize=3 has_cpu_inputs=0
[1,0]:2023-03-11 04:37:57.404272: E /home/jenkins/workspace/cdsoftwarebuilder/create-tensorflow-module---bpt-d/tensorflow-training/habana_device/graph_passes/control_computation_placement.cpp:161] - rejected due to no-cpu-inputs
[1,0]:INFO:tensorflow:Running local_init_op.
[1,0]:INFO:tensorflow:Done running local_init_op.
[1,0]:INFO:tensorflow:Evaluation [1/1]
[1,0]:INFO:tensorflow:Inference Time : 7.26540s
[1,0]:INFO:tensorflow:Finished evaluation at 2023-03-11-04:38:03
[1,0]:INFO:tensorflow:Saving dict for global step 9063: global_step = 9063, loss = 7.4505024, masked_lm_accuracy = 0.10360095, masked_lm_loss = 6.8901644, next_sentence_accuracy = 0.67499995, next_sentence_loss = 0.56874996
[1,0]:INFO:tensorflow:Saving 'checkpoint_path' summary for global step 9063: /mnt/dramfs/bert_gaudi4_2023-03-11_033402/ip-10-0-20-243/model.ckpt-9063
[1,0]:INFO:tensorflow:****
[1,0]:INFO:tensorflow:{'loss': 7.4505024, 'masked_lm_accuracy': 0.10360095, 'masked_lm_loss': 6.8901644, 'next_sentence_accuracy': 0.67499995, 'next_sentence_loss': 0.56874996, 'global_step': 9063}
[1,0]::::MLLOG {"namespace": "worker0", "time_ms": 1678508240866, "event_type": "POINT_IN_TIME", "key": "eval_accuracy", "value": 0.10360094904899597, "metadata": {"file": "/root/datasets/training_results_v2.0/Intel-HabanaLabs/benchmarks/bert/implementations/HLS-1H-N32/../TensorFlow/nlp/bert/run_pretraining.py", "lineno": 1262, "epoch_num": 155952, "epoch_count": 19}}
[1,0]:per rank mlm accuracy=0.10360094904899597
[1,0]:checkpoint file path=/mnt/dramfs/bert_gaudi4_2023-03-11_033402/ip-10-0-20-243/model.ckpt-9120
[1,0]:INFO:tensorflow:Calling model_fn.
[1,0]:INFO:tensorflow: Features
[1,0]:INFO:tensorflow: name = input_ids, shape = (None, 512)
[1,0]:INFO:tensorflow: name = input_mask, shape = (None, 512)
[1,0]:INFO:tensorflow: name = masked_lm_ids, shape = (None, 76)
[1,0]:INFO:tensorflow: name = masked_lm_positions, shape = (None, 76)
[1,0]:INFO:tensorflow: name = masked_lm_weights, shape = (None, 76)
[1,0]:INFO:tensorflow: name = next_sentence_labels, shape = (None, 1)
[1,0]:INFO:tensorflow: name = segment_ids, shape = (None, 512)
[1,0]:INFO:tensorflow:AsyncCheckpointSaverHook will be used for checkpoint saving
[1,0]:INFO:tensorflow:Create AsyncCheckpointSaverHook saving to path
[1,0]:/mnt/dramfs/bert_gaudi4_2023-03-11_033402/ip-10-0-20-243/model.ckpt
[1,0]:INFO:tensorflow:Done calling model_fn.
[1,0]:INFO:tensorflow:Starting evaluation at 2023-03-11T04:38:10
[1,0]:INFO:tensorflow:Graph was finalized.
[1,0]:INFO:tensorflow:Restoring parameters from /mnt/dramfs/bert_gaudi4_2023-03-11_033402/ip-10-0-20-243/model.ckpt-9120
[1,0]:2023-03-11 04:38:11.648713: E /home/jenkins/workspace/cdsoftwarebuilder/create-tensorflow-module---bpt-d/tensorflow-training/habana_device/graph_passes/control_computation_placement.cpp:152] Checking accuracy/broadcast_weights/assert_broadcastable/is_scalarsize=3 has_cpu_inputs=0
[1,0]:2023-03-11 04:38:11.648759: E /home/jenkins/workspace/cdsoftwarebuilder/create-tensorflow-module---bpt-d/tensorflow-training/habana_device/graph_passes/control_computation_placement.cpp:161] - rejected due to no-cpu-inputs
[1,0]:2023-03-11 04:38:11.648765: E /home/jenkins/workspace/cdsoftwarebuilder/create-tensorflow-module---bpt-d/tensorflow-training/habana_device/graph_passes/control_computation_placement.cpp:152] Checking accuracy/broadcast_weights/assert_broadcastable/is_valid_shape/Identitysize=1 has_cpu_inputs=0
[1,0]:2023-03-11 04:38:11.648770: E /home/jenkins/workspace/cdsoftwarebuilder/create-tensorflow-module---bpt-d/tensorflow-training/habana_device/graph_passes/control_computation_placement.cpp:161] - rejected due to no-cpu-inputs
[1,0]:2023-03-11 04:38:11.648774: E /home/jenkins/workspace/cdsoftwarebuilder/create-tensorflow-module---bpt-d/tensorflow-training/habana_device/graph_passes/control_computation_placement.cpp:152] Checking mean/broadcast_weights/assert_broadcastable/is_scalarsize=3 has_cpu_inputs=0
[1,0]:2023-03-11 04:38:11.648778: E /home/jenkins/workspace/cdsoftwarebuilder/create-tensorflow-module---bpt-d/tensorflow-training/habana_device/graph_passes/control_computation_placement.cpp:161] - rejected due to no-cpu-inputs
[1,0]:2023-03-11 04:38:11.648785: E /home/jenkins/workspace/cdsoftwarebuilder/create-tensorflow-module---bpt-d/tensorflow-training/habana_device/graph_passes/control_computation_placement.cpp:152] Checking mean/broadcast_weights/assert_broadcastable/is_valid_shape/Identitysize=1 has_cpu_inputs=0
[1,0]:2023-03-11 04:38:11.648789: E /home/jenkins/workspace/cdsoftwarebuilder/create-tensorflow-module---bpt-d/tensorflow-training/habana_device/graph_passes/control_computation_placement.cpp:161] - rejected due to no-cpu-inputs
[1,0]:2023-03-11 04:38:11.648793: E /home/jenkins/workspace/cdsoftwarebuilder/create-tensorflow-module---bpt-d/tensorflow-training/habana_device/graph_passes/control_computation_placement.cpp:152] Checking accuracy/broadcast_weights/assert_broadcastable/is_valid_shape/else/_1/has_valid_nonscalar_shape/is_same_ranksize=3 has_cpu_inputs=0
[1,0]:2023-03-11 04:38:11.648798: E /home/jenkins/workspace/cdsoftwarebuilder/create-tensorflow-module---bpt-d/tensorflow-training/habana_device/graph_passes/control_computation_placement.cpp:161] - rejected due to no-cpu-inputs
[1,0]:2023-03-11 04:38:11.648802: E /home/jenkins/workspace/cdsoftwarebuilder/create-tensorflow-module---bpt-d/tensorflow-training/habana_device/graph_passes/control_computation_placement.cpp:152] Checking mean/broadcast_weights/assert_broadcastable/is_valid_shape/else/_24/has_valid_nonscalar_shape/is_same_ranksize=3 has_cpu_inputs=0
[1,0]:2023-03-11 04:38:11.648806: E /home/jenkins/workspace/cdsoftwarebuilder/create-tensorflow-module---bpt-d/tensorflow-training/habana_device/graph_passes/control_computation_placement.cpp:161] - rejected due to no-cpu-inputs
[1,0]:INFO:tensorflow:Running local_init_op.
[1,0]:INFO:tensorflow:Done running local_init_op.
[1,0]:INFO:tensorflow:Evaluation [1/1]
[1,0]:INFO:tensorflow:Inference Time : 8.06458s
[1,0]:INFO:tensorflow:Finished evaluation at 2023-03-11-04:38:18
[1,0]:INFO:tensorflow:Saving dict for global step 9120: global_step = 9120, loss = 7.449696, masked_lm_accuracy = 0.103305794, masked_lm_loss = 6.890163, next_sentence_accuracy = 0.67499995, next_sentence_loss = 0.56874996
[1,0]:INFO:tensorflow:Saving 'checkpoint_path' summary for global step 9120: /mnt/dramfs/bert_gaudi4_2023-03-11_033402/ip-10-0-20-243/model.ckpt-9120
[1,0]:INFO:tensorflow:****
[1,0]:INFO:tensorflow:{'loss': 7.449696, 'masked_lm_accuracy': 0.103305794, 'masked_lm_loss': 6.890163, 'next_sentence_accuracy': 0.67499995, 'next_sentence_loss': 0.56874996, 'global_step': 9120}
[1,0]::::MLLOG {"namespace": "worker0", "time_ms": 1678508368194, "event_type": "POINT_IN_TIME", "key": "eval_accuracy", "value": 0.10330579429864883, "metadata": {"file": "/root/datasets/training_results_v2.0/Intel-HabanaLabs/benchmarks/bert/implementations/HLS-1H-N32/../TensorFlow/nlp/bert/run_pretraining.py", "lineno": 1262, "epoch_num": 164160, "epoch_count": 20}}
[1,0]:per rank mlm accuracy=0.10330579429864883
[1,0]:Total offline non-distributed evaluation time=281.123170375824 seconds
[1,0]::::MLLOG {"namespace": "worker0", "time_ms": 1678508368194, "event_type": "INTERVAL_END", "key": "run_stop", "value": 0.10330579429864883, "metadata": {"file": "/root/datasets/training_results_v2.0/Intel-HabanaLabs/benchmarks/bert/implementations/HLS-1H-N32/../TensorFlow/nlp/bert/run_pretraining.py", "lineno": 1303, "epoch_num": 164160, "epoch_count": 20, "status": "fail"}}

Thanks, Rajesh
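For a long log like the one above, it can help to pull out just the :::MLLOG records and look at how eval_accuracy develops. The sketch below does that with the standard json module. The function names are hypothetical, and the 0.72 default is the MLPerf BERT masked-LM accuracy target as far as I know, so treat it as an assumption and check it against the benchmark rules.

```python
import json

MLLOG_MARKER = ":::MLLOG "


def parse_mllog(text):
    """Yield every JSON record that follows a ':::MLLOG ' marker.

    raw_decode stops at the end of each JSON object, so this works whether
    the log has one entry per line or several entries run together.
    """
    decoder = json.JSONDecoder()
    pos = text.find(MLLOG_MARKER)
    while pos != -1:
        start = pos + len(MLLOG_MARKER)
        try:
            record, _ = decoder.raw_decode(text, start)
            yield record
        except json.JSONDecodeError:
            pass  # skip truncated or damaged records
        pos = text.find(MLLOG_MARKER, start)


def eval_accuracies(text):
    """Return (epoch_num, accuracy) pairs for all eval_accuracy events, in log order."""
    return [
        (rec["metadata"]["epoch_num"], rec["value"])
        for rec in parse_mllog(text)
        if rec.get("key") == "eval_accuracy"
    ]


def reached_target(text, target=0.72):
    """True if the last logged eval accuracy meets the (assumed) target."""
    accs = eval_accuracies(text)
    return bool(accs) and accs[-1][1] >= target


if __name__ == "__main__":
    import sys
    log_text = sys.stdin.read()
    for epoch_num, acc in eval_accuracies(log_text):
        print(epoch_num, acc)
    print("target reached:", reached_target(log_text))
```

Piping the full log into this script gives one accuracy value per evaluation, which makes it easy to see whether accuracy is still improving or has plateaued around 0.10 as in the excerpt above.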

greg-serochi commented 1 year ago

Hi @rajeshitshoulders, we'll review this info, and if we need the full log file we'll have you post it to the user forum.

PurvangL commented 1 year ago

Hi @greg-serochi, I am working with Rajesh on this BERT issue. The Gaudi README mentioned above defines the steps to run BERT training on Gaudi, but it uses the bookswiki dataset. By combining the TensorFlow data preparation from the Gaudi2 README with the packing from the Gaudi1 README, I was able to run training. But because the evaluation data is in a txt file and the Gaudi1 training process expects TFRecord or packed TFRecord format, I am not able to get the accuracy of my run.

Could you point out whether there is a way to generate the Wikipedia eval dataset in TFRecord format and then pack it?

Thank you