HabanaAI / Model-References

Reference models for Intel(R) Gaudi(R) AI Accelerator

"Training Data Packing" got error - RuntimeError: Maximum number of iterations reached. #40

Open jingkang99 opened 1 month ago

jingkang99 commented 1 month ago

I followed the instructions at https://github.com/HabanaAI/Model-References/tree/master/MLPERF3.1/Training/benchmarks

and ran this command (with scipy 1.13.0 installed):

```
python3 pack_pretraining_data_pytorch.py --input_dir=$PYTORCH_BERT_DATA/hdf5/training-4320/hdf5_4320_shards_uncompressed --output_dir=$PYTORCH_BERT_DATA/packed --max_predictions_per_seq=76
```


```
...
Dataset has 156725280 samples
Determining packing recipe
Begin packing pass
Unpacked mean sequence length: 254.43
Found 22102 unique packing strategies.
Iteration: 0: sequences still to pack: 156725280
Traceback (most recent call last):
  File "/sox/habana-intel/Model-References/MLPERF3.1/Training/benchmarks/bert/implementations/PyTorch/pack_pretraining_data_pytorch.py", line 467, in <module>
    main()
  File "/sox/habana-intel/Model-References/MLPERF3.1/Training/benchmarks/bert/implementations/PyTorch/pack_pretraining_data_pytorch.py", line 420, in main
    strategy_set, mixture, padding, slicing = get_packing_recipe(args.output_dir, sequence_lengths, args.max_sequence_length, args.max_sequences_per_pack)
  File "/sox/habana-intel/Model-References/MLPERF3.1/Training/benchmarks/bert/implementations/PyTorch/pack_pretraining_data_pytorch.py", line 111, in get_packing_recipe
    partial_mixture, rnorm = optimize.nnls(np.expand_dims(w0, -1) * A, w0 * b)
  File "/opt/python-llm/lib/python3.10/site-packages/scipy/optimize/_nnls.py", line 93, in nnls
    raise RuntimeError("Maximum number of iterations reached.")
RuntimeError: Maximum number of iterations reached.
```
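For context: the failing call is the non-negative least-squares solve inside `get_packing_recipe`. scipy 1.12 replaced the old Fortran `nnls` with a new Python implementation, which can hit its iteration cap on large problems like this one. Below is a rough, untested sketch of the call pattern with two possible workarounds; the random data and shapes are illustrative stand-ins, not the script's real packing matrix:

```python
import numpy as np
from scipy import optimize

# Illustrative stand-ins: in the real script, A is roughly a
# (max_sequence_length x num_strategies) packing matrix (~22k strategies
# here), b the histogram of sequence lengths, and w0 a weight vector.
rng = np.random.default_rng(0)
A = rng.random((512, 2048))
b = rng.random(512)
w0 = rng.random(512)

wA = np.expand_dims(w0, -1) * A   # weighted system, as in the traceback
wb = w0 * b

try:
    # Possible workaround 1: raise the iteration cap; nnls accepts a
    # maxiter argument.
    mixture, rnorm = optimize.nnls(wA, wb, maxiter=100 * wA.shape[1])
except RuntimeError:
    # Possible workaround 2: solve the same non-negative least-squares
    # problem with lsq_linear, which uses a different algorithm.
    res = optimize.lsq_linear(wA, wb, bounds=(0, np.inf))
    mixture, rnorm = res.x, np.linalg.norm(wA @ res.x - wb)
```

Downgrading scipy, as suggested later in this thread, avoids the issue without touching the script.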


(screenshot: the "Training Data Packing" instructions)

jingkang99 commented 1 month ago

I tried the same steps with the latest PyTorch Docker image; the same error occurs.


Any advice?

Jing1Ling commented 1 week ago

Hi @jingkang99, I changed the scipy version to 1.11.4 and it works :).
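For anyone else hitting this, that presumably means pinning the older release before running the packing script, e.g. `pip install scipy==1.11.4`; the 1.11.x series predates the scipy 1.12 rewrite of `optimize.nnls`.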