[BUG] encounter error when running sok dlrm benchmark

Orca-bit commented 1 month ago

Describe the bug

Create train.bin and test.bin following the HugeCTR DLRM sample. The md5sums are the same.
Split the data using the SOK preprocessing script split_bin.py, replacing --slot_size_array with the list from train.py in the HugeCTR DLRM sample. All other arguments are left at their defaults. Do I need to change the default dtype (int32) for label_raw_type, dense_raw_type, and category_raw_type?
horovodrun -np 8 ./hvd_wrapper.sh python3 main.py --data_dir=./splited_dataset/ --global_batch=65536 --epochs=1 --lr=24
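
On the raw dtype question, one quick sanity check is whether the size of a raw .bin file is a whole multiple of the per-sample record size implied by the chosen dtypes. The sketch below is only an illustration: the helper names are hypothetical, and it assumes the usual Criteo layout of 1 label, 13 dense, and 26 categorical fields stored back to back per sample, which may not match what split_bin.py actually expects.

```python
# Item sizes in bytes for the candidate raw types (assumed set of names).
ITEM_SIZE = {"int32": 4, "float32": 4, "int64": 8}


def record_size(label_type, dense_type, category_type,
                num_dense=13, num_category=26):
    """Bytes per sample: one label, num_dense dense and num_category categorical fields."""
    return (ITEM_SIZE[label_type]
            + num_dense * ITEM_SIZE[dense_type]
            + num_category * ITEM_SIZE[category_type])


def count_samples(file_size, label_type="int32", dense_type="int32",
                  category_type="int32"):
    """Return the number of whole samples in a file of file_size bytes.

    Raises ValueError if the size is not a multiple of the record size,
    which suggests a wrong dtype assumption or a truncated file.
    """
    size = record_size(label_type, dense_type, category_type)
    samples, leftover = divmod(file_size, size)
    if leftover:
        raise ValueError(
            f"{file_size} bytes is not a multiple of the {size}-byte record "
            f"({leftover} bytes left over)"
        )
    return samples
```

For a real file, pass `os.path.getsize(path)` as `file_size`. Divisibility is a necessary condition only: int32 and float32 have the same width, so this check cannot tell them apart.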

After running iteration 3790, the following errors occur. It looks like something is wrong with the dataset.

[1,6]<stderr>:Traceback (most recent call last):
[1,6]<stderr>:  File "/ws/HugeCTR/sparse_operation_kit/SOK_DLRM_Benchmark/main.py", line 146, in <module>
[1,6]<stderr>:    trainer.train(eval_in_last=False, early_stop=args.early_stop, epochs=args.epochs)
[1,6]<stderr>:  File "/ws/HugeCTR/sparse_operation_kit/SOK_DLRM_Benchmark/trainer.py", line 247, in train
[1,6]<stderr>:    auc = evaluate(self._model, self._test_dataset, self._auc_thresholds)
[1,6]<stderr>:  File "/ws/HugeCTR/sparse_operation_kit/SOK_DLRM_Benchmark/trainer.py", line 20, in evaluate
[1,6]<stderr>:    for idx, (samples, labels) in enumerate(dataset):
[1,6]<stderr>:  File "/ws/HugeCTR/sparse_operation_kit/SOK_DLRM_Benchmark/dataset.py", line 152, in __getitem__
[1,6]<stderr>:    return self._prefetch_queue.get().result()
[1,6]<stderr>:  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 451, in result
[1,6]<stderr>:    return self.__get_result()
[1,6]<stderr>:  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
[1,6]<stderr>:    raise self._exception
[1,6]<stderr>:  File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
[1,6]<stderr>:    result = self.fn(*self.args, **self.kwargs)
[1,6]<stderr>:  File "/ws/HugeCTR/sparse_operation_kit/SOK_DLRM_Benchmark/dataset.py", line 205, in _get
[1,6]<stderr>:    tf.RaggedTensor.from_row_lengths(flat_values, row_lengths[i])
[1,6]<stderr>:  File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/util/traceback_utils.py", line 153, in error_handler
[1,6]<stderr>:    raise e.with_traceback(filtered_tb) from None
[1,6]<stderr>:  File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/ops/check_ops.py", line 485, in _binary_assert
[1,6]<stderr>:    raise errors.InvalidArgumentError(
[1,6]<stderr>:tensorflow.python.framework.errors_impl.InvalidArgumentError: Arguments to _from_row_partition do not form a valid RaggedTensor
[1,6]<stderr>:Condition x == y did not hold.
[1,6]<stderr>:First 1 elements of x:
[1,6]<stderr>:[8192]
[1,6]<stderr>:First 1 elements of y:
[1,6]<stderr>:[2]
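
For context, tf.RaggedTensor.from_row_lengths requires the row lengths to sum to the number of flat values, and the failed assertion above is comparing two sizes that disagree (8192 versus 2). A minimal pure-Python sketch of that invariant follows; the helper name is hypothetical, and it mirrors the documented requirement rather than TensorFlow's internal code. Calling something like it on each batch before building the ragged tensor could help locate the offending batch.

```python
def check_ragged_batch(num_flat_values, row_lengths):
    """Check the invariant required to build a ragged tensor from row lengths.

    num_flat_values: number of elements in the flat values array.
    row_lengths: iterable of non-negative row lengths.
    Returns the number of rows, or raises ValueError on a mismatch.
    """
    lengths = list(row_lengths)
    if any(length < 0 for length in lengths):
        raise ValueError("row lengths must be non-negative")
    total = sum(lengths)
    if total != num_flat_values:
        raise ValueError(
            f"row lengths sum to {total}, but there are "
            f"{num_flat_values} flat values"
        )
    return len(lengths)
```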

kanghui0204 commented 1 month ago

Hi @Orca-bit, is this bug reproducible every time? If so, I will try to reproduce it and then get back to you with an answer. I will also test the issue reported at https://github.com/NVIDIA-Merlin/HugeCTR/issues/463.

Orca-bit commented 1 month ago

@kanghui0204 Yes, it is reproducible. By the way, could you share the md5sums of the SOK split datasets? I have already checked the md5sums of the HugeCTR datasets, i.e. train.bin, test.bin, and val.bin.
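
To compare checksums of the split files on both sides, a small helper along these lines could be used. It is a sketch using only the Python standard library; the function name is hypothetical, and the file is read in chunks so that large .bin files do not need to fit in memory.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of the file at path, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The result should match what the md5sum command prints for the same file.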

NVIDIA-Merlin / HugeCTR

[BUG] encounter error when running sok dlrm benchmark #461