NVIDIA-Merlin / models

Merlin Models is a collection of deep learning recommender system model reference implementations
https://nvidia-merlin.github.io/models/main/index.html
Apache License 2.0

[BUG] Notebook example multi gpu parallel training using horovod fails #1114

Closed · vs385 closed this 1 year ago

vs385 commented 1 year ago

Bug description

I'm trying to run the notebook example and keep getting the error below.

Steps/Code to reproduce bug

  1. Running the notebook in Databricks Runtime 13.0 ML GPU on a g5.12xlarge instance type (2 workers)
  2. Installed the dependencies:

     pip install cudf-cu11 dask-cudf-cu11 --extra-index-url=https://pypi.nvidia.com
     pip install cuml-cu11 --extra-index-url=https://pypi.nvidia.com
     pip install cugraph-cu11 --extra-index-url=https://pypi.nvidia.com
     pip install merlin-models nvtabular transformers4rec[pytorch,nvtabular,dataloader]==23.2.0 protobuf==3.20.*

Then launched a cluster

  3. Tried running the notebook example (launch command below)

P.S. I've also been trying to run Horovod on my own models aside from the example, and I get the exact same error from the data loader. I printed str(MPI_RANK) to verify that the correct parquet partitions are being loaded on each rank:

[1,0]:MPI_RANK is : 0
[1,3]:MPI_RANK is : 3
[1,2]:MPI_RANK is : 2
[1,1]:MPI_RANK is : 1
.......
[1,1]:/dbfs/Workspace/.../tmp/data/processed_nvt/full_dataset_positive_events_train/part_1.parquet
[1,0]:/dbfs/Workspace/.../tmp/data/processed_nvt/full_dataset_positive_events_train/part_0.parquet
[1,2]:/dbfs/Workspace/.../tmp//data/processed_nvt/full_dataset_positive_events_train/part_2.parquet
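For completeness, I launch the trainer roughly like this (4 processes for the 4 GPUs; the exact arguments are from my setup, not necessarily the notebook's):

    horovodrun -np 4 python tf_trainer.py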

P.S. I have also tried running this without Horovod and the model trains fine, so the issue seems to be in the data loader, specifically when creating the train_loader and valid_loader objects.
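This is roughly how I create the loaders (a simplified sketch of my code, not verbatim from tf_trainer.py; the path and batch size are placeholders, and I'm assuming the Loader's global_size/global_rank arguments are what shard the data across workers):

import horovod.tensorflow.keras as hvd
from merlin.io import Dataset
from merlin.dataloader.tensorflow import Loader

hvd.init()

# placeholder path; mine points at the processed_nvt parquet partitions
train = Dataset("/dbfs/.../full_dataset_positive_events_train/*.parquet")

train_loader = Loader(
    train,
    batch_size=1024,          # placeholder value
    shuffle=True,
    drop_last=True,
    global_size=hvd.size(),   # total number of workers (4 here)
    global_rank=hvd.rank(),   # this worker's shard index
)

# this is the call that blows up (tf_trainer.py line 59 in the traceback below)
print("Number batches: " + str(len(train_loader)))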

nvidia-smi output:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.103.01   Driver Version: 470.103.01   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A10G        Off   | 00000000:00:1B.0 Off |                    0 |
|  0%   25C    P8    19W / 300W |      0MiB / 22731MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A10G        Off   | 00000000:00:1C.0 Off |                    0 |
|  0%   25C    P8    16W / 300W |      0MiB / 22731MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A10G        Off   | 00000000:00:1D.0 Off |                    0 |
|  0%   24C    P8    15W / 300W |      0MiB / 22731MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A10G        Off   | 00000000:00:1E.0 Off |                    0 |
|  0%   25C    P8    16W / 300W |      0MiB / 22731MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Expected behavior

The notebook should run successfully.

Environment details

Additional context

Full error output below:

Both ranks fail identically; the mpirun output interleaves the [1,0] and [1,1] tracebacks, so here is the de-interleaved traceback (the same for both ranks):

  File "/Workspace/Repos//merlin-models/examples/usecases/tf_trainer.py", line 59, in <module>
    print("Number batches: " + str(len(train_loader)))
  File "/databricks/python/lib/python3.10/site-packages/merlin/dataloader/tensorflow.py", line 83, in __len__
    return LoaderBase.__len__(self)
  File "/databricks/python/lib/python3.10/site-packages/merlin/dataloader/loader_base.py", line 192, in __len__
    batches = _num_steps(self._buff_len, self.batch_size)
  File "/databricks/python/lib/python3.10/site-packages/merlin/dataloader/loader_base.py", line 153, in _buff_len
    self.__buff_len = len(self._buff)
  File "/databricks/python/lib/python3.10/site-packages/merlin/dataloader/loader_base.py", line 140, in _buff
    self.__buff = ChunkQueue(
  File "/databricks/python/lib/python3.10/site-packages/merlin/dataloader/loader_base.py", line 709, in __init__
    self.itr = dataloader._data_iter(epochs)
  File "/databricks/python/lib/python3.10/site-packages/merlin/dataloader/loader_base.py", line 268, in _data_iter
    indices = self._indices_for_process()
  File "/databricks/python/lib/python3.10/site-packages/merlin/dataloader/loader_base.py", line 224, in _indices_for_process
    raise IndexError
IndexError
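If I'm reading loader_base.py correctly, _indices_for_process raises this IndexError when the dataset has fewer partitions than global_size, i.e. some workers would get no partition at all. A workaround sketch I'm considering (untested; assumes merlin.io.Dataset.repartition is available in this Merlin version):

import horovod.tensorflow.keras as hvd
from merlin.io import Dataset

hvd.init()

train = Dataset("/dbfs/.../full_dataset_positive_events_train/*.parquet")

# Make sure there is at least one parquet partition per Horovod worker,
# otherwise _indices_for_process cannot assign indices to every rank.
# (My assumption about the root cause, not a confirmed fix.)
if train.to_ddf().npartitions < hvd.size():
    train = train.repartition(npartitions=hvd.size())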

Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted.


mpirun detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:

Process name: [[32806,1],0]
Exit code: 1