p.s. I've been trying to run Horovod on my own models, aside from the example, and I get the exact same error from the data loader. I printed out str(MPI_RANK) to make sure the correct parquet partitions are being loaded:
[1,0]:MPI_RANK is : 0
[1,3]:MPI_RANK is : 3
[1,2]:MPI_RANK is : 2
[1,1]:MPI_RANK is : 1
.......
[1,1]:/dbfs/Workspace/.../tmp/data/processed_nvt/full_dataset_positive_events_train/part_1.parquet
[1,0]:/dbfs/Workspace/.../tmp/data/processed_nvt/full_dataset_positive_events_train/part_0.parquet
[1,2]:/dbfs/Workspace/.../tmp//data/processed_nvt/full_dataset_positive_events_train/part_2.parquet
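The partition selection in my script is essentially the following. This is only a minimal sketch: the base path is a placeholder for the /dbfs path elided above, and I'm assuming the rank comes from hvd.rank() as in the example (in my logs it is printed as MPI_RANK).

import os
import horovod.tensorflow.keras as hvd
from merlin.io import Dataset

hvd.init()

# one parquet partition per Horovod worker
MPI_RANK = hvd.rank()
print("MPI_RANK is : " + str(MPI_RANK))

# placeholder for the elided /dbfs base path shown in the log above
base = "/dbfs/<elided>/tmp/data/processed_nvt/full_dataset_positive_events_train"
part = os.path.join(base, f"part_{MPI_RANK}.parquet")
print(part)

train_ds = Dataset(part, engine="parquet")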
Bug description
I'm trying to run the notebook example and keep getting the error below.
Steps/Code to reproduce bug
Launched a Databricks cluster and ran the example notebook, which launches examples/usecases/tf_trainer.py under Horovod/mpirun.
p.s. I have also tried running this without Horovod and the model trains fine, so the issue seems to be with the data loader when creating the train_loader and valid_loader objects.
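Roughly, the loader-creation step that fails under Horovod looks like this. It is a sketch following the example's pattern; the dataset paths and batch size are placeholders, not my exact code.

import horovod.tensorflow.keras as hvd
from merlin.io import Dataset
from merlin.dataloader.tensorflow import Loader

hvd.init()

# each worker opens its own parquet partition (placeholder paths)
train_ds = Dataset("<train parquet for this rank>", engine="parquet")
valid_ds = Dataset("<valid parquet for this rank>", engine="parquet")

train_loader = Loader(train_ds, batch_size=1024, shuffle=True)
valid_loader = Loader(valid_ds, batch_size=1024, shuffle=False)

# this is the call that raises the IndexError under horovodrun/mpirun
print("Number batches: " + str(len(train_loader)))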
nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.103.01   Driver Version: 470.103.01   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A10G         Off  | 00000000:00:1B.0 Off |                    0 |
|  0%   25C    P8    19W / 300W |      0MiB / 22731MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A10G         Off  | 00000000:00:1C.0 Off |                    0 |
|  0%   25C    P8    16W / 300W |      0MiB / 22731MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A10G         Off  | 00000000:00:1D.0 Off |                    0 |
|  0%   24C    P8    15W / 300W |      0MiB / 22731MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A10G         Off  | 00000000:00:1E.0 Off |                    0 |
|  0%   25C    P8    16W / 300W |      0MiB / 22731MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                   |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory  |
|        ID   ID                                                   Usage       |
|=============================================================================|
|  No running processes found                                                  |
+-----------------------------------------------------------------------------+
Expected behavior
Notebook should have run successfully
Environment details
Merlin version: 23.4.0
Platform: Databricks
Python version: 3.10.6
PyTorch version (GPU?): NA
Tensorflow version (GPU?): 2.11.0
Additional context
Please find the full error output below:
[1,1]:  File "/Workspace/Repos//merlin-models/examples/usecases/tf_trainer.py", line 59, in <module>
[1,0]:    print("Number batches: " + str(len(train_loader)))
[1,0]:  File "/databricks/python/lib/python3.10/site-packages/merlin/dataloader/tensorflow.py", line 83, in __len__
[1,0]:    return LoaderBase.__len__(self)
[1,0]:  File "/databricks/python/lib/python3.10/site-packages/merlin/dataloader/loader_base.py", line 192, in __len__
[1,0]:    batches = _num_steps(self._buff_len, self.batch_size)
[1,0]:  File "/databricks/python/lib/python3.10/site-packages/merlin/dataloader/loader_base.py", line 153, in _buff_len
[1,0]:    self.buff_len = len(self._buff)
[1,0]:  File "/databricks/python/lib/python3.10/site-packages/merlin/dataloader/loader_base.py", line 140, in _buff
[1,0]:    self.buff = ChunkQueue(
[1,0]:  File "/databricks/python/lib/python3.10/site-packages/merlin/dataloader/loader_base.py", line 709, in __init__
[1,0]:    self.itr = dataloader._data_iter(epochs)
[1,0]:  File "/databricks/python/lib/python3.10/site-packages/merlin/dataloader/loader_base.py", line 268, in _data_iter
[1,1]:    print("Number batches: " + str(len(train_loader)))
[1,1]:  File "/databricks/python/lib/python3.10/site-packages/merlin/dataloader/tensorflow.py", line 83, in __len__
[1,0]:    indices = self._indices_for_process()
[1,0]:  File "/databricks/python/lib/python3.10/site-packages/merlin/dataloader/loader_base.py", line 224, in _indices_for_process
[1,0]:    raise IndexError
[1,0]:IndexError
[1,1]:    return LoaderBase.__len__(self)
[1,1]:  File "/databricks/python/lib/python3.10/site-packages/merlin/dataloader/loader_base.py", line 192, in __len__
[1,1]:    batches = _num_steps(self._buff_len, self.batch_size)
[1,1]:  File "/databricks/python/lib/python3.10/site-packages/merlin/dataloader/loader_base.py", line 153, in _buff_len
[1,1]:    self.buff_len = len(self._buff)
[1,1]:  File "/databricks/python/lib/python3.10/site-packages/merlin/dataloader/loader_base.py", line 140, in _buff
[1,1]:    self.buff = ChunkQueue(
[1,1]:  File "/databricks/python/lib/python3.10/site-packages/merlin/dataloader/loader_base.py", line 709, in __init__
[1,1]:    self.itr = dataloader._data_iter(epochs)
[1,1]:  File "/databricks/python/lib/python3.10/site-packages/merlin/dataloader/loader_base.py", line 268, in _data_iter
[1,1]:    indices = self._indices_for_process()
[1,1]:  File "/databricks/python/lib/python3.10/site-packages/merlin/dataloader/loader_base.py", line 224, in _indices_for_process
[1,1]:    raise IndexError
[1,1]:IndexError
Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted.
mpirun detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:
Process name: [[32806,1],0]
Exit code: 1
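For context, the failing frame is _indices_for_process, which shards the dataset's partition indices across workers. Purely as an illustration (this is not Merlin's actual code), the kind of check that raises an IndexError when a worker is given fewer partitions than the global worker count looks like this:

# Illustration only -- not taken from merlin/dataloader/loader_base.py.
def indices_for_process(indices, global_size, global_rank):
    # if the dataset has fewer partitions than workers, some worker
    # would get nothing to read, so refuse up front
    if len(indices) < global_size:
        raise IndexError
    per_worker = -(-len(indices) // global_size)  # ceil division
    start = global_rank * per_worker
    return indices[start:start + per_worker]

# e.g. a single-partition dataset handled by a loader that believes there
# are 4 workers would raise on every rank:
try:
    indices_for_process(indices=[0], global_size=4, global_rank=0)
except IndexError:
    print("IndexError: fewer partitions than workers")

I have not verified whether the example actually passes a global size/rank to the loader; the sketch is only to show how this kind of sharding check could fail.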