NVIDIA-Merlin / models

Merlin Models is a collection of deep learning recommender system model reference implementations
https://nvidia-merlin.github.io/models/main/index.html
Apache License 2.0

[BUG] Notebook example multi gpu parallel training using horovod fails #1114

Closed · vs385 closed this 1 year ago

vs385 commented 1 year ago

Bug description

I'm trying to run the notebook example and keep getting the error below.

Steps/Code to reproduce bug

  1. Running the notebook in Databricks Runtime 13.0 ML GPU on a g5.12xlarge instance type (2 workers)
  2. Installed the dependencies:

     pip install cudf-cu11 dask-cudf-cu11 --extra-index-url=https://pypi.nvidia.com
     pip install cuml-cu11 --extra-index-url=https://pypi.nvidia.com
     pip install cugraph-cu11 --extra-index-url=https://pypi.nvidia.com
     pip install merlin-models nvtabular transformers4rec[pytorch,nvtabular,dataloader]==23.2.0 protobuf==3.20.*

Then launched a cluster

  3. Tried running the notebook example (launch command below)

P.S. I've also been trying to run Horovod on my own models aside from the example, and I get the exact same error from the data loader. I printed str(MPI_RANK) to verify that the correct parquet partitions are being loaded on each rank:

[1,0]:MPI_RANK is : 0
[1,3]:MPI_RANK is : 3
[1,2]:MPI_RANK is : 2
[1,1]:MPI_RANK is : 1
.......
[1,1]:/dbfs/Workspace/.../tmp/data/processed_nvt/full_dataset_positive_events_train/part_1.parquet
[1,0]:/dbfs/Workspace/.../tmp/data/processed_nvt/full_dataset_positive_events_train/part_0.parquet
[1,2]:/dbfs/Workspace/.../tmp//data/processed_nvt/full_dataset_positive_events_train/part_2.parquet
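For completeness, I launch the trainer roughly like this (4 processes for the 4 GPUs; the exact arguments are from my setup, not necessarily the notebook's):

    horovodrun -np 4 python tf_trainer.py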

P.S. I have also tried running this without Horovod and the model trains fine, so the issue seems to be in the data loader, specifically when creating the train_loader and valid_loader objects.
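This is roughly how I create the loaders (a simplified sketch of my code, not verbatim from tf_trainer.py; the path and batch size are placeholders, and I'm assuming the Loader's global_size/global_rank arguments are what shard the data across workers):

import horovod.tensorflow.keras as hvd
from merlin.io import Dataset
from merlin.dataloader.tensorflow import Loader

hvd.init()

# placeholder path; mine points at the processed_nvt parquet partitions
train = Dataset("/dbfs/.../full_dataset_positive_events_train/*.parquet")

train_loader = Loader(
    train,
    batch_size=1024,          # placeholder value
    shuffle=True,
    drop_last=True,
    global_size=hvd.size(),   # total number of workers (4 here)
    global_rank=hvd.rank(),   # this worker's shard index
)

# this is the call that blows up (tf_trainer.py line 59 in the traceback below)
print("Number batches: " + str(len(train_loader)))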

nvidia-smi output:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.103.01   Driver Version: 470.103.01   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A10G        Off   | 00000000:00:1B.0 Off |                    0 |
|  0%   25C    P8    19W / 300W |      0MiB / 22731MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A10G        Off   | 00000000:00:1C.0 Off |                    0 |
|  0%   25C    P8    16W / 300W |      0MiB / 22731MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A10G        Off   | 00000000:00:1D.0 Off |                    0 |
|  0%   24C    P8    15W / 300W |      0MiB / 22731MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A10G        Off   | 00000000:00:1E.0 Off |                    0 |
|  0%   25C    P8    16W / 300W |      0MiB / 22731MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Expected behavior

The notebook should run successfully.

Environment details

Additional context

Full error output below:

Both ranks fail identically; the mpirun output interleaves the [1,0] and [1,1] tracebacks, so here is the de-interleaved traceback (the same for both ranks):

  File "/Workspace/Repos//merlin-models/examples/usecases/tf_trainer.py", line 59, in <module>
    print("Number batches: " + str(len(train_loader)))
  File "/databricks/python/lib/python3.10/site-packages/merlin/dataloader/tensorflow.py", line 83, in __len__
    return LoaderBase.__len__(self)
  File "/databricks/python/lib/python3.10/site-packages/merlin/dataloader/loader_base.py", line 192, in __len__
    batches = _num_steps(self._buff_len, self.batch_size)
  File "/databricks/python/lib/python3.10/site-packages/merlin/dataloader/loader_base.py", line 153, in _buff_len
    self.__buff_len = len(self._buff)
  File "/databricks/python/lib/python3.10/site-packages/merlin/dataloader/loader_base.py", line 140, in _buff
    self.__buff = ChunkQueue(
  File "/databricks/python/lib/python3.10/site-packages/merlin/dataloader/loader_base.py", line 709, in __init__
    self.itr = dataloader._data_iter(epochs)
  File "/databricks/python/lib/python3.10/site-packages/merlin/dataloader/loader_base.py", line 268, in _data_iter
    indices = self._indices_for_process()
  File "/databricks/python/lib/python3.10/site-packages/merlin/dataloader/loader_base.py", line 224, in _indices_for_process
    raise IndexError
IndexError
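If I'm reading loader_base.py correctly, _indices_for_process raises this IndexError when the dataset has fewer partitions than global_size, i.e. some workers would get no partition at all. A workaround sketch I'm considering (untested; assumes merlin.io.Dataset.repartition is available in this Merlin version):

import horovod.tensorflow.keras as hvd
from merlin.io import Dataset

hvd.init()

train = Dataset("/dbfs/.../full_dataset_positive_events_train/*.parquet")

# Make sure there is at least one parquet partition per Horovod worker,
# otherwise _indices_for_process cannot assign indices to every rank.
# (My assumption about the root cause, not a confirmed fix.)
if train.to_ddf().npartitions < hvd.size():
    train = train.repartition(npartitions=hvd.size())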

Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted.


mpirun detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:

Process name: [[32806,1],0]
Exit code: 1