I am using the NVIDIA Merlin Docker 23.08 TensorFlow container.
I've created my training and validation datasets and saved them as Parquet files, following the standard procedure with the nvt.workflow.
I am now having trouble training a two-tower model based largely on the examples provided in the notebooks, but with many more list features (such as genres in the MovieLens dataset).
Training starts and the loss decreases, but at the validation step I get an UnknownError. It appears to originate from an out-of-bounds positional index in the underlying cudf data, which in turn follows a StopIteration raised while the validation data is being evaluated.
UnknownError: Graph execution error:
2 root error(s) found.
(0) UNKNOWN: IndexError: single positional indexer is out-of-bounds
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/merlin/dataloader/loader_base.py", line 332, in _get_next_batch
batch = next(self._batch_itr)
StopIteration
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/ops/script_ops.py", line 267, in __call__
ret = func(*args)
File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/autograph/impl/api.py", line 642, in wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/data/ops/from_generator_op.py", line 198, in generator_py_func
values = next(generator_state.get_iterator(iterator_id))
File "/usr/local/lib/python3.10/dist-packages/keras/engine/data_adapter.py", line 902, in wrapped_generator
for data in generator_fn():
File "/usr/local/lib/python3.10/dist-packages/keras/engine/data_adapter.py", line 1049, in generator_fn
yield x[i]
File "/usr/local/lib/python3.10/dist-packages/merlin/dataloader/tensorflow.py", line 93, in __getitem__
return self.__next__()
File "/usr/local/lib/python3.10/dist-packages/merlin/dataloader/tensorflow.py", line 97, in __next__
converted_batch = self.convert_batch(super().__next__())
File "/usr/local/lib/python3.10/dist-packages/merlin/dataloader/loader_base.py", line 261, in __next__
return self._get_next_batch()
File "/usr/local/lib/python3.10/dist-packages/merlin/dataloader/loader_base.py", line 344, in _get_next_batch
batch = next(self._batch_itr)
File "/usr/local/lib/python3.10/dist-packages/merlin/dataloader/loader_base.py", line 369, in make_tensors
tensors_by_name = self._convert_df_to_tensors(gdf)
File "/usr/local/lib/python3.10/dist-packages/nvtx/nvtx.py", line 115, in inner
result = func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/merlin/dataloader/loader_base.py", line 548, in _convert_df_to_tensors
if isinstance(leaves[0], list):
File "/usr/local/lib/python3.10/dist-packages/nvtx/nvtx.py", line 115, in inner
result = func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/cudf/core/series.py", line 1293, in __getitem__
return self.loc[arg]
File "/usr/local/lib/python3.10/dist-packages/nvtx/nvtx.py", line 115, in inner
result = func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/cudf/core/series.py", line 270, in __getitem__
return self._frame.iloc[arg]
File "/usr/local/lib/python3.10/dist-packages/nvtx/nvtx.py", line 115, in inner
result = func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/cudf/core/series.py", line 187, in __getitem__
data = self._frame._get_elements_from_column(arg)
File "/usr/local/lib/python3.10/dist-packages/cudf/core/single_column_frame.py", line 398, in _get_elements_from_column
return self._column.element_indexing(int(arg))
File "/usr/local/lib/python3.10/dist-packages/cudf/core/column/column.py", line 539, in element_indexing
raise IndexError("single positional indexer is out-of-bounds")
IndexError: single positional indexer is out-of-bounds
[[{{node PyFunc}}]]
[[IteratorGetNext]]
[[retrieval_model_v2_1/parallel_block_5/encoder_2/prepare_features_3/prepare_list_features_3/StatefulPartitionedCall_16/RaggedFromRowSplits/RowPartitionFromRowSplits/assert_non_negative/assert_less_equal/Assert/Assert/data_0/_2484]]
(1) UNKNOWN: IndexError: single positional indexer is out-of-bounds
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/merlin/dataloader/loader_base.py", line 332, in _get_next_batch
batch = next(self._batch_itr)
StopIteration
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/ops/script_ops.py", line 267, in __call__
ret = func(*args)
File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/autograph/impl/api.py", line 642, in wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/data/ops/from_generator_op.py", line 198, in generator_py_func
values = next(generator_state.get_iterator(iterator_id))
File "/usr/local/lib/python3.10/dist-packages/keras/engine/data_adapter.py", line 902, in wrapped_generator
for data in generator_fn():
File "/usr/local/lib/python3.10/dist-packages/keras/engine/data_adapter.py", line 1049, in generator_fn
yield x[i]
File "/usr/local/lib/python3.10/dist-packages/merlin/dataloader/tensorflow.py", line 93, in __getitem__
return self.__next__()
File "/usr/local/lib/python3.10/dist-packages/merlin/dataloader/tensorflow.py", line 97, in __next__
converted_batch = self.convert_batch(super().__next__())
File "/usr/local/lib/python3.10/dist-packages/merlin/dataloader/loader_base.py", line 261, in __next__
return self._get_next_batch()
File "/usr/local/lib/python3.10/dist-packages/merlin/dataloader/loader_base.py", line 344, in _get_next_batch
batch = next(self._batch_itr)
File "/usr/local/lib/python3.10/dist-packages/merlin/dataloader/loader_base.py", line 369, in make_tensors
tensors_by_name = self._convert_df_to_tensors(gdf)
File "/usr/local/lib/python3.10/dist-packages/nvtx/nvtx.py", line 115, in inner
result = func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/merlin/dataloader/loader_base.py", line 548, in _convert_df_to_tensors
if isinstance(leaves[0], list):
File "/usr/local/lib/python3.10/dist-packages/nvtx/nvtx.py", line 115, in inner
result = func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/cudf/core/series.py", line 1293, in __getitem__
return self.loc[arg]
File "/usr/local/lib/python3.10/dist-packages/nvtx/nvtx.py", line 115, in inner
result = func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/cudf/core/series.py", line 270, in __getitem__
return self._frame.iloc[arg]
File "/usr/local/lib/python3.10/dist-packages/nvtx/nvtx.py", line 115, in inner
result = func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/cudf/core/series.py", line 187, in __getitem__
data = self._frame._get_elements_from_column(arg)
File "/usr/local/lib/python3.10/dist-packages/cudf/core/single_column_frame.py", line 398, in _get_elements_from_column
return self._column.element_indexing(int(arg))
File "/usr/local/lib/python3.10/dist-packages/cudf/core/column/column.py", line 539, in element_indexing
raise IndexError("single positional indexer is out-of-bounds")
IndexError: single positional indexer is out-of-bounds
[[{{node PyFunc}}]]
[[IteratorGetNext]]
0 successful operations.
0 derived errors ignored. [Op:__inference_test_function_919093]
I then tried iterating over the validation dataset directly and found, to my surprise, that even mm.Loader cannot iterate over it correctly.
In other words, I cannot consume all the batches from the dataset unless I set batch_size to 1, which divides any number of rows.
Indeed, this simple loop raises StopIteration:
for batch in mm.Loader(validation, batch_size=512):
    pass
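To make the divisibility observation concrete, here is a minimal sketch of the batch arithmetic I would expect. It assumes the loader keeps a final partial batch instead of dropping it, which is my assumption and not something I have verified in the Merlin source. `batch_layout` is a hypothetical helper for illustration, not a Merlin API.

```python
import math


def batch_layout(num_rows: int, batch_size: int) -> tuple[int, int]:
    """Return (number_of_batches, size_of_last_batch).

    Assumes the remainder rows are kept as one smaller final batch.
    """
    if num_rows <= 0 or batch_size <= 0:
        raise ValueError("num_rows and batch_size must be positive")
    num_batches = math.ceil(num_rows / batch_size)
    # Rows left over after all the full batches have been taken.
    last_batch = num_rows - (num_batches - 1) * batch_size
    return num_batches, last_batch


# 1000 rows with batch_size=512: one full batch plus a partial one.
print(batch_layout(1000, 512))
# With batch_size=1 every batch is full, whatever the row count.
print(batch_layout(1000, 1))
```

If the failure only shows up when the last batch is partial, that would be consistent with batch_size=1 being the only setting that works for me.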
I hope this is a mistake on my side. I didn't call the shuffle_by_keys method on the loaded dataset, nor when creating it. Is this related?
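In case it helps with reproducing this, a small helper along these lines could report how many batches are read before the failure. It is only a sketch: `count_batches` is a hypothetical name, and it assumes the loader follows the normal Python iterator protocol, where StopIteration marks a clean end of the data and any other exception is a real failure.

```python
def count_batches(loader):
    """Consume an iterable batch by batch.

    Returns (batches_read, error), where error is None on a clean end
    and the caught exception otherwise.
    """
    iterator = iter(loader)
    consumed = 0
    while True:
        try:
            next(iterator)
        except StopIteration:
            # Normal end of iteration: every batch was consumed.
            return consumed, None
        except Exception as exc:
            # Anything else (for example an IndexError) is a real failure.
            return consumed, exc
        consumed += 1
```

Calling `count_batches(mm.Loader(validation, batch_size=512))` and comparing the first value with the expected number of batches should show whether the error occurs on the final batch or earlier.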