NVIDIA-Merlin / models

Merlin Models is a collection of deep learning recommender system model reference implementations
https://nvidia-merlin.github.io/models/main/index.html
Apache License 2.0

[QST] `StopIteration` error in `model.fit` at validation time #1240

Open CarloNicolini opened 5 months ago

CarloNicolini commented 5 months ago

❓ Questions & Help

I am using the NVIDIA Merlin 23.08 TensorFlow Docker container.

I created my training and validation datasets and saved them to Parquet, following the standard procedure with an nvt.Workflow.

I am now having trouble training a two-tower model based largely on the examples in the notebooks, but with many more list features (such as genres in the MovieLens dataset). Training starts and the loss decreases, but at the validation step I get an UnknownError. It seems to originate from an out-of-bounds positional index in the underlying cudf data, which in turn follows a StopIteration raised while the validation data is evaluated.

UnknownError: Graph execution error:

2 root error(s) found.
  (0) UNKNOWN:  IndexError: single positional indexer is out-of-bounds
Traceback (most recent call last):

  File "/usr/local/lib/python3.10/dist-packages/merlin/dataloader/loader_base.py", line 332, in _get_next_batch
    batch = next(self._batch_itr)

StopIteration

During handling of the above exception, another exception occurred:

Traceback (most recent call last):

  File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/ops/script_ops.py", line 267, in __call__
    ret = func(*args)

  File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/autograph/impl/api.py", line 642, in wrapper
    return func(*args, **kwargs)

  File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/data/ops/from_generator_op.py", line 198, in generator_py_func
    values = next(generator_state.get_iterator(iterator_id))

  File "/usr/local/lib/python3.10/dist-packages/keras/engine/data_adapter.py", line 902, in wrapped_generator
    for data in generator_fn():

  File "/usr/local/lib/python3.10/dist-packages/keras/engine/data_adapter.py", line 1049, in generator_fn
    yield x[i]

  File "/usr/local/lib/python3.10/dist-packages/merlin/dataloader/tensorflow.py", line 93, in __getitem__
    return self.__next__()

  File "/usr/local/lib/python3.10/dist-packages/merlin/dataloader/tensorflow.py", line 97, in __next__
    converted_batch = self.convert_batch(super().__next__())

  File "/usr/local/lib/python3.10/dist-packages/merlin/dataloader/loader_base.py", line 261, in __next__
    return self._get_next_batch()

  File "/usr/local/lib/python3.10/dist-packages/merlin/dataloader/loader_base.py", line 344, in _get_next_batch
    batch = next(self._batch_itr)

  File "/usr/local/lib/python3.10/dist-packages/merlin/dataloader/loader_base.py", line 369, in make_tensors
    tensors_by_name = self._convert_df_to_tensors(gdf)

  File "/usr/local/lib/python3.10/dist-packages/nvtx/nvtx.py", line 115, in inner
    result = func(*args, **kwargs)

  File "/usr/local/lib/python3.10/dist-packages/merlin/dataloader/loader_base.py", line 548, in _convert_df_to_tensors
    if isinstance(leaves[0], list):

  File "/usr/local/lib/python3.10/dist-packages/nvtx/nvtx.py", line 115, in inner
    result = func(*args, **kwargs)

  File "/usr/local/lib/python3.10/dist-packages/cudf/core/series.py", line 1293, in __getitem__
    return self.loc[arg]

  File "/usr/local/lib/python3.10/dist-packages/nvtx/nvtx.py", line 115, in inner
    result = func(*args, **kwargs)

  File "/usr/local/lib/python3.10/dist-packages/cudf/core/series.py", line 270, in __getitem__
    return self._frame.iloc[arg]

  File "/usr/local/lib/python3.10/dist-packages/nvtx/nvtx.py", line 115, in inner
    result = func(*args, **kwargs)

  File "/usr/local/lib/python3.10/dist-packages/cudf/core/series.py", line 187, in __getitem__
    data = self._frame._get_elements_from_column(arg)

  File "/usr/local/lib/python3.10/dist-packages/cudf/core/single_column_frame.py", line 398, in _get_elements_from_column
    return self._column.element_indexing(int(arg))

  File "/usr/local/lib/python3.10/dist-packages/cudf/core/column/column.py", line 539, in element_indexing
    raise IndexError("single positional indexer is out-of-bounds")

IndexError: single positional indexer is out-of-bounds

     [[{{node PyFunc}}]]
     [[IteratorGetNext]]
     [[retrieval_model_v2_1/parallel_block_5/encoder_2/prepare_features_3/prepare_list_features_3/StatefulPartitionedCall_16/RaggedFromRowSplits/RowPartitionFromRowSplits/assert_non_negative/assert_less_equal/Assert/Assert/data_0/_2484]]
  (1) UNKNOWN:  IndexError: single positional indexer is out-of-bounds
Traceback (most recent call last):

  File "/usr/local/lib/python3.10/dist-packages/merlin/dataloader/loader_base.py", line 332, in _get_next_batch
    batch = next(self._batch_itr)

StopIteration

During handling of the above exception, another exception occurred:

Traceback (most recent call last):

  File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/ops/script_ops.py", line 267, in __call__
    ret = func(*args)

  File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/autograph/impl/api.py", line 642, in wrapper
    return func(*args, **kwargs)

  File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/data/ops/from_generator_op.py", line 198, in generator_py_func
    values = next(generator_state.get_iterator(iterator_id))

  File "/usr/local/lib/python3.10/dist-packages/keras/engine/data_adapter.py", line 902, in wrapped_generator
    for data in generator_fn():

  File "/usr/local/lib/python3.10/dist-packages/keras/engine/data_adapter.py", line 1049, in generator_fn
    yield x[i]

  File "/usr/local/lib/python3.10/dist-packages/merlin/dataloader/tensorflow.py", line 93, in __getitem__
    return self.__next__()

  File "/usr/local/lib/python3.10/dist-packages/merlin/dataloader/tensorflow.py", line 97, in __next__
    converted_batch = self.convert_batch(super().__next__())

  File "/usr/local/lib/python3.10/dist-packages/merlin/dataloader/loader_base.py", line 261, in __next__
    return self._get_next_batch()

  File "/usr/local/lib/python3.10/dist-packages/merlin/dataloader/loader_base.py", line 344, in _get_next_batch
    batch = next(self._batch_itr)

  File "/usr/local/lib/python3.10/dist-packages/merlin/dataloader/loader_base.py", line 369, in make_tensors
    tensors_by_name = self._convert_df_to_tensors(gdf)

  File "/usr/local/lib/python3.10/dist-packages/nvtx/nvtx.py", line 115, in inner
    result = func(*args, **kwargs)

  File "/usr/local/lib/python3.10/dist-packages/merlin/dataloader/loader_base.py", line 548, in _convert_df_to_tensors
    if isinstance(leaves[0], list):

  File "/usr/local/lib/python3.10/dist-packages/nvtx/nvtx.py", line 115, in inner
    result = func(*args, **kwargs)

  File "/usr/local/lib/python3.10/dist-packages/cudf/core/series.py", line 1293, in __getitem__
    return self.loc[arg]

  File "/usr/local/lib/python3.10/dist-packages/nvtx/nvtx.py", line 115, in inner
    result = func(*args, **kwargs)

  File "/usr/local/lib/python3.10/dist-packages/cudf/core/series.py", line 270, in __getitem__
    return self._frame.iloc[arg]

  File "/usr/local/lib/python3.10/dist-packages/nvtx/nvtx.py", line 115, in inner
    result = func(*args, **kwargs)

  File "/usr/local/lib/python3.10/dist-packages/cudf/core/series.py", line 187, in __getitem__
    data = self._frame._get_elements_from_column(arg)

  File "/usr/local/lib/python3.10/dist-packages/cudf/core/single_column_frame.py", line 398, in _get_elements_from_column
    return self._column.element_indexing(int(arg))

  File "/usr/local/lib/python3.10/dist-packages/cudf/core/column/column.py", line 539, in element_indexing
    raise IndexError("single positional indexer is out-of-bounds")

IndexError: single positional indexer is out-of-bounds

     [[{{node PyFunc}}]]
     [[IteratorGetNext]]
0 successful operations.
0 derived errors ignored. [Op:__inference_test_function_919093]
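
The traceback above fails at `leaves[0]` inside `_convert_df_to_tensors`, which suggests that the flattened values of a list column were empty for the batch being converted. Whether empty list rows are the actual cause here is an assumption, not something the traceback proves. The following is a minimal sketch in plain Python (not cudf or the Merlin dataloader) of a check that finds batches whose list column flattens to nothing; the function name `empty_leaf_batches` and the sample `genres` data are hypothetical.

```python
from typing import Sequence


def empty_leaf_batches(rows: Sequence[Sequence], batch_size: int) -> list[int]:
    """Return the indices of batches whose list values flatten to zero elements.

    `rows` stands in for one list column (e.g. genres), one inner list per row.
    This only mimics sequential batching; it does not call Merlin or cudf.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    bad = []
    for batch_idx, start in enumerate(range(0, len(rows), batch_size)):
        batch = rows[start:start + batch_size]
        # Flatten the batch: these are the "leaves" of the list column.
        leaves = [value for row in batch for value in row]
        if not leaves:
            bad.append(batch_idx)
    return bad


# Hypothetical list column with a run of empty rows.
genres = [[1, 2], [3], [], [], [4], []]
print(empty_leaf_batches(genres, batch_size=2))  # [1]
```

If such a check reports any batch on the real validation data, indexing the first leaf of that batch would fail in the same way as the `IndexError` shown above.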

I then tried iterating over the validation dataset directly and found, to my surprise, that even mm.Loader cannot iterate over it correctly.

In other words, I cannot consume all the batches from the dataset unless I set batch_size to 1, which divides every row count. Indeed, this simple loop raises StopIteration:

for batch in mm.Loader(validation, batch_size=512):
    pass

I hope this is a mistake on my side. I did not call shuffle_by_keys on the dataset, either when loading it or when creating it. Could this be related?
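
To test the divisibility hypothesis above, it may help to compute how many rows the final batch would hold for a given row count and batch size. This is a minimal sketch of that arithmetic only; it assumes a loader that reads rows sequentially in fixed-size batches and does not call Merlin. The function name `batch_sizes` and the row count 1300 are hypothetical.

```python
def batch_sizes(num_rows: int, batch_size: int) -> list[int]:
    """Return the size of every batch a sequential fixed-size loader would yield."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    full_batches, remainder = divmod(num_rows, batch_size)
    sizes = [batch_size] * full_batches
    if remainder:
        # The last batch is partial when batch_size does not divide num_rows.
        sizes.append(remainder)
    return sizes


# Hypothetical validation set of 1300 rows with the batch size from the loop above.
print(batch_sizes(num_rows=1300, batch_size=512))  # [512, 512, 276]
```

Comparing the expected number of batches with the number actually consumed before the error would show whether the failure occurs on the partial last batch or earlier.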

rnyak commented 4 months ago

@CarloNicolini please provide a minimal reproducible example so that we can run and reproduce the issue you are facing.