NVIDIA-Merlin / NVTabular

NVTabular is a feature engineering and preprocessing library for tabular data designed to quickly and easily manipulate terabyte scale datasets used to train deep learning based recommender systems.
Apache License 2.0

Index error with Categorify on transform step for columns with 100% NaNs #1865

Open lecardozo opened 1 year ago

lecardozo commented 1 year ago

I was running a workflow.transform(sampled_dataset) step on a sample of my inference dataset and received the following error:

Traceback (most recent call last):
  File "/databricks/python/lib/python3.8/site-packages/nvtabular/ops/categorify.py", line 510, in transform
    encoded = _encode(
  File "/databricks/python/lib/python3.8/site-packages/nvtabular/ops/categorify.py", line 1707, in _encode
    if isinstance(df[cl].dropna().iloc[0], (np.ndarray, list)):
  File "/databricks/python/lib/python3.8/site-packages/pandas/core/indexing.py", line 1073, in __getitem__
    return self._getitem_axis(maybe_callable, axis=axis)
  File "/databricks/python/lib/python3.8/site-packages/pandas/core/indexing.py", line 1625, in _getitem_axis
    self._validate_integer(key, axis)
  File "/databricks/python/lib/python3.8/site-packages/pandas/core/indexing.py", line 1557, in _validate_integer
    raise IndexError("single positional indexer is out-of-bounds")
IndexError: single positional indexer is out-of-bounds

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/databricks/python/lib/python3.8/site-packages/merlin/dag/executors.py", line 237, in _run_node_transform
    transformed_data = node.op.transform(selection, input_data)
  File "/databricks/python/lib/python3.8/site-packages/merlin/core/dispatch.py", line 69, in inner2
    return func(*args, **kwargs)
  File "/databricks/python/lib/python3.8/site-packages/nvtabular/ops/categorify.py", line 534, in transform
    raise RuntimeError(f"Failed to categorical encode column {name}") from e
RuntimeError: Failed to categorical encode column my_categorical_column

I noticed this happens when the dataset to be transformed has a categorical column (my_categorical_column) that is 100% NaN. The failure appears to come from the line linked below 👇, where dropna() is followed by iloc[0]: when every value is NaN, dropna() returns an empty Series, so iloc[0] has nothing to index.

https://github.com/NVIDIA-Merlin/NVTabular/blob/ee21af08dc7def662a661dbea35957af43e91a09/nvtabular/ops/categorify.py#L1707
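
For anyone who wants to reproduce this without NVTabular, here is a minimal pandas-only sketch of what I think is going on. The DataFrame is a made-up stand-in for my sampled dataset, and first_value_is_list_like is a hypothetical helper I wrote for illustration; it is not NVTabular code, and treating an all-NaN column as "not list-like" is my assumption, not necessarily the behavior the maintainers would want.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the sampled dataset: a column that is entirely NaN.
df = pd.DataFrame({"my_categorical_column": [np.nan, np.nan, np.nan]})

# dropna() removes every row, leaving an empty Series.
non_null = df["my_categorical_column"].dropna()
print(len(non_null))

# Positional access on the empty Series raises the IndexError from the traceback.
error_message = None
try:
    non_null.iloc[0]
except IndexError as exc:
    error_message = str(exc)
    print(f"IndexError: {error_message}")


def first_value_is_list_like(series):
    """Hypothetical guarded version of the check (an assumption, not NVTabular code).

    Returns False when the column has no non-null values instead of
    indexing into an empty Series.
    """
    values = series.dropna()
    if values.empty:
        return False
    return isinstance(values.iloc[0], (np.ndarray, list))


print(first_value_is_list_like(df["my_categorical_column"]))
```

With a guard like this the all-NaN column would be handled as a plain scalar column rather than raising, though I'm not sure that is the right default for the encoder.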

It's not a huge blocker for me right now, as this mostly happens on dataset samples, but I'm wondering whether that behavior is expected. Any thoughts? 😃