NVTabular is a feature engineering and preprocessing library for tabular data, designed to quickly and easily manipulate terabyte-scale datasets used to train deep-learning-based recommender systems.
Index error with Categorify on transform step for columns with 100% NaNs #1865
I was running a `workflow.transform(sampled_dataset)` step on a sample of my inference dataset and received the following error:
```
Traceback (most recent call last):
  File "/databricks/python/lib/python3.8/site-packages/nvtabular/ops/categorify.py", line 510, in transform
    encoded = _encode(
  File "/databricks/python/lib/python3.8/site-packages/nvtabular/ops/categorify.py", line 1707, in _encode
    if isinstance(df[cl].dropna().iloc[0], (np.ndarray, list)):
  File "/databricks/python/lib/python3.8/site-packages/pandas/core/indexing.py", line 1073, in __getitem__
    return self._getitem_axis(maybe_callable, axis=axis)
  File "/databricks/python/lib/python3.8/site-packages/pandas/core/indexing.py", line 1625, in _getitem_axis
    self._validate_integer(key, axis)
  File "/databricks/python/lib/python3.8/site-packages/pandas/core/indexing.py", line 1557, in _validate_integer
    raise IndexError("single positional indexer is out-of-bounds")
IndexError: single positional indexer is out-of-bounds

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/databricks/python/lib/python3.8/site-packages/merlin/dag/executors.py", line 237, in _run_node_transform
    transformed_data = node.op.transform(selection, input_data)
  File "/databricks/python/lib/python3.8/site-packages/merlin/core/dispatch.py", line 69, in inner2
    return func(*args, **kwargs)
  File "/databricks/python/lib/python3.8/site-packages/nvtabular/ops/categorify.py", line 534, in transform
    raise RuntimeError(f"Failed to categorical encode column {name}") from e
RuntimeError: Failed to categorical encode column my_categorical_column
```
I noticed this happens when the dataset being transformed has a categorical column (`my_categorical_column`) that is 100% NaNs. It looks like the failure occurs when this line is reached 👇 where we do a `dropna()` followed by `iloc[0]`:

https://github.com/NVIDIA-Merlin/NVTabular/blob/ee21af08dc7def662a661dbea35957af43e91a09/nvtabular/ops/categorify.py#L1707
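For reference, a minimal sketch of what goes wrong at that line: on a column that is entirely NaN, `dropna()` returns an empty Series, so `iloc[0]` has nothing to index (the column name below is just illustrative):

```python
import numpy as np
import pandas as pd

# A categorical column that is 100% NaN, as in my sampled dataset.
df = pd.DataFrame({"my_categorical_column": [np.nan, np.nan, np.nan]})

# Mirrors the check in categorify.py: dropna() leaves an empty Series,
# so iloc[0] raises the IndexError seen in the traceback above.
isinstance(df["my_categorical_column"].dropna().iloc[0], (np.ndarray, list))
```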
It's not a huge blocker for me right now, as this mostly happens on dataset samples, but I'm wondering whether that behavior is expected. Any thoughts? 😃
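In case it helps, here is a minimal sketch of a guard that would sidestep the empty-Series lookup. This is hypothetical, assuming an all-NaN column should simply be treated as a non-list column; the real fix may need to be different:

```python
import numpy as np
import pandas as pd

def is_list_column(series: pd.Series) -> bool:
    """Hypothetical guard: check for non-null values before indexing,
    so an all-NaN column is treated as a non-list column instead of
    raising IndexError on an empty Series."""
    non_null = series.dropna()
    return len(non_null) > 0 and isinstance(non_null.iloc[0], (np.ndarray, list))

print(is_list_column(pd.Series([np.nan, np.nan])))  # False, no IndexError
print(is_list_column(pd.Series([[1, 2], np.nan])))  # True
```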