NVIDIA-Merlin / NVTabular

NVTabular is a feature engineering and preprocessing library for tabular data designed to quickly and easily manipulate terabyte scale datasets used to train deep learning based recommender systems.
Apache License 2.0

[QST] NVTabular function is not supported for this dtype: size #1880

Open LoMarrujo opened 3 months ago

LoMarrujo commented 3 months ago

I tried running NVTabular code related to this and this, but I could not get past the line of code with the Workflow.

The error is:

```
File "groupby.pyx", line 192, in cudf._lib.groupby.GroupBy.aggregate
TypeError: function is not supported for this dtype: size
```

which occurs after calling Categorify.

Is there something I need to check to get NVTabular working? Is there any additional information I can provide to help solve this issue?

Thanks!

ohorban commented 3 months ago

I have been struggling with exactly the same issue for the last few days. Example code that I wrote while trying to debug:

```python
import pandas as pd
import nvtabular as nvt
from nvtabular import ops
import cudf

# Sample Data
data = {
    'user_id': [16908, 16908, 16908, 16908, 16908],
    'item_id': [174, 78, 94, 174, 78],
    'timestamp': [
        '2024-01-03 14:49:27',
        '2024-01-03 15:33:31',
        '2024-01-03 16:01:57',
        '2024-01-04 18:57:33',
        '2024-01-04 18:59:41'
    ],
    'event_type': [
        'example1',
        'example2',
        'example3',
        'example4',
        'example5'
    ]
}
df = pd.DataFrame(data)
df['user_id'] = df['user_id'].astype('int64')
df['item_id'] = df['item_id'].astype('int64')
df['timestamp'] = pd.to_datetime(df['timestamp']).astype('datetime64[s]')
print(df.head())
print(df.dtypes)

cdf = cudf.DataFrame.from_pandas(df)

cat_features = ['item_id'] >> ops.Categorify()

cat_workflow = nvt.Workflow(cat_features)
cat_dataset = nvt.Dataset(cdf)

try:
    cat_transformed = cat_workflow.fit_transform(cat_dataset).to_ddf().compute()
    print("After Categorify:")
    print(cat_transformed.head())
except Exception as e:
    print(f"Error during Categorify: {e}")

print("Unique values in item_id:")
print(cdf['item_id'].unique())
```

Output:

```
Failed to fit operator <nvtabular.ops.categorify.Categorify object at 0x7fa86ddac1f0>
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/merlin/dag/executors.py", line 532, in fit_phase
    stats.append(node.op.fit(node.input_columns, transformed_ddf))
  File "/opt/conda/lib/python3.10/site-packages/nvtx/nvtx.py", line 116, in inner
    result = func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/nvtabular/ops/categorify.py", line 400, in fit
    dsk, key = _category_stats(ddf, self._create_fit_options_from_columns(columns))
  File "/opt/conda/lib/python3.10/site-packages/nvtabular/ops/categorify.py", line 1551, in _category_stats
    return _groupby_to_disk(ddf, _write_uniques, options)
  File "/opt/conda/lib/python3.10/site-packages/nvtabular/ops/categorify.py", line 1406, in _groupby_to_disk
    _grouped_meta = _top_level_groupby(ddf._meta, options=options)
  File "/opt/conda/lib/python3.10/site-packages/nvtx/nvtx.py", line 116, in inner
    result = func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/nvtabular/ops/categorify.py", line 1017, in _top_level_groupby
    gb = df_gb.groupby(cat_col_selector.names, dropna=False).agg(agg_dict)
  File "/opt/conda/lib/python3.10/site-packages/nvtx/nvtx.py", line 116, in inner
    result = func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/cudf/core/groupby/groupby.py", line 631, in agg
    ) = self._groupby.aggregate(columns, normalized_aggs)
  File "groupby.pyx", line 192, in cudf._lib.groupby.GroupBy.aggregate
TypeError: function is not supported for this dtype: size

  user_id  item_id           timestamp  \
0    16908      174 2024-01-03 14:49:27   
1    16908       78 2024-01-03 15:33:31   
2    16908       94 2024-01-03 16:01:57   
3    16908      174 2024-01-04 18:57:33   
4    16908       78 2024-01-04 18:59:41   

                                          event_type  
0  example1
1  example2
2  example3
3  example4
4  example5
user_id                int64
item_id                int64
timestamp     datetime64[ns]
event_type            object
dtype: object
Error during Categorify: function is not supported for this dtype: size
Unique values in item_id:
0    174
1     78
2     94
Name: item_id, dtype: int64
```

Chevolier commented 2 months ago

same issue

rnyak commented 2 months ago

@ohorban can you please try your pipeline without this line (i.e., remove it): `df['timestamp'] = pd.to_datetime(df['timestamp']).astype('datetime64[s]')`

In our examples, we feed a df to NVT pipelines with an integer-dtype timestamp column, like here.
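
A minimal sketch of that suggestion (pandas only; the column names mirror the example above, and the conversion to integer epoch seconds via `// 10**9` is an assumption, not something stated in the thread) would be to build the integer timestamp column before constructing the `nvt.Dataset`:

```python
import pandas as pd

# Toy frame mirroring the example above
df = pd.DataFrame({
    "user_id": [16908, 16908],
    "item_id": [174, 78],
    "timestamp": ["2024-01-03 14:49:27", "2024-01-03 15:33:31"],
})

# Convert the strings to integer epoch seconds instead of datetime64[s]:
# pd.to_datetime gives nanoseconds since the epoch as int64, so divide by 10**9.
df["timestamp"] = pd.to_datetime(df["timestamp"]).astype("int64") // 10**9

print(df["timestamp"].dtype)   # int64
```

The resulting int64 column matches the integer timestamps used in the Merlin examples and avoids the datetime dtype in the groupby path.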

anuragreddygv323 commented 1 month ago

Error during Categorify: function is not supported for this dtype: size

rnyak commented 1 month ago

@anuragreddygv323 can you please provide more details?

We also need a minimal example so that we can reproduce your error. Thanks.

anuragreddygv323 commented 1 month ago

CUDA 12.1, Python 3.11

I installed cudf with:

```shell
pip install \
    --extra-index-url=https://pypi.nvidia.com \
    "cudf-cu12==24.8.*" "dask-cudf-cu12==24.8.*" "cuml-cu12==24.8.*" \
    "cugraph-cu12==24.8.*" "cuspatial-cu12==24.8.*" "cuproj-cu12==24.8.*" \
    "cuxfilter-cu12==24.8.*" "cucim-cu12==24.8.*" "pylibraft-cu12==24.8.*" \
    "raft-dask-cu12==24.8.*" "cuvs-cu12==24.8.*" "nx-cugraph-cu12==24.8.*"
```

I am trying to run the Transformers4Rec tutorial, and when I try to Categorify it throws the above error.

I ran the example from the documentation and it gives me the same error:

```python
import cudf
import nvtabular as nvt

# Create toy dataset
df = cudf.DataFrame({
    'author': ['User_A', 'User_B', 'User_C', 'User_C', 'User_A', 'User_B', 'User_A'],
    'productID': [100, 101, 102, 101, 102, 103, 103],
    'label': [0, 0, 1, 1, 1, 0, 0]
})
dataset = nvt.Dataset(df)

# Define pipeline
CATEGORICAL_COLUMNS = ['author', 'productID']
cat_features = CATEGORICAL_COLUMNS >> nvt.ops.Categorify(
    freq_threshold={"author": 3, "productID": 2},
    num_buckets={"author": 10, "productID": 20})

# Initialize the workflow and execute it
proc = nvt.Workflow(cat_features)
proc.fit(dataset)
ddf = proc.transform(dataset).to_ddf()

# Print results
print(ddf.compute())
```

rnyak commented 1 month ago

@anuragreddygv323 we don't support cudf 24.8 (yet). You can use one of our docker images:

this one: nvcr.io/nvidia/merlin/merlin-tensorflow:23.08 or this one: nvcr.io/nvidia/merlin/merlin-tensorflow:nightly
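
A minimal launch sketch for that route (an environment-setup fragment, assuming docker with the NVIDIA Container Toolkit is installed locally; the `/workspace` volume mount is illustrative, not from the thread):

```shell
# Pull one of the supported Merlin images named above
docker pull nvcr.io/nvidia/merlin/merlin-tensorflow:23.08

# Run it with GPU access, mounting the current directory into the container
docker run --gpus all --rm -it \
    -v "$PWD":/workspace \
    nvcr.io/nvidia/merlin/merlin-tensorflow:23.08 /bin/bash
```

Inside the container, the cudf / dask-cudf versions are already pinned to ones NVTabular supports.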

anuragreddygv323 commented 1 month ago

```shell
pip install \
    --extra-index-url=https://pypi.nvidia.com \
    cudf-cu11==23.08
```

is throwing an error @rnyak

rnyak commented 1 month ago

Installing cudf is not enough; you need dask-cudf as well. The cudf and dask-cudf versions in the 23.08 image are as follows:

```
cudf        23.4.0
dask        2023.1.1
dask-cuda   23.4.0
dask-cudf   23.4.0
```

I recommend using the docker images.

Please refer to this page to install cudf: https://docs.rapids.ai/install/#pip

Your driver version should be compatible with the CUDA version, and therefore with the cudf version.

You can ask cudf-related questions (such as installation issues) in the rapids/cudf GitHub repo.