Open LoMarrujo opened 3 months ago
I have been struggling with the exact same issue for the last few days. Here is the example code I wrote while trying to debug it:
import pandas as pd
import nvtabular as nvt
from nvtabular import ops
import cudf
# Sample Data
data = {
    'user_id': [16908, 16908, 16908, 16908, 16908],
    'item_id': [174, 78, 94, 174, 78],
    'timestamp': [
        '2024-01-03 14:49:27',
        '2024-01-03 15:33:31',
        '2024-01-03 16:01:57',
        '2024-01-04 18:57:33',
        '2024-01-04 18:59:41'
    ],
    'event_type': [
        'example1',
        'example2',
        'example3',
        'example4',
        'example5'
    ]
}
df = pd.DataFrame(data)
df['user_id'] = df['user_id'].astype('int64')
df['item_id'] = df['item_id'].astype('int64')
df['timestamp'] = pd.to_datetime(df['timestamp']).astype('datetime64[s]')
print(df.head())
print(df.dtypes)
cdf = cudf.DataFrame.from_pandas(df)
cat_features = ['item_id'] >> ops.Categorify()
cat_workflow = nvt.Workflow(cat_features)
cat_dataset = nvt.Dataset(cdf)
try:
    cat_transformed = cat_workflow.fit_transform(cat_dataset).to_ddf().compute()
    print("After Categorify:")
    print(cat_transformed.head())
except Exception as e:
    print(f"Error during Categorify: {e}")
print("Unique values in item_id:")
print(cdf['item_id'].unique())
Output:
Failed to fit operator <nvtabular.ops.categorify.Categorify object at 0x7fa86ddac1f0>
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/merlin/dag/executors.py", line 532, in fit_phase
    stats.append(node.op.fit(node.input_columns, transformed_ddf))
  File "/opt/conda/lib/python3.10/site-packages/nvtx/nvtx.py", line 116, in inner
    result = func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/nvtabular/ops/categorify.py", line 400, in fit
    dsk, key = _category_stats(ddf, self._create_fit_options_from_columns(columns))
  File "/opt/conda/lib/python3.10/site-packages/nvtabular/ops/categorify.py", line 1551, in _category_stats
    return _groupby_to_disk(ddf, _write_uniques, options)
  File "/opt/conda/lib/python3.10/site-packages/nvtabular/ops/categorify.py", line 1406, in _groupby_to_disk
    _grouped_meta = _top_level_groupby(ddf._meta, options=options)
  File "/opt/conda/lib/python3.10/site-packages/nvtx/nvtx.py", line 116, in inner
    result = func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/nvtabular/ops/categorify.py", line 1017, in _top_level_groupby
    gb = df_gb.groupby(cat_col_selector.names, dropna=False).agg(agg_dict)
  File "/opt/conda/lib/python3.10/site-packages/nvtx/nvtx.py", line 116, in inner
    result = func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/cudf/core/groupby/groupby.py", line 631, in agg
    ) = self._groupby.aggregate(columns, normalized_aggs)
  File "groupby.pyx", line 192, in cudf._lib.groupby.GroupBy.aggregate
TypeError: function is not supported for this dtype: size
user_id item_id timestamp \
0 16908 174 2024-01-03 14:49:27
1 16908 78 2024-01-03 15:33:31
2 16908 94 2024-01-03 16:01:57
3 16908 174 2024-01-04 18:57:33
4 16908 78 2024-01-04 18:59:41
event_type
0 example1
1 example2
2 example3
3 example4
4 example5
user_id int64
item_id int64
timestamp datetime64[ns]
event_type object
dtype: object
Error during Categorify: function is not supported for this dtype: size
Unique values in item_id:
0 174
1 78
2 94
Name: item_id, dtype: int64
same issue
@ohorban can you please try your pipeline without this line (please remove it):
df['timestamp'] = pd.to_datetime(df['timestamp']).astype('datetime64[s]')
In our examples, we feed NVT pipelines a df whose timestamp column has an integer dtype, like here.
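For the sample data posted above, a minimal sketch of that conversion could look like the following. It assumes that epoch seconds in UTC are an acceptable integer representation for your pipeline; the helper name `to_epoch_seconds` is hypothetical and not part of NVTabular.

```python
from datetime import datetime, timezone


def to_epoch_seconds(timestamps):
    """Convert 'YYYY-MM-DD HH:MM:SS' strings to integer epoch seconds.

    Assumption: the strings carry no timezone and are interpreted as UTC.
    """
    fmt = "%Y-%m-%d %H:%M:%S"
    return [
        int(datetime.strptime(ts, fmt).replace(tzinfo=timezone.utc).timestamp())
        for ts in timestamps
    ]


# First two timestamps from the sample data in this thread.
timestamps = [
    "2024-01-03 14:49:27",
    "2024-01-03 15:33:31",
]
epoch = to_epoch_seconds(timestamps)
print(epoch)
```

The resulting list of plain integers can replace the string timestamps in `data` before the DataFrame is built, so the `timestamp` column ends up with an integer dtype instead of a datetime dtype.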
Error during Categorify: function is not supported for this dtype: size
@anuragreddygv323 can you please provide more details?
We also need a minimal reproducible example so we can reproduce your error. Thanks.
CUDA 12.1, Python 3.11
I installed cudf like this:
pip install \
    --extra-index-url=https://pypi.nvidia.com \
    cudf-cu12==24.8.* dask-cudf-cu12==24.8.* cuml-cu12==24.8.* \
    cugraph-cu12==24.8.* cuspatial-cu12==24.8.* cuproj-cu12==24.8.* \
    cuxfilter-cu12==24.8.* cucim-cu12==24.8.* pylibraft-cu12==24.8.* \
    raft-dask-cu12==24.8.* cuvs-cu12==24.8.* nx-cugraph-cu12==24.8.*
I'm trying to run the Transformers4Rec tutorial, and when I try to Categorify, it throws the above error.
I ran the example from the documentation and it gives me the same error:
import cudf
import nvtabular as nvt
df = cudf.DataFrame({
    'author': ['User_A', 'User_B', 'User_C', 'User_C', 'User_A', 'User_B', 'User_A'],
    'productID': [100, 101, 102, 101, 102, 103, 103],
    'label': [0, 0, 1, 1, 1, 0, 0]
})
dataset = nvt.Dataset(df)
CATEGORICAL_COLUMNS = ['author', 'productID']
cat_features = CATEGORICAL_COLUMNS >> nvt.ops.Categorify(
    freq_threshold={"author": 3, "productID": 2},
    num_buckets={"author": 10, "productID": 20})
proc = nvt.Workflow(cat_features)
proc.fit(dataset)
ddf = proc.transform(dataset).to_ddf()
print(ddf.compute())
@anuragreddygv323 we don't support cudf 24.8 yet. You can use one of our Docker images:
this one: nvcr.io/nvidia/merlin/merlin-tensorflow:23.08
or this one: nvcr.io/nvidia/merlin/merlin-tensorflow:nightly
pip install \
    --extra-index-url=https://pypi.nvidia.com \
    cudf-cu11==23.08
is throwing an error @rnyak
Installing cudf is not enough; you need dask-cudf as well. The cudf and dask-cudf versions in the 23.08 image are as follows:
cudf 23.4.0
dask 2023.1.1
dask-cuda 23.4.0
dask-cudf 23.4.0
I recommend using the Docker images.
Please refer to this page to install cudf: https://docs.rapids.ai/install/#pip
Your driver version must be compatible with the CUDA version, and therefore with the cudf version.
You can ask cudf-related questions (such as installation issues) in the rapids/cudf GitHub repo.
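To compare your environment against the versions listed above, a small sketch like the following may help. It only uses the standard library; the helper name `installed_versions` is hypothetical, and the distribution names in `PACKAGES` are an assumption (pip wheels from pypi.nvidia.com may be registered under suffixed names such as `cudf-cu11` or `cudf-cu12`, so adjust the list for your install).

```python
from importlib import metadata


def installed_versions(packages):
    """Return {distribution name: version string, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Assumed distribution names; edit to match how the packages were installed.
PACKAGES = ["nvtabular", "cudf", "dask-cudf", "dask-cuda", "dask"]

for name, version in installed_versions(PACKAGES).items():
    print(f"{name}: {version or 'not installed'}")
```

If the reported cudf or dask-cudf version differs from what the image ships, that mismatch is the first thing to rule out before debugging the workflow itself.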
I tried running NVTabular code related to this and this, but I could not get past the line of code with the Workflow.
The error is:
which occurs after calling Categorify.
Is there something I need to check in order to get NVTabular working? Is there any additional information I can provide to help solve this issue?
Thanks!