NVIDIA-Merlin / Merlin

NVIDIA Merlin is an open source library providing end-to-end GPU-accelerated recommender systems, from feature engineering and preprocessing to training deep learning models and running inference in production.

[BUG] TypeError: function is not supported for this dtype: size at getting-started-movielens #1103

Open qrcodeTH opened 3 weeks ago

qrcodeTH commented 3 weeks ago

Bug description

I encountered an issue while running the code from the getting-started-movielens folder in a Kaggle notebook. I completed the "Download and Convert" notebook successfully, but ran into a problem during the "ETL with NVTabular" step.

Steps/Code to reproduce bug

  1. Set up the environment: I started by running the code from the "Download and Convert" notebook in my Kaggle notebook, then continued with the code from the "ETL with NVTabular" notebook, which is where the issue arises.

  2. Modify the code: In the notebook, I updated the `INPUT_DATA_DIR` variable to point to the correct path in my Kaggle notebook. For example:

```python
INPUT_DATA_DIR = '/kaggle/working/data'
```

All other code remains unchanged.

  3. Run the notebook: I executed the notebook cells in sequence. The error occurs when running the following line (a condensed sketch of the surrounding workflow code follows this list):

```python
workflow.fit(train_dataset)
```

The error message received is: `TypeError: function is not supported for this dtype: size`.
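
For reference, here is a condensed sketch of the notebook code around the failing call. It is abbreviated and partly my own reconstruction (the real notebook also joins the movies' genres via `JoinExternal` before `Categorify`); apart from `INPUT_DATA_DIR`, I did not change the notebook's code:

```python
import os
import nvtabular as nvt

INPUT_DATA_DIR = "/kaggle/working/data"

# train.parquet was produced by the "Download and Convert" notebook
train_dataset = nvt.Dataset(os.path.join(INPUT_DATA_DIR, "train.parquet"))

# Categorify the id columns as in the notebook (the joined genres list column is omitted here)
cat_features = ["userId", "movieId"] >> nvt.ops.Categorify()
# Binarize the rating target, as the notebook does
ratings = ["rating"] >> nvt.ops.LambdaOp(lambda col: (col > 3).astype("int8"))

workflow = nvt.Workflow(cat_features + ratings)
workflow.fit(train_dataset)  # <-- raises the TypeError below
```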

Expected behavior

`workflow.fit(train_dataset)` should compute the workflow statistics and complete without raising a TypeError. Could you please assist in resolving this issue?

Environment Details

Additional context

Full Error


```
TypeError                                 Traceback (most recent call last)
File :1

File /opt/conda/lib/python3.10/site-packages/nvtabular/workflow/workflow.py:213, in Workflow.fit(self, dataset)
    199 def fit(self, dataset: Dataset) -> "Workflow":
    200     """Calculates statistics for this workflow on the input dataset
    201
    202     Parameters
    (...)
    211         This Workflow with statistics calculated on it
    212     """
--> 213     self.executor.fit(dataset, self.graph)
    214     return self

File /opt/conda/lib/python3.10/site-packages/merlin/dag/executors.py:466, in DaskExecutor.fit(self, dataset, graph, refit)
    462     if not current_phase:
    463         # this shouldn't happen, but lets not infinite loop just in case
    464         raise RuntimeError("failed to find dependency-free StatOperator to fit")
--> 466     self.fit_phase(dataset, current_phase)
    468     # Remove all the operators we processed in this phase, and remove
    469     # from the dependencies of other ops too
    470     for node in current_phase:

File /opt/conda/lib/python3.10/site-packages/merlin/dag/executors.py:532, in DaskExecutor.fit_phase(self, dataset, nodes, strict)
    530         stats.append(node.op.fit(node.input_columns, Dataset(ddf)))
    531     else:
--> 532         stats.append(node.op.fit(node.input_columns, transformed_ddf))
    533 except Exception:
    534     LOG.exception("Failed to fit operator %s", node.op)

File /opt/conda/lib/python3.10/site-packages/nvtx/nvtx.py:116, in annotate.__call__.<locals>.inner(*args, **kwargs)
    113 @wraps(func)
    114 def inner(*args, **kwargs):
    115     libnvtx_push_range(self.attributes, self.domain.handle)
--> 116     result = func(*args, **kwargs)
    117     libnvtx_pop_range(self.domain.handle)
    118     return result

File /opt/conda/lib/python3.10/site-packages/nvtabular/ops/categorify.py:400, in Categorify.fit(self, col_selector, ddf)
    391 # Define a rough row-count at which we are likely to
    392 # start hitting memory-pressure issues that cannot
    393 # be accommodated with smaller partition sizes.
    394 # By default, we estimate a "problematic" cardinality
    395 # to be one that consumes >12.5% of the total memory.
    396 self.cardinality_memory_limit = parse_bytes(
    397     self.cardinality_memory_limit or int(device_mem_size(kind="total", cpu=_cpu) * 0.125)
    398 )
--> 400 dsk, key = _category_stats(ddf, self._create_fit_options_from_columns(columns))
    401 return Delayed(key, dsk)

File /opt/conda/lib/python3.10/site-packages/nvtabular/ops/categorify.py:1551, in _category_stats(ddf, options)
   1549 if options.agg_cols == [] and options.agg_list == []:
   1550     options.agg_list = ["size"]
-> 1551     return _groupby_to_disk(ddf, _write_uniques, options)
   1553 # Otherwise, getting category-statistics
   1554 if isinstance(options.agg_cols, str):

File /opt/conda/lib/python3.10/site-packages/nvtabular/ops/categorify.py:1406, in _groupby_to_disk(ddf, write_func, options)
   1402 # Use map_partitions to improve task fusion
   1403 grouped = ddf.to_bag(format="frame").map_partitions(
   1404     _top_level_groupby, options=options, token="level_1"
   1405 )
-> 1406 _grouped_meta = _top_level_groupby(ddf._meta, options=options)
   1407 _grouped_meta_col = {}
   1409 dsk_split = defaultdict(dict)

File /opt/conda/lib/python3.10/site-packages/nvtx/nvtx.py:116, in annotate.__call__.<locals>.inner(*args, **kwargs)
    113 @wraps(func)
    114 def inner(*args, **kwargs):
    115     libnvtx_push_range(self.attributes, self.domain.handle)
--> 116     result = func(*args, **kwargs)
    117     libnvtx_pop_range(self.domain.handle)
    118     return result

File /opt/conda/lib/python3.10/site-packages/nvtabular/ops/categorify.py:1017, in _top_level_groupby(df, options, spill)
   1015 df_gb = _maybe_flatten_list_column(cat_col_selector.names[0], df_gb)
   1016 # NOTE: groupby(..., dropna=False) requires pandas>=1.1.0
-> 1017 gb = df_gb.groupby(cat_col_selector.names, dropna=False).agg(agg_dict)
   1018 gb.columns = [
   1019     _make_name((tuple(cat_col_selector.names) + name[1:]), sep=options.name_sep)
   1020     if name[0] == cat_col_selector.names[0]
   1021     else _make_name((tuple(cat_col_selector.names) + name), sep=options.name_sep)
   1022     for name in gb.columns.to_flat_index()
   1023 ]
   1024 gb.reset_index(inplace=True, drop=False)

File /opt/conda/lib/python3.10/site-packages/cudf/utils/performance_tracking.py:51, in _performance_tracking.<locals>.wrapper(*args, **kwargs)
     43 if nvtx.enabled():
     44     stack.enter_context(
     45         nvtx.annotate(
     46             message=func.__qualname__,
    (...)
     49         )
     50     )
---> 51 return func(*args, **kwargs)

File /opt/conda/lib/python3.10/site-packages/cudf/core/groupby/groupby.py:629, in GroupBy.agg(self, func)
    619 orig_dtypes = tuple(c.dtype for c in columns)
    621 # Note: When there are no key columns, the below produces
    622 # an Index with float64 dtype, while Pandas returns
    623 # an Index with int64 dtype.
    624 # (GH: 6945)
    625 (
    626     result_columns,
    627     grouped_key_cols,
    628     included_aggregations,
--> 629 ) = self._groupby.aggregate(columns, normalized_aggs)
    631 result_index = self.grouping.keys._from_columns_like_self(
    632     grouped_key_cols,
    633 )
    635 multilevel = _is_multi_agg(func)

File groupby.pyx:192, in cudf._lib.groupby.GroupBy.aggregate()

TypeError: function is not supported for this dtype: size
```
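
As a side note, the traceback bottoms out in cuDF's `GroupBy.agg` receiving a `"size"` aggregation. The snippet below is my own reduction of that call pattern (toy column names, with a list column standing in for the joined genres feature); whether it raises the same `TypeError` depends on the cuDF version in the Kaggle image:

```python
import cudf

# Toy frame: a key column plus a list column, roughly mirroring movieId/genres
df = cudf.DataFrame({"movieId": [10, 20, 10], "genres": [[1, 2], [3], [1]]})

# Per the traceback, Categorify's _top_level_groupby runs a
# groupby(..., dropna=False).agg(agg_dict) with agg_list = ["size"];
# this reproduces that call shape outside NVTabular.
print(df.groupby(["movieId"], dropna=False).agg({"genres": ["size"]}))
```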

adidwd commented 1 week ago

I have the same issue. Did you find a resolution? I am using the data and code exactly as given in the GitHub links.