NVIDIA Merlin is an open source library providing end-to-end GPU-accelerated recommender systems, from feature engineering and preprocessing to training deep learning models and running inference in production.
Apache License 2.0
[BUG] TypeError: function is not supported for this dtype: size at getting-started-movielens #1103
Bug description

I encountered an issue while running the code from the getting-started-movielens folder in my Kaggle notebook. I completed the "Download and Convert" notebook successfully, but ran into a problem during the "ETL with NVTabular" step.
Steps/Code to reproduce bug

Set Up Environment

I started by copying the code from the "Download and Convert" notebook into my Kaggle notebook. I then continued with the code from the "ETL with NVTabular" notebook, which is where the issue arises.

Modify the Code

In the notebook, I updated the `INPUT_DATA_DIR` variable to point to the correct path in my Kaggle notebook. For example:

```python
INPUT_DATA_DIR = '/kaggle/working/data'
```

All other code remains unchanged.

Run the Notebook

I executed the notebook cells in sequence. The error occurs when running the following line:

```python
workflow.fit(train_dataset)
```

The error message received is: `TypeError: function is not supported for this dtype: size`.
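For context, `workflow` here is the NVTabular workflow that the ETL notebook builds before fitting it on the training split. The following is only a minimal sketch of that kind of workflow, not the notebook verbatim; the column names come from MovieLens and the `train.parquet` path is an assumption:

```python
import nvtabular as nvt
from nvtabular import ops

# Categorical id columns (and the list-valued "genres" column) are encoded
# with Categorify; fitting the workflow computes the category statistics.
cat_features = ["userId", "movieId", "genres"] >> ops.Categorify()
workflow = nvt.Workflow(cat_features)

# Assumed output location of the "Download and Convert" notebook.
train_dataset = nvt.Dataset("/kaggle/working/data/train.parquet")
workflow.fit(train_dataset)  # the call that raises the TypeError
```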
Expected behavior
Could you please assist in resolving this issue?
Environment Details
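(No environment details were given. The versions relevant to this trace can be printed with a short snippet like the one below; the distribution names are assumptions and may carry a CUDA suffix such as `cudf-cu12` depending on how the Kaggle image installs them.)

```python
# Print the versions of the packages that appear in the traceback.
import importlib.metadata as md

for pkg in ("nvtabular", "merlin-core", "cudf", "dask", "distributed"):
    try:
        print(pkg, md.version(pkg))
    except md.PackageNotFoundError:
        print(pkg, "not found under this distribution name")
```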
Additional context
Full Error
```
TypeError                                 Traceback (most recent call last)
File:1

File /opt/conda/lib/python3.10/site-packages/nvtabular/workflow/workflow.py:213, in Workflow.fit(self, dataset)
    199 def fit(self, dataset: Dataset) -> "Workflow":
    200     """Calculates statistics for this workflow on the input dataset
    201
    202     Parameters
    (...)
    211         This Workflow with statistics calculated on it
    212     """
--> 213     self.executor.fit(dataset, self.graph)
    214     return self

File /opt/conda/lib/python3.10/site-packages/merlin/dag/executors.py:466, in DaskExecutor.fit(self, dataset, graph, refit)
    462 if not current_phase:
    463     # this shouldn't happen, but lets not infinite loop just in case
    464     raise RuntimeError("failed to find dependency-free StatOperator to fit")
--> 466 self.fit_phase(dataset, current_phase)
    468 # Remove all the operators we processed in this phase, and remove
    469 # from the dependencies of other ops too
    470 for node in current_phase:

File /opt/conda/lib/python3.10/site-packages/merlin/dag/executors.py:532, in DaskExecutor.fit_phase(self, dataset, nodes, strict)
    530     stats.append(node.op.fit(node.input_columns, Dataset(ddf)))
    531 else:
--> 532     stats.append(node.op.fit(node.input_columns, transformed_ddf))
    533 except Exception:
    534     LOG.exception("Failed to fit operator %s", node.op)

File /opt/conda/lib/python3.10/site-packages/nvtx/nvtx.py:116, in annotate.__call__.<locals>.inner(*args, **kwargs)
    113 @wraps(func)
    114 def inner(*args, **kwargs):
    115     libnvtx_push_range(self.attributes, self.domain.handle)
--> 116     result = func(*args, **kwargs)
    117     libnvtx_pop_range(self.domain.handle)
    118     return result

File /opt/conda/lib/python3.10/site-packages/nvtabular/ops/categorify.py:400, in Categorify.fit(self, col_selector, ddf)
    391 # Define a rough row-count at which we are likely to
    392 # start hitting memory-pressure issues that cannot
    393 # be accommodated with smaller partition sizes.
    394 # By default, we estimate a "problematic" cardinality
    395 # to be one that consumes >12.5% of the total memory.
    396 self.cardinality_memory_limit = parse_bytes(
    397     self.cardinality_memory_limit or int(device_mem_size(kind="total", cpu=_cpu) * 0.125)
    398 )
--> 400 dsk, key = _category_stats(ddf, self._create_fit_options_from_columns(columns))
    401 return Delayed(key, dsk)

File /opt/conda/lib/python3.10/site-packages/nvtabular/ops/categorify.py:1551, in _category_stats(ddf, options)
   1549 if options.agg_cols == [] and options.agg_list == []:
   1550     options.agg_list = ["size"]
-> 1551     return _groupby_to_disk(ddf, _write_uniques, options)
   1553 # Otherwise, getting category-statistics
   1554 if isinstance(options.agg_cols, str):

File /opt/conda/lib/python3.10/site-packages/nvtabular/ops/categorify.py:1406, in _groupby_to_disk(ddf, write_func, options)
   1402 # Use map_partitions to improve task fusion
   1403 grouped = ddf.to_bag(format="frame").map_partitions(
   1404     _top_level_groupby, options=options, token="level_1"
   1405 )
-> 1406 _grouped_meta = _top_level_groupby(ddf._meta, options=options)
   1407 _grouped_meta_col = {}
   1409 dsk_split = defaultdict(dict)

File /opt/conda/lib/python3.10/site-packages/nvtx/nvtx.py:116, in annotate.__call__.<locals>.inner(*args, **kwargs)
    113 @wraps(func)
    114 def inner(*args, **kwargs):
    115     libnvtx_push_range(self.attributes, self.domain.handle)
--> 116     result = func(*args, **kwargs)
    117     libnvtx_pop_range(self.domain.handle)
    118     return result

File /opt/conda/lib/python3.10/site-packages/nvtabular/ops/categorify.py:1017, in _top_level_groupby(df, options, spill)
   1015 df_gb = _maybe_flatten_list_column(cat_col_selector.names[0], df_gb)
   1016 # NOTE: groupby(..., dropna=False) requires pandas>=1.1.0
-> 1017 gb = df_gb.groupby(cat_col_selector.names, dropna=False).agg(agg_dict)
   1018 gb.columns = [
   1019     _make_name((tuple(cat_col_selector.names) + name[1:]), sep=options.name_sep)
   1020     if name[0] == cat_col_selector.names[0]
   1021     else _make_name((tuple(cat_col_selector.names) + name), sep=options.name_sep)
   1022     for name in gb.columns.to_flat_index()
   1023 ]
   1024 gb.reset_index(inplace=True, drop=False)

File /opt/conda/lib/python3.10/site-packages/cudf/utils/performance_tracking.py:51, in _performance_tracking.<locals>.wrapper(*args, **kwargs)
     43 if nvtx.enabled():
     44     stack.enter_context(
     45         nvtx.annotate(
     46             message=func.__qualname__,
    (...)
     49         )
     50     )
---> 51 return func(*args, **kwargs)

File /opt/conda/lib/python3.10/site-packages/cudf/core/groupby/groupby.py:629, in GroupBy.agg(self, func)
    619 orig_dtypes = tuple(c.dtype for c in columns)
    621 # Note: When there are no key columns, the below produces
    622 # an Index with float64 dtype, while Pandas returns
    623 # an Index with int64 dtype.
    624 # (GH: 6945)
    625 (
    626     result_columns,
    627     grouped_key_cols,
    628     included_aggregations,
--> 629 ) = self._groupby.aggregate(columns, normalized_aggs)
    631 result_index = self.grouping.keys._from_columns_like_self(
    632     grouped_key_cols,
    633 )
    635 multilevel = _is_multi_agg(func)

File groupby.pyx:192, in cudf._lib.groupby.GroupBy.aggregate()

TypeError: function is not supported for this dtype: size
```
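The last frames show Categorify's statistics pass calling `groupby(..., dropna=False).agg(...)` with a `"size"` aggregation in cuDF, and cuDF rejecting `size` for one of the column dtypes. A hypothetical, self-contained check that mirrors the same call pattern outside NVTabular is below; the column names and values are made up, and whether it raises depends on the cudf build in the Kaggle image:

```python
import cudf

# Mirrors the failing pattern from categorify.py:1017:
#   df_gb.groupby(cols, dropna=False).agg(agg_dict)  with agg_list = ["size"]
df = cudf.DataFrame({"userId": [1, 1, 2], "movieId": [10, 20, 10]})

# On a healthy cudf install this prints per-group row counts. If it raises
# "TypeError: function is not supported for this dtype: size", the problem
# reproduces without NVTabular and points at the installed cudf/NVTabular
# version combination rather than at the notebook code.
print(df.groupby(["userId"], dropna=False).agg({"movieId": ["size"]}))
```

If this minimal call succeeds, the failure is more likely tied to a specific dtype the workflow feeds into Categorify (for example the list-valued genres column), under the assumptions noted above.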