NVIDIA-Merlin / NVTabular

NVTabular is a feature engineering and preprocessing library for tabular data designed to quickly and easily manipulate terabyte scale datasets used to train deep learning based recommender systems.

[BUG] Reading parquet dataset on GPU throws "cudf engine doesn't support the following keyword arguments: ['strings_to_categorical']" error #1873

Open orlev2 opened 8 months ago

orlev2 commented 8 months ago

Describe the bug
Reading a parquet dataset on GPU throws a "cudf engine doesn't support the following keyword arguments: ['strings_to_categorical']" error. Reading the data on CPU runs successfully:

distributed.worker - WARNING - Compute Failed
Key:       _sample_row_group-1d48e09b-5b56-4e62-92ec-860ff2f9dd40
Function:  execute_task
args:      ((<function apply at 0x7f6b6e47f010>, <function _sample_row_group at 0x7f692006f130>, ['path/to/parquet_files/000000000000.parquet', <gcsfs.core.GCSFileSystem object at 0x7f6ae65ec6d0>], (<class 'dict'>, [['cpu', False], ['memory_usage', True]])))
kwargs:    {}
Exception: 'ValueError("cudf engine doesn\'t support the following keyword arguments: [\'strings_to_categorical\']")'

Steps/Code to reproduce bug

import nvtabular as nvt

dataset = 'path/to/parquet_files/*.parquet'

dataset_nvt = nvt.Dataset(dataset, engine='parquet', cpu=False)
# Fails with ValueError: cudf engine doesn't support the following keyword arguments: ['strings_to_categorical']

dataset_nvt = nvt.Dataset(dataset, engine='parquet', cpu=True)
# Runs successfully

Expected behavior
The dataset should be read from file under both cpu=True and cpu=False.

Environment details (please complete the following information):

nvtabular == 23.8.00
cudf == 23.10.02 (the above error was also present under 23.12.01)
dask == 2023.9.2

@niraj06

orlev2 commented 8 months ago

The following workaround successfully loads the data:

import dask_cudf
import nvtabular as nvt

dataset = 'path/to/parquet_files/*.parquet'
dataset_nvt = nvt.Dataset(
    dask_cudf.read_parquet(dataset), engine='parquet', cpu=False
)
# <merlin.io.dataset.Dataset at 0x7f99bc042230>

However, applying the workflow transform fails:

workflow = nvt.Workflow.load("path/to/workflow")
workflow.transform(dataset_nvt)
# Exception: "TypeError('String Arrays is not yet implemented in cudf')"

Full error:

Key:       ('transform-bdc9b5878b9eff9e4e8eb287f652e68a', 63)
Function:  subgraph_callable-6a50eb3e-1830-40d8-bff7-0a6db4e7
args:      ([<Node SelectionOp>], 'read-parquet-070e46c56ae3f13e04d07d8cae7b3f14', {'piece': ('path/to/parquet_files/000000000000.parquet', None, None)})
kwargs:    {}
Exception: "TypeError('String Arrays is not yet implemented in cudf')"

The workflow includes nvt.ops.Categorify and nvt.ops.Groupby operations to create a string array of sequential events per grouped entity.
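
For context, a minimal sketch of such a workflow (the column names here are hypothetical, for illustration only, not taken from the actual workflow):

import nvtabular as nvt
from nvtabular import ops

# Hypothetical columns: encode each event, then collect the encoded
# events into an ordered list per entity.
cat_events = ["event_id"] >> ops.Categorify()
grouped = (cat_events + ["entity_id", "timestamp"]) >> ops.Groupby(
    groupby_cols=["entity_id"],
    sort_cols=["timestamp"],
    aggs={"event_id": ["list"]},  # one list of sequential events per entity
)
workflow = nvt.Workflow(grouped)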

rjzamora commented 6 months ago

Sorry for this ridiculously late response @orlev2 - Just coming across this now.

As far as I can tell, the rapids/dask pinning in Merlin has been far too loose. NVTabular 23.8 was definitely not tested with cudf>=23.08 or dask>=2023.8.

The Merlin 23.08 containers use cudf-23.04 (which uses dask-2023.1.1), so using one of those containers is your best bet.
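
A quick way to sanity-check that an environment matches that combination (an illustrative snippet; the expected versions are the ones mentioned above):

import cudf
import dask
import nvtabular

# Print the installed versions to compare against the tested combination.
print("nvtabular:", nvtabular.__version__)  # expect 23.8.x
print("cudf:", cudf.__version__)            # expect 23.04.x
print("dask:", dask.__version__)            # expect 2023.1.1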

NOTE: The lack of upper pinning in NVTabular is indeed a "bug" of sorts - I apologize for that.