NVIDIA-Merlin / NVTabular

NVTabular is a feature engineering and preprocessing library for tabular data designed to quickly and easily manipulate terabyte scale datasets used to train deep learning based recommender systems.

[BUG] Reading parquet dataset on GPU throws "cudf engine doesn't support the following keyword arguments: ['strings_to_categorical']" error #1873

Open orlev2 opened 8 months ago

orlev2 commented 8 months ago

Describe the bug
Reading a parquet dataset on GPU throws a "cudf engine doesn't support the following keyword arguments: ['strings_to_categorical']" error. Reading the data on CPU runs successfully:

distributed.worker - WARNING - Compute Failed
Key:       _sample_row_group-1d48e09b-5b56-4e62-92ec-860ff2f9dd40
Function:  execute_task
args:      ((<function apply at 0x7f6b6e47f010>, <function _sample_row_group at 0x7f692006f130>, ['path/to/parquet_files/000000000000.parquet', <gcsfs.core.GCSFileSystem object at 0x7f6ae65ec6d0>], (<class 'dict'>, [['cpu', False], ['memory_usage', True]])))
kwargs:    {}
Exception: 'ValueError("cudf engine doesn\'t support the following keyword arguments: [\'strings_to_categorical\']")'

Steps/Code to reproduce bug

import nvtabular as nvt

dataset = 'path/to/parquet_files/*.parquet'

dataset_nvt = nvt.Dataset(dataset, engine='parquet', cpu=False)
# Fails with ValueError: cudf engine doesn't support the following keyword arguments: ['strings_to_categorical']

dataset_nvt = nvt.Dataset(dataset, engine='parquet', cpu=True)
# Runs successfully

Expected behavior
The dataset should be read from file under both cpu=True and cpu=False.

Environment details (please complete the following information):

nvtabular == 23.8.00
cudf == 23.10.02 (the above error was also present under 23.12.01)
dask == 2023.9.2

@niraj06

orlev2 commented 8 months ago

The following workaround successfully loads the data:

import dask_cudf
import nvtabular as nvt

dataset = 'path/to/parquet_files/*.parquet'
dataset_nvt = nvt.Dataset(
    dask_cudf.read_parquet(dataset), engine='parquet', cpu=False
)
# <merlin.io.dataset.Dataset at 0x7f99bc042230>

However, applying the workflow transform fails:

workflow = nvt.Workflow.load("path/to/workflow")
workflow.transform(dataset_nvt)
# Exception: "TypeError('String Arrays is not yet implemented in cudf')"

Full error:

Key:       ('transform-bdc9b5878b9eff9e4e8eb287f652e68a', 63)
Function:  subgraph_callable-6a50eb3e-1830-40d8-bff7-0a6db4e7
args:      ([<Node SelectionOp>], 'read-parquet-070e46c56ae3f13e04d07d8cae7b3f14', {'piece': ('path/to/parquet_files/000000000000.parquet', None, None)})
kwargs:    {}
Exception: "TypeError('String Arrays is not yet implemented in cudf')"

The workflow includes nvt.ops.Categorify and nvt.ops.Groupby operations to create a string array of sequential events per grouped entity.
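
For context, a minimal sketch of such a workflow (the column names here are hypothetical, for illustration only, not taken from the actual workflow):

import nvtabular as nvt
from nvtabular import ops

# Hypothetical columns: encode each event, then collect the encoded
# events into an ordered list per entity.
cat_events = ["event_id"] >> ops.Categorify()
grouped = (cat_events + ["entity_id", "timestamp"]) >> ops.Groupby(
    groupby_cols=["entity_id"],
    sort_cols=["timestamp"],
    aggs={"event_id": ["list"]},  # one list of sequential events per entity
)
workflow = nvt.Workflow(grouped)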

rjzamora commented 6 months ago

Sorry for this ridiculously late response @orlev2 - Just coming across this now.

As far as I can tell, the rapids/dask pinning in Merlin has been far too loose. NVTabular 23.8 was definitely not tested with cudf>=23.08 or dask>=2023.8.

The Merlin 23.08 containers use cudf-23.04 (which uses dask-2023.1.1), so using one of those containers is your best bet.
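
A quick way to sanity-check that an environment matches that combination (an illustrative snippet; the expected versions are the ones mentioned above):

import cudf
import dask
import nvtabular

# Print the installed versions to compare against the tested combination.
print("nvtabular:", nvtabular.__version__)  # expect 23.8.x
print("cudf:", cudf.__version__)            # expect 23.04.x
print("dask:", dask.__version__)            # expect 2023.1.1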

NOTE: The lack of upper pinning in NVTabular is indeed a "bug" of sorts - I apologize for that.