orlev2 opened 8 months ago
The following workaround works for loading the data:

```python
import nvtabular as nvt
import dask_cudf

dataset = 'path/to/parquet_files/*.parquet'
dataset_nvt = nvt.Dataset(
    dask_cudf.read_parquet(dataset), engine='parquet', cpu=False
)
# <merlin.io.dataset.Dataset at 0x7f99bc042230>
```
However, applying the workflow transform fails:

```python
workflow = nvt.Workflow.load(f"path/to/workflow")
workflow.transform(dataset_nvt)
# Exception: "TypeError('String Arrays is not yet implemented in cudf')"
```
Full error:

```
Key: ('transform-bdc9b5878b9eff9e4e8eb287f652e68a', 63)
Function: subgraph_callable-6a50eb3e-1830-40d8-bff7-0a6db4e7
args: ([<Node SelectionOp>], 'read-parquet-070e46c56ae3f13e04d07d8cae7b3f14', {'piece': ('path/to/parquet_files/000000000000.parquet', None, None)})
kwargs: {}
Exception: "TypeError('String Arrays is not yet implemented in cudf')"
```
The workflow includes `nvt.ops.Categorify` and `nvt.ops.Groupby` operations to create a string array of sequential events per grouped entity.
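For context, here is a minimal pure-Python sketch (not NVTabular itself, and all names below are illustrative) of what such a Categorify-then-Groupby pipeline produces: string values encoded as integer ids, then collected into one sequence per grouped entity.

```python
from collections import defaultdict

# Toy event log of (user_id, event) pairs — an illustrative stand-in for
# the parquet columns fed to nvt.ops.Categorify and nvt.ops.Groupby.
events = [("u1", "click"), ("u1", "buy"), ("u2", "click"), ("u2", "view")]

# "Categorify": map each distinct string to a small integer id.
vocab = {}
def categorify(value):
    return vocab.setdefault(value, len(vocab))

# "Groupby": collect the encoded events into one sequence per entity.
sequences = defaultdict(list)
for user, event in events:
    sequences[user].append(categorify(event))

print(dict(sequences))  # {'u1': [0, 1], 'u2': [0, 2]}
```

The failing transform is the cudf-side equivalent of the aggregation step, which materializes list/string columns that newer cudf versions reject here.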
Sorry for this ridiculously late response @orlev2 - Just coming across this now.
As far as I can tell, the rapids/dask pinning in Merlin has been far too loose: NVTabular 23.8 was definitely not tested with cudf>=23.08 or dask>=2023.8. The merlin 23.08 containers use cudf-23.04 (which uses dask-2023.1.1), so using those versions is your best bet.
NOTE: The lack of upper pinning in NVTabular is indeed a "bug" of sorts - I apologize for that.
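A hedged sketch of a compatible pin set, assembled only from the versions named in the comment above (the exact cudf package name depends on your CUDA toolkit, e.g. `cudf-cu11`; verify against the Merlin 23.08 container before relying on it):

```text
# requirements fragment — versions taken from this thread, not an official pin list
nvtabular==23.8.*
cudf==23.04.*
dask==2023.1.1
```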
Describe the bug
Reading a parquet dataset on GPU throws a "cudf engine doesn't support the following keyword arguments: ['strings_to_categorical']" error. Reading the data on CPU runs successfully.
Steps/Code to reproduce bug
Expected behavior
The dataset should be read from file under both cpu=True and cpu=False.
Environment details (please complete the following information):
- nvtabular == 23.8.00
- cudf == 23.10.02 (the above error was also present under 23.12.01)
- dask == 2023.9.2
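When debugging mismatches like this, it helps to compare the installed versions against the known-good set. A minimal stdlib sketch, using the version strings reported in this issue (the "tested" values come from the maintainer comment above):

```python
def parse_version(v):
    # Split a version string like "23.10.02" into a comparable
    # tuple of ints: (23, 10, 2).
    return tuple(int(part) for part in v.split("."))

# Versions reported in this issue vs. the combination the
# Merlin 23.08 containers ship with.
installed = {"cudf": "23.10.02", "dask": "2023.9.2"}
tested = {"cudf": "23.04", "dask": "2023.1.1"}

for pkg, version in installed.items():
    if parse_version(version) > parse_version(tested[pkg]):
        print(f"{pkg} {version} is newer than the tested {tested[pkg]}")
```

Both reported packages are ahead of the tested versions, which is consistent with the errors above.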