NVIDIA-Merlin / NVTabular

NVTabular is a feature engineering and preprocessing library for tabular data designed to quickly and easily manipulate terabyte scale datasets used to train deep learning based recommender systems.
Apache License 2.0
1.05k stars 143 forks source link

[BUG] Categorify `start_index` not handled by Inference Op `CategorifyTransform` #1800

Closed oliverholworthy closed 1 year ago

oliverholworthy commented 1 year ago

Describe the bug A clear and concise description of what the bug is.

When using Categorify with a non-default start_index value, the inference version of the operator returns a different result.

Steps/Code to reproduce bug Follow this guide http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports to craft a minimal bug report. This helps us reproduce the issue you're having and resolve the issue more quickly.

import numpy as np
import cudf
import nvtabular as nvt

input_tensors = {
    "a": np.array(["x", "y", "z"])
}
df = cudf.DataFrame(input_tensors)
cat_names = df.columns
cats = cat_names >> nvt.ops.Categorify(start_index=1)
workflow = nvt.Workflow(cats)
workflow.fit(nvt.Dataset(df))

feature_transformed = workflow.transform(df)["a"]
# => [2, 3, 4]

model_config = {}
inference_op = cats.op.inference_initialize(cats.input_columns, model_config)
output_tensors = inference_op.transform(cats.input_columns, input_tensors)

feature_transformed_inference = output_tensors["a"]
# => [1, 2, 3]

Expected behavior A clear and concise description of what you expected to happen.

Expect the inference version to return the same value as the regular operator transform

karlhigley commented 1 year ago

Seems like something we can probably fix in the cpp version of the op for now, but in the longer term I'm wondering if we can replace the cpp versions of the ops with NumPy/CuPy versions using TensorTable. We'll soon be able to mix and match ops that process DFs vs TTs in the same graph, so I also wonder if we could just use those as the single implementation of these transforms to avoid this kind of skew/mismatch between different versions.

oliverholworthy commented 1 year ago

One thing we could do is raise an exception at export time in systems to detect if there's a Categorify op being used with non-default args. Which would be preferable to returning results that are wrong and hard to debug.

rnyak commented 1 year ago

@oliverholworthy we no longer have start_index. we can check how systems work with new categorify op.

oliverholworthy commented 1 year ago

Yes, we removed start_index in #1692 . So I think we can close this issue now. We also disabled the use of the 'inference' version of ops in https://github.com/NVIDIA-Merlin/systems/pull/354 (at least for now), which reduces the chances of a mismatch between inference-time transform and workflow-time