NVIDIA-Merlin / NVTabular

NVTabular is a feature engineering and preprocessing library for tabular data designed to quickly and easily manipulate terabyte scale datasets used to train deep learning based recommender systems.
Apache License 2.0
1.05k stars 143 forks source link

[BUG] `Groupby` ags output columns (even if they are counts) as `Tags.CATEGORICAL` #1841

Open radekosmulski opened 1 year ago

radekosmulski commented 1 year ago

Describe the bug The output of the Categorify operator is tagged as Tags.CATEGORICAL even if those might just be counts. image

Steps/Code to reproduce bug Here is a full reproducer:

df = cudf.DataFrame(data={'id': [0, 0, 1, 2], 'cat': list(range(4)), 'val': [10, 20, 20, 30]})

ds = nvt.Dataset(df)

cat = ['cat'] >> nvt.ops.Categorify()
groupby_features = cat + ['id', 'val'] >> nvt.ops.Groupby(
    groupby_cols=['id'],
    aggs={
        'cat': ['list', 'count'],
        'val': ['list']
    }
)

wf = nvt.Workflow(groupby_features)
ds_transformed = wf.fit_transform(ds)

Expected behavior count columns are not tagged as Tags.CATEGORICAL

Environment details (please complete the following information): current main

karlhigley commented 1 year ago

If I understand the example, the behavior of Categorify doesn't seem to be the issue. If the workflow stopped after the Categorify, categorical would be the correct label (since the output of Categorify is always categorical by definition.)

But since the workflow continues after that, the subsequent operators need to adjust the schemas of the columns they operate on in order to reflect the transforms they implement. It looks like Groupby isn't correctly applying Tags.CONTINUOUS when the aggregation is a count.

karlhigley commented 1 year ago

@jperez999, could you point someone who isn't you (@radekosmulski, @oliverholworthy, @nv-alaiacano, or @edknv maybe, if they're willing) at how to fix this in the Groupby op?

rnyak commented 1 year ago

@karlhigley @radekosmulski tagging count column as Continuous might be problematic as well.. Since an integer value (count is an int), can be seen as a categorical value for a user. It does not have to be continuous. Therefore, I'd say we should not tag count agg neither as continuous nor as categorical from Groupby op.

If a count column normalized afterwards, then it will take continuous tag, if it is gonna be bucketed or hashed via categorify op then it can take categorical tag.

karlhigley commented 1 year ago

I think you're right about how things should ultimately be tagged, but I want to re-iterate that it's each op's responsibility to apply the appropriate tags based only on the transformation that op applies (not future or previous ops, who can take care of that for themselves.)

If the result of a count aggregation is continuous, then Groupby should tag it as such. If subsequent operations like Normalize or hashing/bucketing operations change whether it's continuous, categorical, or an embedding, then those ops should apply whatever tag becomes appropriate after their transformations.

We don't want to be guessing what ops a user might apply subsequently in order to apply tags; we should just apply the currently correct tags and rely on subsequent ops to change them as needed.

rnyak commented 1 year ago

@karlhigley I agree. What I wanted to say should count be considered continuous or categorical? I say it could be both :)

karlhigley commented 1 year ago

I don't know either 😄