NVIDIA-Merlin / NVTabular

NVTabular is a feature engineering and preprocessing library for tabular data designed to quickly and easily manipulate terabyte scale datasets used to train deep learning based recommender systems.
Apache License 2.0
1.04k stars 143 forks source link

[BUG] `Groupby` with single column for grouping and calculating results throws an error #1635

Open radekosmulski opened 2 years ago

radekosmulski commented 2 years ago

Describe the bug This functionality is important because often we might want to group by an identifier column (such as customer_id for instance), perform some calculations on the groupings and separately return the column that we grouped by on (for instance, to add some sort of target using that column, or even juts to be able to tell what subset of data each calculated value relates to).

Having this functionality would be very useful, because often we might want to still run some additional preprocessing on the numerical values that are returned by the original grouping. Essentially, we need to process the identifier column separately.

Currently, doing so throws an error as below image

Steps/Code to reproduce bug Reproducer code:

import cudf
import nvtabular as nvt
import numpy as np
import datetime

purchases = cudf.DataFrame(
    data={
        'customer_id': np.random.randint(0, 10, 1000),
        'purchase_date': [datetime.date(2022, np.random.randint(1,13), np.random.randint(1,29)) for i in range(1000)],
        'quantity': np.random.randint(1, 50, 1000)
})
purchases.head()

ds = nvt.Dataset(purchases)

last_purchase_quantity = ['quantity', 'purchase_date'] >> nvt.ops.Groupby(
    groupby_cols='customer_id',
    sort_cols='purchase_date',
    aggs={'quantity': 'last'}
) >> nvt.ops.Normalize()

customer_id = ['customer_id', 'purchase_date'] >> nvt.ops.Groupby(
    groupby_cols='customer_id',
    sort_cols='purchase_date',
    aggs={'customer_id': 'last'}
)

wf = nvt.Workflow(last_purchase_quantity + customer_id)

wf.fit_transform(ds).compute()

Expected behavior No error is raised and the values are returned correctly

Environment details (please complete the following information): current NVTabular main

karlhigley commented 2 years ago

It looks like the column names are getting mangled by the GroupBy op somehow:


z> /nvtabular/nvtabular/workflow/workflow.py(523)_transform_partition()
-> col_series = output_df[col_name]
(Pdb) output_df
   c_u_s_t_o_m_e_r___i_d
0                      5
1                      0
2                      6
3                      2
4                      8
5                      4
6                      1
7                      9
8                      7
9                      3
(Pdb) col_name
'customer_id'
radekosmulski commented 2 years ago

This is being worked on under #1636