NVIDIA-Merlin / NVTabular

NVTabular is a feature engineering and preprocessing library for tabular data, designed to quickly and easily manipulate terabyte-scale datasets used to train deep-learning-based recommender systems.
Apache License 2.0

[BUG] get_embedding_sizes generates wrong embedding shape with encode_type = 'joint' #864

Open rnyak opened 3 years ago

rnyak commented 3 years ago

Describe the bug

When we jointly encode categorical columns, nvt.ops.get_embedding_sizes(workflow) does not return the correct embedding table sizes.

Steps/Code to reproduce bug

import cudf
import nvtabular as nvt

df = cudf.DataFrame({
    'a_user_id': ["User_A", "User_E", "User_B", "User_C", "User_A", "User_B", "User_B", "User_C", "User_B", "User_A"],
    'b_user_id': ["User_B", "User_F", "User_D", "User_D", "User_B", "User_E", "User_E", "User_D", "User_D", "User_D"],
    'media': [3, 3, 12, 17, 3, 1, 1, 0, 1, 12],
    'language': ['en', 'en', 'spn', 'fr', 'spn', 'en', 'fr', 'ch', 'ch', 'en'],
})
dataset = nvt.Dataset(df)
cat_users = [['a_user_id', 'b_user_id']] >> nvt.ops.Categorify(encode_type='joint')
cat_others = ['media', 'language'] >> nvt.ops.Categorify()
workflow = nvt.Workflow(cat_users + cat_others)
workflow.fit(dataset)
new_gdf = workflow.transform(dataset).to_ddf().compute()
new_gdf.head()

   a_user_id  b_user_id  media  language
0          1          2      3         2
1          5          6      3         2
2          2          4      4         4
3          3          4      5         3
4          1          2      3         4
nvt.ops.get_embedding_sizes(workflow)

{'media': (6, 16),
 'language': (5, 16),
 'a_user_id': (0, 16),
 'b_user_id': (0, 16)}
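
For the separately encoded columns the reported tuples look consistent with their cardinalities. Here is a minimal check, assuming each tuple is (number of embedding rows, embedding dimension) and that Categorify reserves index 0 for nulls/out-of-vocabulary values, so a column with N distinct raw values needs N + 1 rows:

# Minimal sanity check of the separately encoded columns
# (assumes table size = nunique + 1 because index 0 is reserved for nulls/OOV)
print(df['media'].nunique() + 1)     # 6 -> matches the reported (6, 16)
print(df['language'].nunique() + 1)  # 5 -> matches the reported (5, 16)
# The jointly encoded user columns report 0 rows, which cannot be a valid
# embedding table size.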

Expected behavior

The following embedding table shapes are expected:

nvt.ops.get_embedding_sizes(workflow)

{'media': (6, 16),
 'language': (5, 16),
 'a_user_id_b_user_id': (6, 16)}
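
Until this is resolved, a rough workaround sketch (not an official API) is to patch the returned dictionary using the transformed data itself. This assumes the jointly encoded columns share a single lookup table and that index 0 is reserved for nulls/out-of-vocabulary values:

# Workaround sketch: derive the shared table size for the jointly encoded
# columns from the transformed output, since get_embedding_sizes reports 0 rows.
sizes = nvt.ops.get_embedding_sizes(workflow)
joint_cols = ['a_user_id', 'b_user_id']
# +1 assumes index 0 is reserved for nulls/out-of-vocabulary values
shared_rows = int(max(new_gdf[c].max() for c in joint_cols)) + 1
for col in joint_cols:
    sizes[col] = (shared_rows, sizes[col][1])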

Environment details (please complete the following information):

rnyak commented 2 years ago

@benfred do you think this is a bug, or does nvt.ops.get_embedding_sizes(workflow) give the expected output?