Open rnyak opened 3 years ago
Describe the bug
When we jointly encode categorical columns, nvt.ops.get_embedding_sizes(workflow) does not generate the correct embedding table shapes: the jointly encoded columns are reported separately, each with a cardinality of 0.
Steps/Code to reproduce bug
    import cudf
    import nvtabular as nvt

    df = cudf.DataFrame({
        'a_user_id': ["User_A", "User_E", "User_B", "User_C", "User_A", "User_B", "User_B", "User_C", "User_B", "User_A"],
        'b_user_id': ["User_B", "User_F", "User_D", "User_D", "User_B", "User_E", "User_E", "User_D", "User_D", "User_D"],
        'media': [3, 3, 12, 17, 3, 1, 1, 0, 1, 12],
        'language': ['en', 'en', 'spn', 'fr', 'spn', 'en', 'fr', 'ch', 'ch', 'en'],
    })
    dataset = nvt.Dataset(df)

    # Encode both user columns against one shared vocabulary
    cat_users = [['a_user_id', 'b_user_id']] >> nvt.ops.Categorify(encode_type='joint')
    # Encode the remaining columns independently
    cat_others = ['media', 'language'] >> nvt.ops.Categorify()

    workflow = nvt.Workflow(cat_users + cat_others)
    workflow.fit(dataset)

    new_gdf = workflow.transform(dataset).to_ddf().compute()
    new_gdf.head()

       a_user_id  b_user_id  media  language
    0          1          2      3         2
    1          5          6      3         2
    2          2          4      4         4
    3          3          4      5         3
    4          1          2      3         4

    nvt.ops.get_embedding_sizes(workflow)
    {'media': (6, 16), 'language': (5, 16), 'a_user_id': (0, 16), 'b_user_id': (0, 16)}

Expected behavior
The following embedding table shapes are expected:

    nvt.ops.get_embedding_sizes(workflow)
    {'media': (6, 16), 'language': (5, 16), 'a_user_id_b_user_id': (6, 16)}

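As a sanity check on the expected shape, here is a minimal plain-Python sketch (not NVTabular's implementation, and it needs neither cudf nor NVTabular) that counts the distinct values in the example data. The helper `distinct_count` is hypothetical and introduced only for illustration.

```python
# Illustrative check only: counts distinct values in the example data
# from the reproduction above. This is NOT how NVTabular computes sizes.
a_user_id = ["User_A", "User_E", "User_B", "User_C", "User_A",
             "User_B", "User_B", "User_C", "User_B", "User_A"]
b_user_id = ["User_B", "User_F", "User_D", "User_D", "User_B",
             "User_E", "User_E", "User_D", "User_D", "User_D"]
media = [3, 3, 12, 17, 3, 1, 1, 0, 1, 12]
language = ["en", "en", "spn", "fr", "spn", "en", "fr", "ch", "ch", "en"]


def distinct_count(*columns):
    """Hypothetical helper: number of distinct values across the given columns."""
    return len(set().union(*columns))


# Joint encoding shares one vocabulary across both user columns.
joint_users = distinct_count(a_user_id, b_user_id)
print("joint user ids:", joint_users)            # 6
print("media:", distinct_count(media))           # 5
print("language:", distinct_count(language))     # 4
```

The two user columns share 6 distinct IDs, which matches the 6 in the expected joint shape. The reported cardinalities for media and language (6 and 5) are each one larger than their distinct counts (5 and 4), and the transformed b_user_id column above contains the value 6, so the joint table may need one extra row for a reserved index. This sketch does not settle that point.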
Environment details (please complete the following information):
@benfred do you think this is a bug, or does nvt.ops.get_embedding_sizes(workflow) give the expected output?