NVIDIA-Merlin / NVTabular

NVTabular is a feature engineering and preprocessing library for tabular data designed to quickly and easily manipulate terabyte scale datasets used to train deep learning based recommender systems.
Apache License 2.0

[BUG] `Categorify` can't process `vocabs` correctly when `num_buckets>1` #1857

Open fedaeho opened 1 year ago

fedaeho commented 1 year ago

Describe the bug

nvt.ops.Categorify doesn't process vocabs correctly when num_buckets>1 is given at the same time.

Steps/Code to reproduce bug

I tried to use the Categorify transform with pre-defined vocabs. I also have to handle multiple OOV buckets, so I pass num_buckets>1 as well.

from merlin.core import dispatch
import pandas as pd
import nvtabular as nvt

df = dispatch.make_df(
        {
            "Authors": [["User_A"], ["User_A", "User_E"], ["User_B", "User_C"], []],
            "Post": [1, 2, 3, 4],
        }
    )

cat_names = ["Authors"]
label_name = ["Post"]

vocabs = {"Authors": pd.Series([f"User_{x}" for x in "ACBE"])}
cat_features = cat_names >> nvt.ops.Categorify(
    num_buckets=2, vocabs=vocabs, max_size={"Authors": 8},
)

workflow = nvt.Workflow(cat_features + label_name)
df_out = workflow.fit_transform(nvt.Dataset(df)).to_ddf().compute()

For the above code, I expected pad to take index 0, null index 1, the two OOV buckets indices 2-3, and the supplied vocabulary to start at index 4 (User_A=4, User_C=5, User_B=6, User_E=7).

But I get the following result, with a wrong category dictionary.

df_out:

      Authors  Post
0         [7]     1
1     [ 7 10]     2
2       [9 8]     3
3          []     4

Category metadata:

     kind  offset  num_indices
0     pad       0            1
1    null       1            1
2     oov       2            1
3  unique       3            4

Category dictionary:

  Authors
3  User_A
4  User_C
5  User_B
6  User_E

I checked inside the Categorify.process_vocabs function, and oov_count picks up num_buckets correctly. But when process_vocabs calls _save_encodings(), the vocabulary dictionary is not written out correctly.
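For reference, the tables above can be read back from the parquet files that Categorify writes out when the workflow is fit. Below is a minimal inspection sketch, assuming the default output location (a ./categories directory next to the script); the exact file names are not guaranteed, so the glob simply lists whatever was written for the Authors column:

import glob

import pandas as pd

# Categorify persists its encodings as parquet files (assumed here to live
# under "./categories"); print every file that mentions the Authors column.
for path in sorted(glob.glob("./categories/**/*Authors*.parquet", recursive=True)):
    print(path)
    print(pd.read_parquet(path))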

Expected behavior

From https://github.com/NVIDIA-Merlin/NVTabular/blob/77b94a40babfea160130c70160dfdf60356b4f16/nvtabular/ops/categorify.py#L432-L438

I fixed the code so that process_vocabs calls _save_encodings with oov_count:

    def process_vocabs(self, vocabs):
        ...
                oov_count = 1
                if num_buckets:
                    oov_count = (
                        num_buckets if isinstance(num_buckets, int) else num_buckets[col_name]
                    ) or 1
                col_df = dispatch.make_df(vals).dropna()
                col_df.index += NULL_OFFSET + oov_count
                # before
                # save_path = _save_encodings(col_df, base_path, col_name)
                # after
                save_path = _save_encodings(col_df, base_path, col_name, oov_count=oov_count)
With this change, I got the following df_out, as I expected:

      Authors  Post
0         [4]     1
1      [ 4 7]     2
2       [6 5]     3
3          []     4
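For clarity, the fixed indices follow directly from the layout in the category metadata: pad and null take indices 0 and 1, the OOV buckets take the next num_buckets indices, and the supplied vocabulary starts right after them. A plain-Python sketch of that layout (not NVTabular internals):

# Plain-Python sketch of the expected index layout, not NVTabular internals.
num_buckets = 2
vocab = ["User_A", "User_C", "User_B", "User_E"]  # order as passed in `vocabs`

PAD, NULL = 0, 1
oov_indices = list(range(2, 2 + num_buckets))                  # OOV buckets
unique_start = 2 + num_buckets                                 # first vocab index
encoding = {v: unique_start + i for i, v in enumerate(vocab)}

print(PAD, NULL, oov_indices, encoding)
# 0 1 [2, 3] {'User_A': 4, 'User_C': 5, 'User_B': 6, 'User_E': 7}
# which matches the fixed df_out above: [[4], [4, 7], [6, 5], []]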

Environment details (please complete the following information):

Additional context None

EvenOldridge commented 1 year ago

In all of the applications I've built, OOV has been a single embedding used to represent the fact that the item is new or rare. Can you help me understand the use case? Why would you want multiple OOV values? They're so rare that they'll effectively end up as random embeddings; grouping them gives you some information.