fedaeho opened 1 year ago
In all of the applications I've built, OOV has been a single embedding, used to represent the fact that an item is new or rare. Can you help me understand the use case? Why would you want multiple OOV values? They're so rare that they'll effectively end up as random embeddings; grouping them gives you some information.
**Describe the bug**

`nvt.ops.Categorify` doesn't process `vocabs` correctly when `num_buckets > 1` is given simultaneously.

**Steps/Code to reproduce bug**
I tried to use the `Categorify` transform with pre-defined `vocabs`. I also have to consider multiple OOV values, so I also pass `num_buckets > 1` as a parameter.
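The reproduction snippet itself isn't preserved above, so the following is a minimal sketch of the kind of workflow described, assuming a single `Authors` column with made-up values; the `vocabs` and `num_buckets` arguments are the ones the report names, and the expected-index comment reflects NVTabular's usual encoding layout (pad, null, OOV buckets, then the supplied vocab) rather than anything quoted from the report.

```python
import pandas as pd
import nvtabular as nvt
from nvtabular import ops

# Toy input; the actual DataFrame from the report is not preserved.
df = pd.DataFrame({"Authors": ["User_A", "User_B", "User_C", "User_D"]})

cats = ["Authors"] >> ops.Categorify(
    vocabs={"Authors": pd.Series(["User_A", "User_B", "User_C"])},  # pre-defined vocab
    num_buckets=2,  # more than one OOV bucket, as in the report
)

workflow = nvt.Workflow(cats)
df_out = workflow.fit_transform(nvt.Dataset(df)).to_ddf().compute()

# Assumed expected mapping (0 = pad, 1 = null, 2-3 = the two OOV buckets,
# vocab entries starting at 4): User_A -> 4, User_B -> 5, User_C -> 6,
# and the unseen User_D hashed into bucket 2 or 3.
print(df_out)
```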
For the above code, the expected indices for the values are as below.
But I get the following result, with a wrong category dictionary:
```python
df_out
pd.read_parquet("./categories/meta.Authors.parquet")
pd.read_parquet("./categories/unique.Authors.parquet")
```
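Since the printed frames aren't preserved here, a note on what these inspect (based on how `Categorify` normally persists its categories, not on the report itself): `unique.Authors.parquet` holds the saved vocabulary for the column, while `meta.Authors.parquet` records the layout of the encoding, i.e. how many indices are reserved for padding, nulls, OOV buckets, and unique values.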
I checked inside the `Categorify.process_vocabs` function, and `oov_count` picks up `num_buckets` correctly. But when `process_vocabs` calls `Categorify._save_encodings()`, it doesn't build the vocabulary dictionary correctly.

**Expected behavior**

Starting from https://github.com/NVIDIA-Merlin/NVTabular/blob/77b94a40babfea160130c70160dfdf60356b4f16/nvtabular/ops/categorify.py#L432-L438:
If I fix the code so that `process_vocabs` calls `Categorify._save_encodings` with `oov_count`, `df_out` comes out as I expected.
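A sketch of that change, under the assumption that `_save_encodings` accepts an `oov_count` argument as the report implies; the surrounding names (`col_df`, `base_path`, `col_name`) are stand-ins for whatever the real call site at the linked lines uses:

```python
# Inside Categorify.process_vocabs (sketch, not the verbatim source):
oov_count = 1
if self.num_buckets:
    # num_buckets may be a single int or a per-column mapping
    oov_count = (
        self.num_buckets
        if isinstance(self.num_buckets, int)
        else self.num_buckets[col_name]
    )

# The reporter's fix: forward oov_count to _save_encodings instead of
# letting it fall back to a single OOV slot.
save_path = _save_encodings(col_df, base_path, col_name, oov_count=oov_count)
```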
**Environment details (please complete the following information):**
- Method of NVTabular install: pip
**Additional context**

None