[QST] Why does nvt.ops.Categorify in 23.06 add 3 to the cardinality of a dataset's column?

bogdan-radu-nechita commented 1 year ago

I'm fairly new to the field, so thank you in advance for your patience and time!

When using Categorify with a merlin tensorflow container, something within that function causes what looks to me like erroneous metadata to be saved to the dataset's schema. More specifically, the reported properties.embedding_sizes.cardinality for the columns is off by a constant. As a result, the processed columns start at 3 instead of 0 or 1, and MatrixFactorizationModelV2 generates 3 extra embeddings. Is this the expected behavior?

Containers used for testing this and the constants:

merlin-tensorflow 23.06 23.04 23.02 22.12
Google Vertex AI container onto which I manually installed Merlin (hopefully correctly)

Container version    Extra rows
23.06                    3
23.04                    0
23.02                    0
22.12                    0

As environments:

VertexAI Workbook (both managed and user) using 2 x T4
Workstation with AMD processor and a 2080ti

I'll be more than happy to provide whatever information you need. I tried digging into it myself, but I am still green enough to not be able to figure it out.

[Image: Merlin-tensorflow 23.06]

[Image: Merlin-tensorflow 23.04/23.02/22.12]

Thank you again for your time and patience!

rnyak commented 1 year ago

Starting from release 23.06, the Categorify op encodes the categories of a given column as follows:

reserve 0 for padding (you won't see 0s in your encoded dataset)
encode nulls as 1 (if your dataset has nulls, you will see them encoded as 1)
encode OOVs as 2 (if your validation or test set has out-of-vocabulary categories, they will be encoded as 2)
start encoding regular categories from 3 (the most frequent category in a categorical column is encoded as 3, the second most frequent as 4, and so on)
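
To make the scheme above concrete, here is a minimal plain-Python sketch of it. This is only an illustration of the four rules as described, not NVTabular's actual implementation: the names `fit_vocab`, `encode`, and the constants are hypothetical, and how the real op breaks ties between equally frequent categories is not covered here.

```python
from collections import Counter

# Reserved ids, following the rules described above (names are hypothetical).
PAD_ID, NULL_ID, OOV_ID = 0, 1, 2
FIRST_CATEGORY_ID = 3


def fit_vocab(values):
    """Assign ids to non-null categories, most frequent first, starting at 3."""
    counts = Counter(v for v in values if v is not None)
    return {
        category: FIRST_CATEGORY_ID + rank
        for rank, (category, _) in enumerate(counts.most_common())
    }


def encode(values, vocab):
    """Encode nulls as 1, unseen categories as 2, known categories via vocab."""
    return [NULL_ID if v is None else vocab.get(v, OOV_ID) for v in values]


train = ["a", "b", "a", "c", "a", "b", None]
vocab = fit_vocab(train)

encoded_train = encode(train, vocab)            # "a" is most frequent, so it gets 3
encoded_valid = encode(["b", "z", None], vocab)  # "z" was never seen in training

# Three unique categories plus the three reserved ids (padding, null, OOV).
cardinality = len(vocab) + FIRST_CATEGORY_ID

print(vocab)
print(encoded_train)
print(encoded_valid)
print(cardinality)
```

Under these assumptions, a column with 3 distinct values ends up with a cardinality of 6, which would account for the constant offset of 3 and the 3 extra embedding rows reported in the question.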

Hope that answers your question.

bogdan-radu-nechita commented 1 year ago

My sincerest apologies... I was looking at outdated docs and code. Thank you for your time and again, please forgive my silly mistake!

NVIDIA-Merlin / NVTabular

[QST] Why does nvt.ops.Categorify in 23.06 add 3 to the cardinality of a dataset's column? #1856