NVIDIA-Merlin / NVTabular

NVTabular is a feature engineering and preprocessing library for tabular data designed to quickly and easily manipulate terabyte scale datasets used to train deep learning based recommender systems.
Apache License 2.0

[QST] Why does nvt.ops.Categorify in 23.06 add 3 to the cardinality of a dataset's column? #1856

Closed: bogdan-radu-nechita closed this issue 1 year ago

bogdan-radu-nechita commented 1 year ago

I'm fairly new to the field, so thank you in advance for your patience and time!

When I use Categorify in a merlin-tensorflow container, something inside the op causes what looks to me like erroneous metadata to be saved to the dataset's schema. More specifically, the reported properties.embedding_sizes.cardinality for each column is off by a constant. As a result, the encoded values in the processed columns start at 3 instead of 0 or 1, and MatrixFactorizationModelV2 generates an extra set of 3 embeddings. Is this the expected behavior?

Containers used for testing, and the constant observed for each:

merlin-tensorflow 23.06, 23.04, 23.02, 22.12
Google Vertex AI container onto which I manually installed Merlin (hopefully correctly)

Container version    Extra rows
23.06                    3
23.04                    0
23.02                    0
22.12                    0

Environments:

Vertex AI Workbench (both managed and user-managed) using 2 x T4
Workstation with an AMD processor and an RTX 2080 Ti

I'll be more than happy to provide whatever information you need. I tried digging into it myself, but I am still green enough to not be able to figure it out.

Merlin-tensorflow 23.06 image

Merlin-tensorflow 23.04/23.02/22.12 image

Thank you again for your time and patience!

rnyak commented 1 year ago

Starting from release 23.06, the Categorify op is designed to encode the categories of a given column in such a way:
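For illustration, here is a minimal pure-Python sketch of an encoding with three reserved indices. It is not NVTabular's implementation. The layout (0 for padding, 1 for nulls, 2 for out-of-vocabulary values, real categories from 3 upward, most frequent first) is an assumption based on how the 23.06 Categorify documentation describes the op, and the names `fit_vocab`, `encode` and `cardinality` are hypothetical helpers made up for this example.

```python
from collections import Counter

# Assumed reserved indices (illustrative, not read from NVTabular itself).
PAD_ID, NULL_ID, OOV_ID = 0, 1, 2
FIRST_CATEGORY_ID = 3  # real categories start after the reserved slots


def fit_vocab(values):
    """Assign ids starting at 3 to non-null values, most frequent first."""
    counts = Counter(v for v in values if v is not None)
    # Sort by descending frequency; break ties by string value for determinism.
    ordered = sorted(counts, key=lambda v: (-counts[v], str(v)))
    return {v: i + FIRST_CATEGORY_ID for i, v in enumerate(ordered)}


def encode(values, vocab):
    """Map nulls to NULL_ID, unseen values to OOV_ID, known values to their id."""
    return [NULL_ID if v is None else vocab.get(v, OOV_ID) for v in values]


def cardinality(vocab):
    """Embedding table size: distinct categories plus the 3 reserved rows."""
    return len(vocab) + FIRST_CATEGORY_ID


train = ["b", "a", "b", None, "c", "b", "a"]
vocab = fit_vocab(train)
encoded = encode(train + ["z"], vocab)  # "z" was never seen during fitting

print(vocab)
print(encoded)
print(cardinality(vocab))
```

With three distinct training values, the sketch reports a cardinality of 6 and the smallest id assigned to a real category is 3, which matches the constant offset described in the question.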

Hope that answers your question.

bogdan-radu-nechita commented 1 year ago

My sincerest apologies... I was looking at outdated docs and code. Thank you for your time and again, please forgive my silly mistake!