NVIDIA-Merlin / NVTabular

NVTabular is a feature engineering and preprocessing library for tabular data designed to quickly and easily manipulate terabyte scale datasets used to train deep learning based recommender systems.
Apache License 2.0

[BUG] Simplify Categorify encoding for better standardization and easier reverse mapping #1748

Closed gabrielspmoreira closed 1 year ago

gabrielspmoreira commented 1 year ago

Context

Categorify encodes categorical columns into contiguous integer ids. It offers functionalities to deal with high-cardinality features, such as simple hashing and frequency capping / hashing. When Categorify runs, it creates a mapping between original and encoded values.
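
To make the idea concrete, here is a minimal pure-Python sketch of encoding a categorical column into contiguous ids and keeping the mapping for reverse lookup. It is not NVTabular's implementation: the helper names (`fit_mapping`, `encode`, `decode`) are hypothetical, ids are simply assigned by descending frequency starting at 0, and reserved ids and hashing are ignored.

```python
from collections import Counter


def fit_mapping(values):
    """Assign contiguous ids, most frequent value first (hypothetical helper)."""
    counts = Counter(values)
    # sorted() is stable and Counter keeps first-seen order, so ties keep that order.
    ordered = sorted(counts, key=lambda v: -counts[v])
    return {value: idx for idx, value in enumerate(ordered)}


def encode(values, mapping):
    """Replace each original value by its integer id."""
    return [mapping[v] for v in values]


def decode(ids, mapping):
    """Reverse mapping: recover the original values from the ids."""
    reverse = {idx: value for value, idx in mapping.items()}
    return [reverse[i] for i in ids]


column = ["b", "a", "b", "c", "b", "a"]
mapping = fit_mapping(column)
encoded = encode(column, mapping)
decoded = decode(encoded, mapping)
```

The reverse lookup only works because the mapping is kept alongside the encoded data, which is why the mapping table matters for the rest of this issue.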

Problem

There are currently some issues in the encoding of Categorify:

Proposed solution

This task proposes some simplifications of the Categorify encoding and is based on the discussions from this doc (Nvidia internal). Check it for more details.

Save a consistent mapping table to Parquet

Fixing the first 3 values of the encoding table: `<PADDING>`, `<OOV>`, `<NULL>`

Hashing / Frequency hashing

Pre-defined vocabulary

Proposed encoding strategy

Original value | Encoded id
-- | --
**Fixed** |
`<PADDING>` | 0
`<OOV>` | 1
`<NULL>` | 2
**When not using hashing** |
1st most frequent value | 3
2nd most frequent value | 4
3rd most frequent value | 5
… | …
**When using simple hashing - `Categorify(num_buckets=3)`** |
Hash bucket #1 | 3
Hash bucket #2 | 4
Hash bucket #3 | 5
**When using frequency capping based on threshold (`freq_threshold`) or number of top-k values (`max_size->top_frequent_values`)** |
Infrequent bucket | 3
1st most frequent value | 4
2nd most frequent value | 5
3rd most frequent value | 6
… | …
**When using frequency hashing - `(num_buckets, max_size->top_frequent_values)`** |
Infrequent hash bucket #1 | 3
Infrequent hash bucket #2 | 4
Infrequent hash bucket #3 | 5
… | …
1st most frequent value | n-4
2nd most frequent value | n-3
3rd most frequent value | n-2
4th most frequent value | n-1
5th most frequent value | n

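
The proposed encoding strategy above can be sketched in plain Python. This is an illustration under stated assumptions, not the code from NVTabular: `build_mapping` and `encode_value` are hypothetical helpers, `zlib.crc32` stands in for whatever hash function the library uses, and sending values that are missing from the mapping to the infrequent bucket(s) whenever one exists (and to `<OOV>` otherwise) is an assumption. The parameter names `freq_threshold`, `max_size` and `num_buckets` follow the table.

```python
from collections import Counter
import zlib

# Reserved ids from the proposal: <PADDING>, <OOV>, <NULL>.
PADDING_ID, OOV_ID, NULL_ID = 0, 1, 2
FIRST_FREE_ID = 3


def build_mapping(values, freq_threshold=0, max_size=0, num_buckets=0):
    """Return (mapping, n_infrequent) laid out as in the proposed table."""
    counts = Counter(v for v in values if v is not None)
    frequent = sorted(counts, key=lambda v: -counts[v])
    capping = freq_threshold > 0 or max_size > 0
    if freq_threshold > 0:
        frequent = [v for v in frequent if counts[v] >= freq_threshold]
    if max_size > 0:
        frequent = frequent[:max_size]
    if num_buckets > 0 and not capping:
        frequent = []  # simple hashing: every value goes to a hash bucket
    if num_buckets > 0:
        n_infrequent = num_buckets
    else:
        n_infrequent = 1 if capping else 0  # single infrequent bucket
    start = FIRST_FREE_ID + n_infrequent
    mapping = {v: start + i for i, v in enumerate(frequent)}
    return mapping, n_infrequent


def encode_value(value, mapping, n_infrequent):
    """Encode one value; unknown values are hashed or sent to <OOV> (assumption)."""
    if value is None:
        return NULL_ID
    if value in mapping:
        return mapping[value]
    if n_infrequent == 0:
        return OOV_ID
    bucket = zlib.crc32(str(value).encode("utf-8")) % n_infrequent
    return FIRST_FREE_ID + bucket


data = ["a", "b", "a", "c", "a", "b", None]
plain, plain_buckets = build_mapping(data)
capped, capped_buckets = build_mapping(data, freq_threshold=2)
freq_hashed, freq_hashed_buckets = build_mapping(data, max_size=2, num_buckets=3)
hashed, hashed_buckets = build_mapping(data, num_buckets=3)
```

With this layout, ids below 3 always have the same meaning regardless of the options used, and the frequent values always come after the infrequent bucket(s), which keeps the reverse mapping a simple table lookup.
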
nv-alaiacano commented 1 year ago

Resolved with https://github.com/NVIDIA-Merlin/NVTabular/pull/1692 🎉