NVTabular is a feature engineering and preprocessing library for tabular data designed to quickly and easily manipulate terabyte scale datasets used to train deep learning based recommender systems.
[BUG] Simplify Categorify encoding for better standardization and easier reverse mapping #1748
Context
Categorify encodes categorical columns into contiguous integer ids. It offers functionality to deal with high-cardinality features, such as simple hashing and frequency capping / hashing.
When Categorify runs, it creates a mapping between original and encoded values.
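For reference, a minimal sketch of a typical Categorify workflow (column names and parameter values here are illustrative, not taken from this issue):

```python
import pandas as pd
import nvtabular as nvt

df = pd.DataFrame({"item_id": ["a", "a", "b", "c", None]})

# Encode item_id into contiguous integer ids; freq_threshold enables
# frequency capping, num_buckets would enable hashing.
cats = ["item_id"] >> nvt.ops.Categorify(freq_threshold=2)

workflow = nvt.Workflow(cats)
encoded = workflow.fit_transform(nvt.Dataset(df)).to_ddf().compute()
print(encoded["item_id"].tolist())
```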
Problem
There are currently some issues in the encoding of Categorify:
- **Collision of special encodings** - Some special values -- nulls, Out-Of-Vocabulary (OOV) values and infrequent items (when using frequency capping) -- are all encoded to id 0, so it is not possible to differentiate between them for modeling purposes.
- **Inconsistent mapping in the unique values parquet** - When the NVTabular workflow fits, the encoding mapping is persisted to a parquet file with the unique values. But the mapping in the parquet file does not match the actual mapping performed by Categorify (e.g. it does not account for `start_index` or `max_size`, see #1736). There are more examples of these mismatches in this doc (Nvidia internal).
It is important for Merlin that we make it straightforward to map encoded ids back to the original values using just a mapping table, without needing to be aware of the complex logic inside Categorify that covers the available hashing options. In RecSys, reverse mapping is critical for item ids: models predict encoded ids, and these need to be presented to the user as original ids.
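To illustrate the goal, here is a sketch of the kind of reverse mapping that should work with nothing but the persisted table. The `categories/unique.<column>.parquet` path follows the convention Categorify uses today (adjust to your workflow's output directory), and the positional-lookup assumption is exactly what this issue wants to guarantee:

```python
import pandas as pd

# Unique-values table persisted by workflow.fit(); path is illustrative.
unique_items = pd.read_parquet("categories/unique.item_id.parquet")

def decode(encoded_ids):
    # With a consistent mapping, the row position equals the encoded id,
    # so decoding is a plain positional lookup.
    return unique_items["item_id"].iloc[encoded_ids].tolist()

print(decode([3, 4, 5]))
```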
Proposed solution
This task proposes some simplifications of the Categorify encoding and is based on the discussions from this doc (Nvidia internal); check it for more details.
Save a consistent mapping table to parquet
[ ] As discussed above, ensure that when Categorify saves the unique values parquet, the mapping matches exactly the encoded item ids, so that it is easy to use that parquet mapping to reverse encoded ids back to the original values.
Fixing the first 3 values of the encoding table: \<PADDING>, \<OOV>, \<NULL>
[ ] Eliminating the `start_index` argument. It was created to let users reserve a range of control ids (the first N encoded ids) so that no original value is mapped to them by Categorify; the user could then post-process the Categorify output to set values in that range. The only use case we have found so far is reserving 0 for padding sequence features, which is common in sequential recommendation. So we remove `start_index` and reserve just id 0 for padding (or for other purposes the user might have). This simplifies the logic within Categorify considerably, since `start_index` shifts all other values.
[ ] Mapping Out-of-Vocabulary (OOV) values always to id 1 during `workflow.transform()`.
[ ] Eliminating `na_sentinel`, as nulls will always be mapped to a single id (2). A toy sketch of the target semantics follows this list.
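A toy illustration of the proposed fixed ids (this sketches the target behavior described above, not what the current implementation does):

```python
import pandas as pd
import nvtabular as nvt

train = pd.DataFrame({"item": ["a", "a", "b", None]})
test = pd.DataFrame({"item": ["a", "b", "unseen", None]})

workflow = nvt.Workflow(["item"] >> nvt.ops.Categorify())
workflow.fit(nvt.Dataset(train))
out = workflow.transform(nvt.Dataset(test)).to_ddf().compute()

# Proposed fixed ids:
#   0 = <PADDING> (reserved, never assigned by Categorify)
#   1 = <OOV>  -> "unseen"
#   2 = <NULL> -> None
#   3+ = observed values by frequency: "a" -> 3, "b" -> 4
print(out["item"].tolist())  # expected under the proposal: [3, 4, 1, 2]
```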
Hashing / Frequency hashing
[ ] Rename `max_size` (used for frequency capping and frequency hashing) to `top_frequent_values`, so that users don't assume the value is the maximum cardinality (the actual cardinality also includes the special ids and the `num_buckets`), as the arithmetic below illustrates.
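For example, under the proposed scheme the total cardinality (e.g. the embedding-table size) is the sum of the special ids, the hash buckets, and the kept frequent values (the numbers here are made up):

```python
num_special_ids = 3         # <PADDING>, <OOV>, <NULL>
num_buckets = 10            # hash buckets for infrequent values
top_frequent_values = 1000  # proposed rename of max_size

cardinality = num_special_ids + num_buckets + top_frequent_values
print(cardinality)  # 1013 -- larger than top_frequent_values alone
```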
Pre-defined vocabulary
[ ] When the user provides the `vocabs` argument with a pre-defined mapping (from original values to encoded ids), our encoding standard does not apply and we just use that mapping. We should reserve an extra position at the end of their mapping table to assign values that are potentially not found in the `vocabs` mapping (including nulls); see the sketch below.
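A sketch of that path, assuming `vocabs` takes a dict of column name to a Series of vocabulary values (check the Categorify signature for the exact accepted types):

```python
import pandas as pd
import nvtabular as nvt

# Pre-defined vocabulary: the row position determines the encoded id.
vocab = {"item_id": pd.Series(["a", "b", "c"])}

workflow = nvt.Workflow(["item_id"] >> nvt.ops.Categorify(vocabs=vocab))

# Under the proposal, one extra id at the end of this table would be
# reserved for values not found in the vocabulary (including nulls).
```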
Proposed encoding strategy
Original value | Encoded id
-- | --
**Fixed** |
\<PADDING> | 0
\<OOV> | 1
\<NULL> | 2
========= When not using hashing ========= |
1st most frequent value | 3
2nd most frequent value | 4
3rd most frequent value | 5
… | …
========= When using simple hashing - `Categorify(num_buckets=3)` ========= |
Hash bucket #1 | 3
Hash bucket #2 | 4
Hash bucket #3 | 5
========= When using frequency capping based on a threshold (`freq_threshold`) or on the top-k values (`max_size` -> `top_frequent_values`) ========= |
Infrequent bucket | 3
1st most frequent value | 4
2nd most frequent value | 5
3rd most frequent value | 6
… | …
========= When using frequency hashing - `(num_buckets, max_size -> top_frequent_values)` ========= |
Infrequent hash bucket #1 | 3
Infrequent hash bucket #2 | 4
Infrequent hash bucket #3 | 5
… | …
1st most frequent value | n-4
2nd most frequent value | n-3
3rd most frequent value | n-2
4th most frequent value | n-1
5th most frequent value | n