NVTabular is a feature engineering and preprocessing library for tabular data designed to quickly and easily manipulate terabyte scale datasets used to train deep learning based recommender systems.
[BUG] Simplify Categorify encoding for better standardization and easier reverse mapping #1748
Context
Categorify encodes categorical columns into contiguous integer ids. It offers functionality to deal with high-cardinality features, such as simple hashing and frequency capping / hashing.
When Categorify runs, it creates a mapping between original and encoded values.
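For reference, a minimal sketch of a typical Categorify workflow (column names and parameter values here are illustrative, not taken from this issue):

```python
import pandas as pd
import nvtabular as nvt

df = pd.DataFrame({"item_id": ["a", "a", "b", "c", None]})

# Encode item_id into contiguous integer ids; freq_threshold enables
# frequency capping, num_buckets would enable hashing.
cats = ["item_id"] >> nvt.ops.Categorify(freq_threshold=2)

workflow = nvt.Workflow(cats)
encoded = workflow.fit_transform(nvt.Dataset(df)).to_ddf().compute()
print(encoded["item_id"].tolist())
```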
Problem
There are currently some issues in the encoding of Categorify:
- **Collision of special encodings** - Some special values -- nulls, Out-Of-Vocabulary (OOV) values and infrequent items (when using frequency capping) -- are all encoded to id 0, so it is not possible to differentiate between them for modeling purposes.
- **Inconsistent mapping in the unique values parquet** - When the NVTabular workflow fits, the encoding mapping is persisted to a parquet file with the unique values. But the mapping in the parquet file does not match the actual mapping performed by Categorify (e.g. it does not account for `start_index` or `max_size`, see #1736). There are more examples of these mismatches in this doc (Nvidia internal).
It is important for Merlin that we make it straightforward to map encoded ids back to the original values using just a mapping table, without needing to be aware of the complex logic inside Categorify that covers the available hashing options. In RecSys, reverse mapping is critical for item ids: models predict encoded ids, and these need to be presented to the user as original ids.
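To illustrate the goal, here is a sketch of the kind of reverse mapping that should work with nothing but the persisted table. The `categories/unique.<column>.parquet` path follows the convention Categorify uses today (adjust to your workflow's output directory), and the positional-lookup assumption is exactly what this issue wants to guarantee:

```python
import pandas as pd

# Unique-values table persisted by workflow.fit(); path is illustrative.
unique_items = pd.read_parquet("categories/unique.item_id.parquet")

def decode(encoded_ids):
    # With a consistent mapping, the row position equals the encoded id,
    # so decoding is a plain positional lookup.
    return unique_items["item_id"].iloc[encoded_ids].tolist()

print(decode([3, 4, 5]))
```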
Proposed solution
This task proposes some simplifications of the Categorify encoding and is based on the discussions from this doc (Nvidia internal); check it for more details.
Save a consistent mapping table to parquet
[ ] As discussed above, ensure that when Categorify saves the unique values parquet, the mapping matches exactly the encoded item ids, so that it is easy to use that parquet mapping to reverse encoded ids back to the original values.
Fixing the first 3 values of the encoding table: \<PADDING>, \<OOV>, \<NULL>
[ ] Eliminating the `start_index` argument. It was created to let users reserve a range of control ids (the first N encoded ids) so that no original value is mapped to them by Categorify; the user could then post-process the Categorify output to set values in that range. The only use case we have found so far is reserving 0 for padding sequence features, which is common in sequential recommendation. So we remove `start_index` and reserve just id 0 for padding (or for other purposes the user might have). This simplifies the logic within Categorify considerably, since `start_index` shifts all other values.
[ ] Mapping Out-of-Vocabulary (OOV) values always to id 1 during `workflow.transform()`.
[ ] Eliminating `na_sentinel`, as nulls will always be mapped to a single id (2). A toy sketch of the target semantics follows this list.
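A toy illustration of the proposed fixed ids (this sketches the target behavior described above, not what the current implementation does):

```python
import pandas as pd
import nvtabular as nvt

train = pd.DataFrame({"item": ["a", "a", "b", None]})
test = pd.DataFrame({"item": ["a", "b", "unseen", None]})

workflow = nvt.Workflow(["item"] >> nvt.ops.Categorify())
workflow.fit(nvt.Dataset(train))
out = workflow.transform(nvt.Dataset(test)).to_ddf().compute()

# Proposed fixed ids:
#   0 = <PADDING> (reserved, never assigned by Categorify)
#   1 = <OOV>  -> "unseen"
#   2 = <NULL> -> None
#   3+ = observed values by frequency: "a" -> 3, "b" -> 4
print(out["item"].tolist())  # expected under the proposal: [3, 4, 1, 2]
```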
Hashing / Frequency hashing
[ ] Rename `max_size` (used for frequency capping and frequency hashing) to `top_frequent_values`, so that users don't assume the value is the maximum cardinality (the actual cardinality also includes the special ids and the `num_buckets`), as the arithmetic below illustrates.
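For example, under the proposed scheme the total cardinality (e.g. the embedding-table size) is the sum of the special ids, the hash buckets, and the kept frequent values (the numbers here are made up):

```python
num_special_ids = 3         # <PADDING>, <OOV>, <NULL>
num_buckets = 10            # hash buckets for infrequent values
top_frequent_values = 1000  # proposed rename of max_size

cardinality = num_special_ids + num_buckets + top_frequent_values
print(cardinality)  # 1013 -- larger than top_frequent_values alone
```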
Pre-defined vocabulary
[ ] When the user provides the `vocabs` argument with a pre-defined mapping (from original values to encoded ids), our encoding standard does not apply and we just use that mapping. We should reserve an extra position at the end of their mapping table to assign values that are potentially not found in the `vocabs` mapping (including nulls); see the sketch below.
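A sketch of that path, assuming `vocabs` takes a dict of column name to a Series of vocabulary values (check the Categorify signature for the exact accepted types):

```python
import pandas as pd
import nvtabular as nvt

# Pre-defined vocabulary: the row position determines the encoded id.
vocab = {"item_id": pd.Series(["a", "b", "c"])}

workflow = nvt.Workflow(["item_id"] >> nvt.ops.Categorify(vocabs=vocab))

# Under the proposal, one extra id at the end of this table would be
# reserved for values not found in the vocabulary (including nulls).
```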
Proposed encoding strategy
Original value | Encoded id
-- | --
**Fixed** |
\<PADDING> | 0
\<OOV> | 1
\<NULL> | 2
========= When not using hashing ========= |
1st most frequent value | 3
2nd most frequent value | 4
3rd most frequent value | 5
… | …
========= When using simple hashing - `Categorify(num_buckets=3)` ========= |
Hash bucket #1 | 3
Hash bucket #2 | 4
Hash bucket #3 | 5
========= When using frequency capping based on a threshold (`freq_threshold`) or on the top-k values (`max_size` -> `top_frequent_values`) ========= |
Infrequent bucket | 3
1st most frequent value | 4
2nd most frequent value | 5
3rd most frequent value | 6
… | …
========= When using frequency hashing - `(num_buckets, max_size -> top_frequent_values)` ========= |
Infrequent hash bucket #1 | 3
Infrequent hash bucket #2 | 4
Infrequent hash bucket #3 | 5
… | …
1st most frequent value | n-4
2nd most frequent value | n-3
3rd most frequent value | n-2
4th most frequent value | n-1
5th most frequent value | n