[BUG] Simplify Categorify encoding for better standardization and easier reverse mapping

NVIDIA-Merlin / NVTabular

NVTabular is a feature engineering and preprocessing library for tabular data designed to quickly and easily manipulate terabyte scale datasets used to train deep learning based recommender systems.

Apache License 2.0

1.04k stars 143 forks source link

[BUG] Simplify Categorify encoding for better standardization and easier reverse mapping #1748

Closed gabrielspmoreira closed 1 year ago

gabrielspmoreira commented 1 year ago

Context

Categorify encodes categorical columns into contiguous integer ids. It offers functionalities to deal with high-cardinality features, such as simple hashing and frequency capping / hashing. When Categorify runs, it creates a mapping between original and encoded values.

Problem

There are currently some issues in the encoding of Categorify:

Collision of special encodings - Some special values -- Nulls, Out-Of-Vocabulary and infrequent items (when using frequency capping) -- are all encoded to id 0, so it is not possible to differentiate between them for modeling purposes.
Inconsistent mapping in unique values parquet - When the NVTabular workflow fits, the encoding mapping is persisted to parquet file with the unique values. But the mapping in the parquet file does not match the actual mapping performed by categorify (e.g. for not considering the start_index or max_size #1736). There are more examples of these mismatches in this doc (Nvidia internal). It is important for Merlin that we make it more straightforward to map the encoded ids back to the original values by using just a mapping table, without the need to be aware of the complex logic inside Categorify to cover the hashing options available. In RecSys, reverse mapping is critical for item id, as that is what models will be predicting (encoded ids) and they need to be presented to the user (original ids).

Proposed solution

This task proposes some simplifications of the Categorify encoding and is based in the discussions from this doc (Nvidia internal). Check it for more details

Save a consistence mapping table to parquet

[ ] As discussed above, ensure that when Categorify saves the unique values parquet, the mapping matches exactly the encoded item ids, so that it is easy to use that parquet mapping to reserve back to original ids.

Fixing the first 3 values of encoding table: \<PADDING>, \<OOV>, \<NULL>

[ ] Eliminating start_index argument. It was created to allow for users reserving a range of control ids (the first N encoded ids) so that no original value is mapped to them by Categorify. Then, the user could do some post-processing after Categorify to set some values in that range. The only use case we found so far for that was reserving 0 for padding sequence features, which is common for sequential recommendation. So we remove start_index and reserve just the id 0 for padding (or for other purposes the user might use it). That simplifies a lot the logic within Categorify as start_index shifts all other values.
[ ] Mapping Out-of-Vocabulary (OOV) values during workflow.transform() always to id (1)
[ ] Eliminating na_sentinel as nulls will be always mapped to a single id (2).

Hashing / Frequency hashing

[ ] Rename max_size (used for frequency capping and frequency hashing) to top_frequent_values, so that users don’t assume that the value will be the maximum cardinality (as that one will also include the special ids and the num_buckets).

Pre-defined vocabulary

[ ] When the user provides the vocabs argument with a pre-defined mapping (from original values to encoded ids) our encoding standard does not apply and we just use that mapping. We should reserve an extra position in the end of their mapping table to assign values that are potentially not found (including nulls) in the vocabs mapping

Proposed encoding strategy

rjzamora commented 1 year ago

Thanks for writing this up @gabrielspmoreira!

This proposal seems to be moving in the right direction. The only thing I am still a bit unclear on is the specific expectation of what data will be included in the unique-value file. Is the overall goal to make it possible for the reverse mapping to be extracted from the unique-value file alone?

Lets take the following column example:

      x
0   cat
1   cat
2   dog
3  <NA>
4  fish
5   cat
6   dog

I'm assuming the default version of Categorify will produce a unique-value file like this:

      x  x_size
2  <NA>       1
3   cat       3
4   dog       2
5  fish       1

where the index is just a RangIndex starting at 2 (the null index), because 0 and 1 are reserved for padding and OOV. What I don't always understand is what should happen for non-default options. For example, what should this file look like if the user specifies max_size=2 or freq_threshold=2? Do we then include a row for OOV, and include the value count for values that will now be mapped to 1? E.g.

      x  x_size
1  <NA>       1
2  <NA>       1
3   cat       3
4   dog       2

If so, should the "value" be listed as <NA> for OOV? I am also confused about cases like num_buckets=2. Will we then store hash values in the file? E.g.

              x  x_size
2    2654435768       1
3     110891232       3
4    1341686676       2
5    2980454984       1

If so, how would we include an OOV-value count when max_size or freq_threshold is also specified?

EvenOldridge commented 1 year ago

@rnyak @gabrielspmoreira @nv-alaiacano @sararb can you add examples of how the mapping works under different scenarios.

rnyak commented 1 year ago

Thanks for writing this up @gabrielspmoreira!

This proposal seems to be moving in the right direction. The only thing I am still a bit unclear on is the specific expectation of what data will be included in the unique-value file. Is the overall goal to make it possible for the reverse mapping to be extracted from the unique-value file alone?

Lets take the following column example:
      x
0   cat
1   cat
2   dog
3  <NA>
4  fish
5   cat
6   dog
I'm assuming the default version of Categorify will produce a unique-value file like this:
      x  x_size
2  <NA>       1
3   cat       3
4   dog       2
5  fish       1
where the index is just a RangIndex starting at 2 (the null index), because 0 and 1 are reserved for padding and OOV. What I don't always understand is what should happen for non-default options. For example, what should this file look like if the user specifies max_size=2 or freq_threshold=2? Do we then include a row for OOV, and include the value count for values that will now be mapped to 1? E.g.
      x  x_size
1  <NA>       1
2  <NA>       1
3   cat       3
4   dog       2
If so, should the "value" be listed as <NA> for OOV? I am also confused about cases like num_buckets=2. Will we then store hash values in the file? E.g.
              x  x_size
2    2654435768       1
3     110891232       3
4    1341686676       2
5    2980454984       1
If so, how would we include an OOV-value count when max_size or freq_threshold is also specified?

@rjzamora That's a good question and we need to clarify. @gabrielspmoreira in the scenario of simple hashing (setting num_bucket >1), how is this going to be mapped in the unique_item_id parquet file? note that in the parquet file we have item_id and item_size columns, where the former is the original item_id. what are we going to write under item_id column in the hashing case? I dont think we want to write down a list of all the original item_ids that fall in Hash bucket #1. We should clarify that.

Besides, it is better we keep item_id_size column in the unique item_id parquet file since it is useful information because it basically corresponds to item_id frequencies, and these frequencies can be used for different purposes like we do in PopularityLogitsCorrection in Merlin Models.

karlhigley commented 1 year ago

It might make sense to save the original item id, the hashed value (when hashing is applied), and the item id counts all together in the Parquet files, because then it could be used as a lookup at serving time (so we don't actually have to run the hash functions there) and the counts could be used to do popularity-based filtering/sampling/re-ordering.

rjzamora commented 1 year ago

It might make sense to save the original item id, the hashed value (when hashing is applied), and the item id counts all together in the Parquet files, because then it could be used as a lookup at serving time (so we don't actually have to run the hash functions there) and the counts could be used to do popularity-based filtering/sampling/re-ordering.

Not completely sure I follow - A ~key~ common motivation for hashing in practice is that recording all unique values is intractable (due to high-cardinality or other limitations). If the unique-value file was required to record all unique values that have been hashed to a particular value, then performance would drop down to that of direct encoding (or worse).

rnyak commented 1 year ago

It might make sense to save the original item id, the hashed value (when hashing is applied), and the item id counts all together in the Parquet files, because then it could be used as a lookup at serving time (so we don't actually have to run the hash functions there) and the counts could be used to do popularity-based filtering/sampling/re-ordering.

Not completely sure I follow - A ~key~ common motivation for hashing in practice is that recording all unique values is intractable (due to high-cardinality or other limitations). If the unique-value file was required to record all unique values that have been hashed to a particular value, then performance would drop down to that of direct encoding (or worse).

If the unique-value file was required to record all unique values >> I say we should not record. The question is more how we design a unique parquet file in case of using simple hashing with num_buckets >=1 arg, for exp? what we will write for the original item_id? the hashed value? and the index would be encoded bucket id? what's your idea for this situation?

karlhigley commented 1 year ago

If the unique-value file was required to record all unique values that have been hashed to a particular value, then performance would drop down to that of direct encoding (or worse).

Hmm, what I was thinking may or may not make sense. Let me see if I can lay out my thought process and see where we diverge:

We usually feed categorical features through Categorify, which sometimes includes ids and other high-cardinality features you might choose to use hashing for
Hashing high-cardinality categoricals like ids helps to reduce the size of embedding tables in the resulting models
We're also reasonably likely to use the unhashed ids to look up features at serving time (e.g. from a feature store, Redis, etc)
Having a way to put the raw id, categorified id, and hashed id next to each other in allows us to avoid running Categorify at serving time and replace it with an extra column returned by a lookup we'd already do anyway

Maybe the difference in what we're thinking comes from the context difference between "in GPU memory" and "in storage somewhere (device memory, disk, maybe on the other end of a network call)"?

rjzamora commented 1 year ago

Please keep in mind that I have much less experience with the production/serving side of things, so I'm sorry if I'm being naive.

Hashing high-cardinality categoricals like ids helps to reduce the size of embedding tables in the resulting models

Yeah, that makes sense. It is probably important to establish that there are two big reasons why we may need hashing. The first is to optimize the size of embedding tables - I'm assuming this motivation kicks in even when it is still possible to comfortably fit all unique values in memory. The second is to get around the fact that a truly high-cardinality feature may be too large to fit in memory (or it is just large enough to put you at risk of OOM errors).

For the first case, it may be beneficial to take some extra time in the NVT fit to record the mapping between each hash value to the list of all known "real" values. However, this is definitely not something you would want to require Categoify to do for all cases. It is also only beneficial if you need to be able to map a hashed feature to the list of all known "real" features.

We're also reasonably likely to use the unhashed ids to look up features at serving time (e.g. from a feature store, Redis, etc)

Can you explain this a bit more? Are you just saying that you will often be working with "real" rather than encoded features at serving time? If so, makes sense to me.

Having a way to put the raw id, categorified id, and hashed id next to each other in allows us to avoid running Categorify at serving time and replace it with an extra column returned by a lookup we'd already do anyway

It seems to me like hash-based encoding should be the one case where the fit stage of Categorify should not even need to read-in and use the "unique-value" file. I realize this may not be the case now, but all you need to do is map null values to 1, and then encode all other values by doing a hash + mod(num_buckets) + add(3). Therefore, it seems like running Categorify could be much faster than reading in and joining a lookup table.

rjzamora commented 1 year ago

Some notes to establish my current understanding of what the unique-value file should look like for various Categorify scenarios...

Assume the following raw column x:

          x
0     Edith
1     Alice
2      <NA>
3       Bob
4   Charlie
5     Edith
6     Frank
7      <NA>
8       Dan
9     Edith
10    Frank
11     Gary
12      Bob

Default

I would expect a default (direct) Categorify fit to produce:

         x  x_size
1     <NA>       2
2     <NA>       0
3    Edith       3
4      Bob       2
5    Frank       2
6    Alice       1
7  Charlie       1
8      Dan       1
9     Gary       1

Here, we see that the index of 0 is reserved for padding, and is therefore completely excluded from the file. Index-value 1 is included to record the null-value count, and index 2 is included to record the OOV count. However, for a default/direct encoding, there should never be any OOV values.

Setting `max_size`

If the Categorify operation is modified to use max_size=4, I would assume that we are essentially taking the exact same result, but then leaving out rows > 4 and adding the dropped "x_size" sum to the OOV row:

       x  x_size
1   <NA>       2
2   <NA>       6
3  Edith       3
4    Bob       2

Note that a very similar logic would be used for frequency capping. That is, we start with the full mapping, and then clip the table down to the necessary size.

[EDIT: Is it correct to map the clipped values to OOV, or should we be introducing a distinct "infrequent-values" index?]

Setting `num_buckets`

If the user specifies num_buckets=4, then we should probably avoid writing an "x" column in the file, and only include "x_size". In this case, the unique value file should only be used to record value-count statistics, becasue the actual raw-to-encoded value mapping is simply add(mod(hash(raw), num_buckets), 3) for all cases.

Please note that this is the scenario that I am most uncertain of. I am making the assumption that the current options in Categorify simply allow users to combine hashing with other options that we simply shouldn't allow them to be combined with. That is, it doesn't really make sense to mix freq_threshold or max_size with num_buckets, because the reverse mapping becomes meaningless. Once the user chooses hashing, it no longer makes sense for the mapping to depend on the value-count or desired table size. Please do correct me if I have this wrong. I strongly suspect that I am missing an obvious exception to my thinking :)

rnyak commented 1 year ago

For the default scenario, can not the item_id be at the index 2 (like below)? why it reads as ? we wont have any OOVs in training but in the valid set we can, therefore in the valid set when we see encoded categories (e.g. item_ids) as 2 we'd know these are for OOVs.

         x  x_size
1     <NA>       2
2     <OOV>      0
3    Edith       3
4      Bob       2
5    Frank       2
6    Alice       1
7  Charlie       1
8      Dan       1
9     Gary       1

rjzamora commented 1 year ago

For the default scenario, can not the item_id be at the index 2 (like below)? why it reads as ?

Are you asking why I have <NA> in the OOV row? If so, this is because the "value" for the OOV row would need to be null. I don't think we should arbitrarily choose a string or numerical value to represent OOV.

we wont have any OOVs in training but in the valid set we can, therefore in the valid set when we see encoded categories (e.g. item_ids) as 2 we'd know these are for OOVs.

Yes it makes sense to reserve index 2 for OOV, and that is what I was attempting to show (1 is for nulls, and 2 is for OOV - However, it doesn't make sense to include a literal "value" for either of these rows).

rjzamora commented 1 year ago

I strongly suspect that I am missing an obvious exception to my thinking :)

It seems like one case that probably complicates my basic understanding is that of "frequency hashing". Judging from the design-doc, it seems like this is the primary case where there is a mix of direct and hash-based encoding. Therefore, the idea to leave out the "x" column in the unique-value file may be problematic. I suppose any rows corresponding hash buckets could just include null values in the "x" column? E.g. for freq_threshold=2 and num_buckets=2:

         x  x_size
1     <NA>       2
2     <NA>       0
3     <NA>       1
4     <NA>       3
5    Edith       3
6      Bob       2
7    Frank       2

where index 3 and 4 correspond to hash buckets where infrequent values are mapped to instead of OOV. I can't say I really understand why this would be more useful than mapping infrequent values to OOV, but I can understand the motivation to provide the option.

This leads me to a simpler question: Should max_size and freq_threshold always result in a distinct row (or more for num_buckets>1) for "infrequent values"? My example above assumed the unfrequent values should be mapped to the OOV index, but I am now realizing the design doc is calling for slightly-more complex behavior.

Possible compromise [EDIT]

Perhaps the simplest way to deal with the overlap between "OOV" and "infrequent buckets" is to use num_buckets to determine the number of rows (starting with index 2) to use for OOV mapping. For example, if max_size or freq_threshold is used while num_buckets is set to 1 or None, we only use index 2 for OOV mapping:

       x  x_size
1   <NA>       2
2   <NA>       6
3  Edith       3
4    Bob       2

Here, index 2 corresponds to the 0th (and only) OOV "bucket".

However, if num_buckets is set greater than 1 (say num_buckets=3), we can simply use indices 2:2+num_buckets to map the "infrequent" values to OOV buckets (using hashing):

       x  x_size
1   <NA>       2
2   <NA>       2
3   <NA>       1
4   <NA>       3
5  Edith       3
6    Bob       2

karlhigley commented 1 year ago

We're also reasonably likely to use the unhashed ids to look up features at serving time (e.g. from a feature store, Redis, etc)

Can you explain this a bit more? Are you just saying that you will often be working with "real" rather than encoded features at serving time? If so, makes sense to me.

Yeah, pretty much. We'll get some raw/unprocessed entity id in a request (like a user's UUID), use that id to fetch some unprocessed entity attributes from a key/value store (which often lives on the other side of a network request), then feed those attributes through some feature processing with NVT before they get to the model. The K/V store is basically a collection of lookup tables anyway, so we may be able to do the translation from raw id to processed id as part of the K/V store request, if we produce output in a format that could be loaded there and fetched on a per-row basis instead of keeping the whole lookup table in GPU memory.

Having a way to put the raw id, categorified id, and hashed id next to each other in allows us to avoid running Categorify at serving time and replace it with an extra column returned by a lookup we'd already do anyway

It seems to me like hash-based encoding should be the one case where the fit stage of Categorify should not even need to read-in and use the "unique-value" file. I realize this may not be the case now, but all you need to do is map null values to 1, and then encode all other values by doing a hash + mod(num_buckets) + add(3). Therefore, it seems like running Categorify could be much faster than reading in and joining a lookup table.

I think this may be a "different contexts" kind of thing. At serving time we don't run the fit at all, wouldn't want to read in or hold the table in memory, and definitely don't want to do any joins. If possible, we don't even want to work with dataframes at all, since we'll have a tiny number of rows. The ideal case is that we send a short list of ids to a data store and it hands us back the data associated with them. Since it's possible to store very large data with tools like Redis/PostGres/Cassandra/BigTable and we'd probably be doing that already, I'm not worried about the space an extra column might take up.

We're currently using an approach like this in our multi-stage serving example, where we use NVT to Categorify ids and then we store both the raw and processed id in the feature store. That gives us a way to map from one to the other just by doing a look-up, which also allows us to do "reverse Categorify" without loading the unique values file into memory.

rjzamora commented 1 year ago

I think this may be a "different contexts" kind of thing. At serving time we don't run the fit at all, wouldn't want to read in or hold the table in memory, and definitely don't want to do any joins. If possible, we don't even want to work with dataframes at all, since we'll have a tiny number of rows. The ideal case is that we send a short list of ids to a data store and it hands us back the data associated with them.

This totally makes sense, and is mostly consistent with what I was thinking. At serving time you could certainly use a feature store to encode raw data, and use simple hashing (and update the K/V store) if there is a key miss. If you use the Categorify encodings up-front to populate the k/v store, then you shouldn't have a key miss for data you have already seen in your training set.

What I'm saying we probably don't want to do is require Categorify.fit to aggregate a list of all unique values when we already know that they are being mapped by hash. The need to aggregate high-cardinality values is a known OOM pain point in NVT, and simply mapping a function to update a k/v store (by hash) would be much more reliable.

I'm not worried about the space an extra column might take up.

Just want to clarify that the concern has nothing to do with adding a column to the parquet file, but with the need to actually aggregate the extra data to write in that column (probably across multiple parquet files). For a dataset with billions of unique values, we are talking about aggregating gigabytes of extra data. My intuition is that you probably do want to provide a formal mechanism to record a mapping for all encountered unique values. However, I'm trying to discourage this mechanism from becoming a row in the unique-value parquet file that is currently used by Categorify.

karlhigley commented 1 year ago

I'm trying to discourage this mechanism from becoming a row in the unique-value parquet file that is currently used by Categorify.

Yeah, I think that makes sense. It seems like the unique value file is semi-implicitly taking on more functionality than it was intended to have, so I wonder if there are potentially multiple outputs/artifacts to be created from Categorify.

rjzamora commented 1 year ago

It seems like the unique value file is semi-implicitly taking on more functionality than it was intended to have

Right, I think that is the gist of what I am trying to communicate.

karlhigley commented 1 year ago

I could be missing it, but I'm not sure the current implementation of Categorify for inference time supports any functionality beyond loading the unique values file and applying the mapping from it, so that's part of what's driving the push to put everything in the unique values file.

rjzamora commented 1 year ago

I could be missing it, but I'm not sure the current implementation of Categorify for inference time supports any functionality beyond loading the unique values file and applying the mapping from it, so that's part of what's driving the push to put everything in the unique values file.

I haven’t looked at the cpp/inference code before, but you may be right about that. I’m probably proposing that the current Categorify code should be revised if we are going to establish new official encoding conventions anyway.

For the case of simple hashing, it makes no sense to me that we are currently writing out a full unique-value file, because the index of that file does not correspond to the hash bucket for each row. I'm not sure if the inference code is using that file anyway, but the _encode function (used by Categorify.transform) is essentially reading in the file (which was memory-intensive to create) and then completely ignoring it (using the result of _hash_bucket instead).

gabrielspmoreira commented 1 year ago

Hey folks. Thanks for this fruitful discussion about the proposal. I gonna start replying some of you comments from the top. P.s. Sorry for my delayed answer (was out for 2 weeks for vacation) and working in other stuff when back.

gabrielspmoreira commented 1 year ago

For example, what should this file look like if the user specifies max_size=2 or freq_threshold=2? Do we then include a row for OOV, and include the value count for values that will now be mapped to 1? E.g.
      x  x_size
1  <NA>       1
2  <NA>       1
3   cat       3
4   dog       2
@rjzamora No. We would always reserve ids 0-2 for padding and OOV and nulls. We could have them in the unique mapping parquet file to make it clear (with index starting at 0). The non-default options affect what happens starting in index 3. In your example with max_size=2 or freq_threshold=2, I think we should return

      x  x_size
0  <PAD>      0
1  <OOV>      0
2  <NULL>     1
3  <INFREQ>   1  (all infrequent items would be mapped to this index, in this case "fish")
4   cat       3
5   dog       2

gabrielspmoreira commented 1 year ago

I am also confused about cases like num_buckets=2. Will we then store hash values in the file? E.g.

              x  x_size
2    2654435768       1
3     110891232       3
4    1341686676       2
5    2980454984       1

@rjzamora We could just have a token like "" instead of the actual hashed value. The reverse categorify could just return that literal ("HASHED") so that the requester system can know that is a hashed value and no corresponding original value will be found. I max_size or freq_threshold were used together with num_buckets, that will mean that "" means an infrequent value and the requester system might have a business rule to either ignore that recommendation or replace it.

gabrielspmoreira commented 1 year ago

If so, how would we include an OOV-value count when max_size or freq_threshold is also specified?

@rjzamora I think we'll only have values encoded as in workflow.transform(), when a categorical value was not seen during workflow.fit() AND num_buckets=0. In workflow.fit() the OOV will never occur, because all values are "seen". So if max_size>0 or freq_threshold>0, then infrequent values will be mapped to a single index (3) if num_buckets=0 and to one of the many [3,num_buckets+3) if num_buckets > 0. Does that make sense?

gabrielspmoreira commented 1 year ago

[EDIT: Is it correct to map the clipped values to OOV, or should we be introducing a distinct "infrequent-values" index?]

@rjzamora Yes, we should be using separate indices to OOV (2) and hashed infrequent values (starting in index 3)

gabrielspmoreira commented 1 year ago

Please note that this is the scenario that I am most uncertain of. I am making the assumption that the current options in Categorify simply allow users to combine hashing with other options that we simply shouldn't allow them to be combined with. That is, it doesn't really make sense to mix freq_threshold or max_size with num_buckets, because the reverse mapping becomes meaningless. Once the user chooses hashing, it no longer makes sense for the mapping to depend on the value-count or desired table size. Please do correct me if I have this wrong. I strongly suspect that I am missing an obvious exception to my thinking :)

@rjzamora The user might be interested in combining freq_threshold or max size with num_buckets>1. The reason is that with num_buckets you can reduce the number of collisions, as not all infrequent items will be hashed to the same bucket/index. And reducing collisions, we are in theory able to train better embedding. In models that use hashing to reduce cardinality of feature interactions, typically we get better accuracy by allowing hashing to use more buckets (in the order of tens or hundreds of thousands).

gabrielspmoreira commented 1 year ago

Are you asking why I have <NA> in the OOV row? If so, this is because the "value" for the OOV row would need to be null. I don't think we should arbitrarily choose a string or numerical value to represent OOV.

@rjzamora I think it could be good to use literal string for representing the special tokens, so that when someone wants to use the uniques parquet for reverse categorify, by just looking up the predicted (encoded) item ids using the uniques table they would be able to different between what is "", "", "" or a valid original id, and maybe use those literals for some business rules to treat those special encoded values accordingly.

gabrielspmoreira commented 1 year ago

This leads me to a simpler question: Should max_size and freq_threshold always result in a distinct row (or more for num_buckets>1) for "infrequent values"? My example above assumed the unfrequent values should be mapped to the OOV index, but I am now realizing the design doc is calling for slightly-more complex behavior.

@rjzamora Yes, having more buckets for infrequent items reduces collision, as I explained above. I understand that is the current behavior of Categorify, when you use freq_threshold/max_size>0 and num_buckets>0. I mean, that is not a new functionality right?

rjzamora commented 1 year ago

I think it could be good to use literal string for representing the special tokens

I'm pretty sure I will be pushing back on this pretty strongly. This will become a pretty big mess, unless you are requiring all categorical columns to be represented as a string dtype. There are plenty of cases where other dtypes (like integer or timestamp) are more useful.

Is have serious doubts that we will want to have special <OOV> or <INFREQ> values for the unique-value parquet file.

gabrielspmoreira commented 1 year ago

Perhaps the simplest way to deal with the overlap between "OOV" and "infrequent buckets" is to use num_buckets to determine the number of rows (starting with index 2) to use for OOV mapping. For example, if max_size or freq_threshold is used while num_buckets is set to 1 or None, we only use index 2 for OOV mapping:

@rjzamora You are right. If we use num_buckets>0, then we will never have OOV, because any value (known or not) will be hashed to a bucket in workflow.transform(). So, when num_buckets>0 your proposal would be starting the hashed bucket at index 2 (position originally reserved for OOV), right? It makes sense. I have been thinking that we should always reserve the first 3 positions of the table for , , and that any other encoded value would start at index 3 in all cases, just to make that standard clear and easier for users.

rjzamora commented 1 year ago

I have been thinking that we should always reserve the first 3 positions of the table for , , and that any other encoded value would start at index 3 in all cases, just to make that standard clear and easier for users.

Yeah, that motivation made a lot of sense to me as well until I realized that num_buckets is really synonymous with "the number of OOV buckets to use," and so it is actually a bit more confusing to have a reserved OOV row in addition to hash-bucket rows.

gabrielspmoreira commented 1 year ago

Yeah, pretty much. We'll get some raw/unprocessed entity id in a request (like a user's UUID), use that id to fetch some unprocessed entity attributes from a key/value store (which often lives on the other side of a network request), then feed those attributes through some feature processing with NVT before they get to the model. The K/V store is basically a collection of lookup tables anyway, so we may be able to do the translation from raw id to processed id as part of the K/V store request, if we produce output in a format that could be loaded there and fetched on a per-row basis instead of keeping the whole lookup table in GPU memory.

I think this may be a "different contexts" kind of thing. At serving time we don't run the fit at all, wouldn't want to read in or hold the table in memory, and definitely don't want to do any joins. If possible, we don't even want to work with dataframes at all, since we'll have a tiny number of rows. The ideal case is that we send a short list of ids to a data store and it hands us back the data associated with them. Since it's possible to store very large data with tools like Redis/PostGres/Cassandra/BigTable and we'd probably be doing that already, I'm not worried about the space an extra column might take up.

We're currently using an approach like this in our multi-stage serving example, where we use NVT to Categorify ids and then we store both the raw and processed id in the feature store. That gives us a way to map from one to the other just by doing a look-up, which also allows us to do "reverse Categorify" without loading the unique values file into memory.

@karlhigley I think that storing the encoded values by Categorify in a Feature Store, not requiring to perform categorify transform on serving is a valid solution and we should support that.
If I understood correctly, in case the customer uses another Feature Store than the one we integrate in Merlin Systems (Feast), I understand they would have to implement that solution themselves, i.e., taking the preprocessed data from NVTabular and saving to their Feature Store the encoded ids for all categorical features, right? And for reverse categorify, would they either use the uniques parquet file, or create their own raw<->encoded item id mapping by computing unique values (e.g. df[["raw_id", "encoded_id"]].unique() ) in the preprocessed parquet files?

I think we should support reverse categorify by just using the unique parquet file. Either if the item id cardinality is large, for example 100M itens the mapping table would occupy 1.14GB (assuming encoded ids are int32 and original ids are int64). P.s. If original ids are strings, then it would be more and maybe a Feature Store / database would be a better place for users doing the reverse categorify themselves.

rjzamora commented 1 year ago

I think we should support reverse categorify by just using the unique parquet file.

I'm not clear on what it would mean to enable reverse lookup from the parquet-file alone if the parquet file includes more than one hash bucket? Or, is that statement only for the case where there is no hashing?

karlhigley commented 1 year ago

I think that storing the encoded values by Categorify in a Feature Store, not requiring to perform categorify transform on serving is a valid solution and we should support that.

Pretty sure @nv-alaiacano will push back on that for reasons I find persuasive, but that's what we're currently doing in our multi-stage example notebook. My bigger concern is supporting the other case where we store untransformed values, because in order to do that successfully we'd need to be able to run Categorify online at serving time fast enough to achieve reasonable latency—which currently requires a separate C++ implementation of the operator that I'm not sure matches the main implementation, and am nearly 100% certain won't match the main implementation after these changes (which I support, to be clear.)

I think we should support reverse categorify by just using the unique parquet file. Either if the item id cardinality is large, for example 100M itens the mapping table would occupy 1.14GB (assuming encoded ids are int32 and original ids are int64).

I'm with you in general, but I think @rjzamora's point about the unique values file taking on too many responsibilities and the difficulties that creates in the hashing case are valid too. I'm wondering if we need multiple output files to serve the needs of different parts of Merlin.

rjzamora commented 1 year ago

@gabrielspmoreira - What about the following variation of the proposal (at least, as I understand it so far)?

We move to the following conventions for the "unique-value" Parquet file:

We call this data encodings_<field-name>.parquet to reflect that the index directly corresponds to the expected encoding
- Side Note: We should not require this to be a single file. It might contain multiple files like encodings_x.*.parquet if/when needed.
Each Parquet file will use a RangeIndex (corresponding to the actual integer encoding), and three required columns: x, x_kind, and x_size
Each Parquet file may contain an optional x_list column to record observed values during the fit stage. This column will not be written by default, because it may cause significant memory pressure to aggregate these values. However, users who are okay with the performance overhead may collect these observed values before they are hashed into buckets (and therefore make reverse mapping possible from the encodings_x.parquet file alone). Another possibility would be to optionally write this information in a separate bucket_encodings_x.parquet file.

Behavior examples

Assume the following x columns (and the presented modular hash values) for all these examples:

       x  hash%2  hash%4
0    cat       1       3
1   fish       0       2
2    cat       1       3
3   None       1       3
4    dog       0       2
5   bird       1       1
6    dog       0       2
7  snake       1       1

Direct Encoding

For simple/direct encoding, the encodings_x.parquet file(s) will reserve the first three indices for "special" encodings (pad, null, and oov):

       x   x_kind  x_size
0   None      pad       0
1   None     null       1
2   None      oov       0
3    dog  literal       2
4    cat  literal       2
5   fish  literal       1
6   bird  literal       1
7  snake  literal       1

Frequency capping with single infrequent-value bucket

For frequency capping (or table-size capping), the "oov" row will be relabelled as "infreq":

      x   x_kind  x_size
0  None      pad       0
1  None     null       1
2  None   infreq       3
3   dog  literal       2
4   cat  literal       2

Frequency capping with two infrequent-value buckets

There can be an arbitrary number of infrequent-value buckets:

      x   x_kind  x_size
0  None      pad       0
1  None     null       1
2  None   infreq       1
3  None   infreq       2
4   dog  literal       2
5   cat  literal       2

Frequency capping with two infrequent-value buckets AND optional list column

Users may opt into the construction of an observed-value list column. This will certainly increase the computational cost of Categorify.fit, but may be worthwhile for some use cases. Another note that we could also write this information out in a distinct file.

      x   x_kind  x_size x_list(optional)
0  None      pad       0             None
1  None     null       1           [None]
2  None   infreq       1           [fish]
3  None   infreq       2    [bird, snake]
4   dog  literal       2            [dog]
5   cat  literal       2            [cat]

Full hash encoding with four buckets AND optional list column

      x  x_kind  x_size x_list(optional)
0  None     pad       0             None
1  None    null       1           [None]
2  None  bucket       0               []
3  None  bucket       2    [bird, snake]
4  None  bucket       3      [fish, dog]
5  None  bucket       2            [cat]

gabrielspmoreira commented 1 year ago

@rjzamora I think your examples are great and I like your proposal in general! Just have some minor comments/questions.

1) I was thinking we could have a single x column that combines both original values and special ones like (pad, null, oov, bucket, infreq,...). You propose creating a new x_type column to store the encoding type / special tokens separatelly. I think it is more organized, and also if the original values dtype are not string (e.g. int), storing strings with the special tokens in the same column x as the values could be an issue. My only concern adding x_type column is using a bit more memory when keeping that table in memory for reverse categorify. Maybe we could replace use Noneinstead of "literal" for the non-special encoded values, to save some space?

2) I don't understand the need of having the x_list column. It might contain very large lists for high cardinality items and small number of hashing buckets. And it doesn't allow reversing back hashed values to the original ids anyway. I think we don't need to look for a solution for reversing back hashed values to the original item id, because that should be a rare case in retrieval models that return top-k predicted item ids. Let me explain why. Typically, we won't use Categorify(num_buckets>0) alone for item id, because that would hash all items into buckets and the model would be predicting top-k buckets instead of top-k items. In general, when we want to reduce the cardinality of item id, we use either max_size or freq_threshold to keep the most popular items "unhashed" and want to hash the very long tail of items that doesn't have a bare minimum of interactions (e.g. 5 or 10) so that meaningful embeddings can be trained. As the models tend to predict among top-k items the somewhat frequent items, I believe it would be rare having the model predicting among top-k some ids corresponding to the hashing buckets reserved for infrequent items (because they are infrequent targets during model training). And if that happens, the requester system should be able to know that an item is a hashed infrequent value, to either ignore it or replace it by another item recommendation using some business rule.

Does that make sense?

rjzamora commented 1 year ago

Good questions @gabrielspmoreira !

My only concern adding x_type column is using a bit more memory when keeping that table in memory for reverse categorify. Maybe we could replace use None instead of "literal" for the non-special encoded values, to save some space?

Internal code should not need to read in the "x_kind" column, as it would only be for users who want to inspect the encoding_x.parquet file. However, you are correct that this would take up more memory when writing the file, and would require a bit more space on disk. I was assuming that we wold use a categorical dtype in memory (not strings/objects), and that the column would be dictionary-encoded in parquet (meaning that the file would only store the unique strings ("null", "oov", etc.) and would use a light-weight integer array for the encoding of each element. With that said, its always less memory/storage to use nan in places where a literal value is not necessary (e.g. "literal").

Regarding x_list: I would be very happy not to support this if it is not needed :)

gabrielspmoreira commented 1 year ago

Makes sense @rjzamora . Just to explain what I see as a reverse categorify op we might have in Systems (or that our customers might want to build themselves) ... The reverse categorify op would load the encoding_x.parquet file and keep it in memory in some dataframe or dictionary data structure. Then, when it receives as input some top-k (hundreds or thousands) predicted encoded item ids, that op would just perform a simple lookup in the encoding mapping to obtain the original categorical values, which might be either ints or strings (e.g. product SKUs). If some of the top-k ids are special encoded values (e.g. oov, null, hashed), that op should have some standard token (e.g. "", "", "" to return to the requester system (as there is no original id to return in such cases). Does that make sense?

rjzamora commented 1 year ago

If some of the top-k ids are special encoded values (e.g. oov, null, hashed), that op should have some standard token (e.g. "", "", "" to return to the requester system (as there is no original id to return in such cases). Does that make sense?

Yes, I think this makes sense to me. If you use encoding_x.parquet to directly look up the original value for some encoding you would get back either a literal value or nan. If the result is a literal value, the reverse-encoding logic would just return that value. If the result is nan, the reverse-encoding logic could return some "special" string value to distinguish between "null" and "oov", "infreq", or "bucket". Is that what you are saying?

rjzamora commented 1 year ago

My only concern adding x_type column is using a bit more memory when keeping that table in memory for reverse categorify.

Just to come back to this for a moment - Including this extra column is definitely not something I really want to do, because it does add approximately one byte per row (assuming we use a categorical dtype). I guess the only reason I proposed it is that I feel even less confident about using special tokens in the x column. In my mind, the most efficient approach is to simply leave out the x_type column, and use null values in the x column for anything that is not a direct/literal encoding.

This shouldn't hinder reverse mapping in any way, because 0 will always be for padding, 1 will always be nulls, and then everything up until the first non-null x row will be for oov/infreq/bucket encoding (the specific meaning depends on the encoding options, but the meaning can also be inferred by the starting index and the existence of non-null values in x).

karlhigley commented 1 year ago

We don't necessarily have to hold the data for reverse categorify in memory in the same format it's stored on disk (or at all, if we store the majority of it in a data store we're going to query anyway.)

rjzamora commented 1 year ago

Okay, given the useful comments and information from both @gabrielspmoreira and @karlhigley, I'm thinking the best solution here is to keep the encodings.x.parquet standard as simple as possible. That is,

The parquet files will use a simple RangIndex to specify the encoded integer values.
The encodings.x.parquet data must include an x column (where x is the name of the column or column group), and an x_size column:
- The x column shall use a null value for any category that does not correspond to a direct unique-value mapping.
- The x_size column should include the observed-value counts collected during the original Categorify.fit operation.
The 0 encoding will be reserved for padding
The 1 encoding will be reserved for null values
Encodings 2 : 2 + num_buckets will be used for OOV, infrequent values, or hash buckets. Note that the default and minimum value for num_buckets shall be 1.
Encodings 2 + num_buckets and greater will be used for literal unique-value encodings.

As far as I understand, this standard makes the encoding rules and results pretty clear. Embeddings 0 and 1 will always mean the same thing (padding and nulls), and so the value of num_buckets can be easily inferred from the number of null values in the x column (num_buckets = null_count - 2 in all cases). We can also infer the number of unique-value encodings from the number of non-null values in the x column. This all means that a simple Merlin utility could easily use this file to provide a reverse mapping (as long as they don't expect us to specify a list of all possible values observed for a specific "bucket" encoding - where we would probably want the utility to return either "infreq" or "hashed", depending on the value of num_buckets).

rnyak commented 1 year ago

@rjzamora I am a bit confused of 2 + num_buckets and greater will be used for literal unique-value encodings.. are we making num_buckets a required arg to be set? or it is optional? what will be the default value for it? None? This arg should not be used/set in a native/plain Categorify() . It should be None.

do I understand correctly?

rnyak commented 1 year ago

and what happens to max_size, freq_threshold args?

rjzamora commented 1 year ago

Good questions @rnyak - I am suggesting that it should be optional for the user to pass in num_buckets, but that the default value should be 1. However, the other Categorify kwargs/defaults may need to be tweaked a bit to better-clarify the fact that setting num_buckets does not mean that we will forgo unique-value collection, but that we will use num_buckets (with hashing when num_buckets>1) to split up any categories that do not satisfy max_size or freq_threshold.

To this end, I suppose the max_size and freq_threshold should probably correspond to direct encodings only. This way the user would be required to set something like max_size=0 to ensure that all encodings (besides pad/null) are hash buckets.

Does this make sense?

rnyak commented 1 year ago

Good questions @rnyak - I am suggesting that it should be optional for the user to pass in num_buckets, but that the default value should be 1. However, the other Categorify kwargs/defaults may need to be tweaked a bit to better-clarify the fact that setting num_buckets does not mean that we will forgo unique-value collection, but that we will use num_buckets (with hashing when num_buckets>1) to split up any categories that do not satisfy max_size or freq_threshold.

To this end, I suppose the max_size and freq_threshold should probably correspond to direct encodings only. This way the user would be required to set something like max_size=0 to ensure that all encodings (besides pad/null) are hash buckets.

Does this make sense?

just to clarify. when you say num_buckets=1 does that mean we are not gonna do any hashing/bucketing right? we will do vanilla encoding, meaning every category will be encoded to a unique value?

This way the user would be required to set something like max_size=0 to ensure that all encodings (besides pad/null) are hash buckets.

max_size =0 that is interesting and a bit confusing. max_size means we want to categorify max_size of most frequent values for a column will be encoded regularly, and the rest will go to a bucket / mapped to a single value?

considering that, if we set max_size =0, what's that mean? does it mean all categories are mapped to one single value? and what's the point of doing that?

rjzamora commented 1 year ago

just to clarify. when you say num_buckets=1 does that mean we are not gonna do any hashing/bucketing right? we will do vanilla encoding, meaning every category will be encoded to a unique value?

Short answer is yes.

Longer answer: I'm suggesting that what we are currently calling num_buckets should probably mean something a bit different than it does right now, and that it should only indicates the number of indices we are reserving for "out-of-vocabulary" encodings (which includes "infrequent" values and hash buckets, since both infrequent values and hash-buckets are not a part of the reversible vocabulary).

The picture I have in my mind is that there are really only four types of encodings to consider:

"pad" (reserve 1 index)
"null" (reserve 1 index)
"oov/irreversible" (reserve 1+ indices)
"distinct" (use the remaining 0+ indices)

considering that, if we set max_size =0, what's that mean? does it mean all categories are mapped to one single value? and what's the point of doing that?

Yes, I agree. I don't think anyone would want to set max_size=0 with the default num_buckets=1. I'm suggesting that people who don't want/need reversible encoding for a particular column may want to combine max_size=0 with num_buckets>>1 (to hash everything).

rjzamora commented 1 year ago

Update: I have started to use https://github.com/NVIDIA-Merlin/NVTabular/pull/1692 as a space to experiment with the ideas discussed here. I ultimately found that the simplest approach is to move away from the idea that we should include more information in "unique.x.parquet". Instead, I now feel that we should simplify that file even further, and include a simple summary of embedding information in a distinct "meta.x.parquet" file.

For example, when using freq_threshold=3 and num_buckets=10 together, we would write something like the following for a hypothetical "Author" column:

Contents of "unique.Author.parquet":

    Author  Author_size
12  User_B            3
13  User_C            3

Here, we see that only two unique values satisfied the freq_threshold=3 limit in the "Author" column, and since we are reserving index 0 for padding, index 1 for nulls, and indices [2,12) for oov hashing, the RangeIndex offset is 12. Note that this "unique.Author.parquet" file only includes information about in-vocabulary unique values. It does not include information about nulls or indistinct categories. If the user wants more information about null or indistinct categories, they would need to inspect the corresponding "meta.Author.parquet" file.

Contents of "meta.Author.parquet":

     kind  offset  num_indices  num_observed
0     pad       0            1             0
1    null       1            1             0
2     oov       2           10             4
3  unique      12            2             6

Here, we find a summary of the high-level encoding metadata. The "kind" column specifies the kind of category that row corresponds to.

The "offset" column specifies the index offset for that kind of category. For example, "pad" will always have an offset of 0, "null" will always have an offset of 1, and "oov" will always have an offset of 2. The "unique"-row offset should always match the RangeIndex offset observed in the "unique.x.parquet" file (which is 12 in this case).

The "num_indices" column specifies how many encoding indices have been reserved for each kind of category. Note that the "oov" row should always have a "num_indices" value that matches the num_buckets setting (or 1), while the value in the "unique" row will match the length of the "unique.x.parquet" file (2 in this case).

The "num_observed" column specifies the total number of occurrences observed for each "kind" of category during the Categorify.fit stage. This means that the sum of all values in the "num_observed" column must match the total number of rows in the dataset used to perform the fit. This also means that we can use this row to clearly distinguish the null-value count from that of the infrequent categories (which fall into the "oov" category since we do not record them in our "unique.x.parquet"-file vocabulary).

@gabrielspmoreira @rnyak @karlhigley - Let me know if you think a solution like this makes sense to you. Note that I tried recording all of this information in the same "unique.x.parquet" file, and I'm pretty confident that we want to avoid that solution.

karlhigley commented 1 year ago

I think this makes sense, but we're definitely going to need to provide functionality elsewhere in Merlin to do an explicit reverse mapping. From experience, I can promise that some user somewhere out there is going to look at the unique values files, try to write their own mapping, ignore the index, and complain to us that Categorify isn't producing the right values because it doesn't line up with the rows in the unique values file (if one assumes the first row is row zero, which it clearly isn't in your example.)

I think we're in a pretty good place now to get the reverse mapping done as a generic Merlin array/tensor operator that can run almost anywhere (except in NVT, at least for now.)

rjzamora commented 1 year ago

we're definitely going to need to provide functionality elsewhere in Merlin to do an explicit reverse mapping.

Agree - Though the logic would not need to be very complex with this latest design.

From experience, I can promise that some user somewhere out there is going to look at the unique values files, try to write their own mapping, ignore the index, and complain to us that Categorify isn't producing the right values because it doesn't line up with the rows in the unique values file (if one assumes the first row is row zero, which it clearly isn't in your example.)

Yeah, I totally agree that this solution is still likely to confuse some users. With that said, this last design is certainly less confusing that what we have now, and it is much better for the specific user-provided vocab example you are describing (since the unique.*.parquet mapping is not expected to include null or oov information). I don't think it would be too difficult to dramatically improve the docstring for providing a user-provided vocabulary, and make it completely clear that we will always start the unique-value index after oov/hash buckets.

It sounds like you think it is problematic that we are always reserving 0, 1. and 2+ for specific purposes - even when the user provides a specific unique-value mapping? Would it be more intuitive to start the null and oov indices after the unique-value encodings? I don't have any strong opinions about this at all - Just happened to run with the original suggestion to make the pad/null/oov-encodings consistent for all cases.

karlhigley commented 1 year ago

Oh, I don't have a problem with any of this or any suggestion for making it better! I'm just saying that we (me, in all likelihood) need to cut users off at the pass by providing reverse mapping functionality before they try to implement reverse mapping themselves again, because I've already seen that go weirdly once or twice. These changes entail more work to do, but it'll happen outside of Categorify itself and is stuff we had reasons to do anyway. We just can't let it languish in the backlog much longer, especially with this overhaul happening.

rjzamora commented 1 year ago

I'm just saying that we (me, in all likelihood) need to cut users off at the pass by providing reverse mapping functionality before they try to implement reverse mapping themselves again

Yes, I agree that it will be best to provide a simple reverse mapping utility

NVIDIA-Merlin / NVTabular

[BUG] Simplify Categorify encoding for better standardization and easier reverse mapping #1748

Context

Problem

Proposed solution

Save a consistence mapping table to parquet

Fixing the first 3 values of encoding table: \<PADDING>, \<OOV>, \<NULL>

Hashing / Frequency hashing

Pre-defined vocabulary

Proposed encoding strategy

Default

Setting max_size

Setting num_buckets

Possible compromise [EDIT]

Behavior examples

Direct Encoding

Frequency capping with single infrequent-value bucket

Frequency capping with two infrequent-value buckets

Frequency capping with two infrequent-value buckets AND optional list column

Full hash encoding with four buckets AND optional list column

Setting `max_size`

Setting `num_buckets`