gabrielspmoreira closed this issue 1 year ago.
Thanks for writing this up @gabrielspmoreira!
This proposal seems to be moving in the right direction. The only thing I am still a bit unclear on is the specific expectation of what data will be included in the unique-value file. Is the overall goal to make it possible for the reverse mapping to be extracted from the unique-value file alone?
Let's take the following column example:
x
0 cat
1 cat
2 dog
3 <NA>
4 fish
5 cat
6 dog
I'm assuming the default version of Categorify will produce a unique-value file like this:
x x_size
2 <NA> 1
3 cat 3
4 dog 2
5 fish 1
where the index is just a RangeIndex starting at 2 (the null index), because 0 and 1 are reserved for padding and OOV. What I don't always understand is what should happen for non-default options. For example, what should this file look like if the user specifies max_size=2 or freq_threshold=2? Do we then include a row for OOV, and include the value count for values that will now be mapped to 1? E.g.
x x_size
1 <NA> 1
2 <NA> 1
3 cat 3
4 dog 2
If so, should the "value" be listed as <NA> for OOV? I am also confused about cases like num_buckets=2. Will we then store hash values in the file? E.g.
x x_size
2 2654435768 1
3 110891232 3
4 1341686676 2
5 2980454984 1
If so, how would we include an OOV-value count when max_size or freq_threshold is also specified?
@rnyak @gabrielspmoreira @nv-alaiacano @sararb can you add examples of how the mapping works under different scenarios?
@rjzamora That's a good question and we need to clarify. @gabrielspmoreira in the scenario of simple hashing (setting num_buckets > 1), how is this going to be mapped in the unique_item_id parquet file? Note that in the parquet file we have item_id and item_size columns, where the former is the original item_id. What are we going to write under the item_id column in the hashing case? I don't think we want to write down a list of all the original item_ids that fall in Hash bucket #1. We should clarify that.
Besides, it is better to keep the item_id_size column in the unique item_id parquet file, since it is useful information: it basically corresponds to item_id frequencies, and these frequencies can be used for different purposes like we do in PopularityLogitsCorrection in Merlin Models.
It might make sense to save the original item id, the hashed value (when hashing is applied), and the item id counts all together in the Parquet files, because then it could be used as a lookup at serving time (so we don't actually have to run the hash functions there) and the counts could be used to do popularity-based filtering/sampling/re-ordering.
Not completely sure I follow - A ~key~ common motivation for hashing in practice is that recording all unique values is intractable (due to high-cardinality or other limitations). If the unique-value file was required to record all unique values that have been hashed to a particular value, then performance would drop down to that of direct encoding (or worse).
If the unique-value file was required to record all unique values
>> I say we should not record them. The question is more about how we design the unique parquet file in the case of simple hashing with the num_buckets >= 1 arg, for example. What will we write for the original item_id? The hashed value? And would the index be the encoded bucket id? What's your idea for this situation?
If the unique-value file was required to record all unique values that have been hashed to a particular value, then performance would drop down to that of direct encoding (or worse).
Hmm, what I was thinking may or may not make sense. Let me see if I can lay out my thought process and see where we diverge:
Maybe the difference in what we're thinking comes from the context difference between "in GPU memory" and "in storage somewhere (device memory, disk, maybe on the other end of a network call)"?
Please keep in mind that I have much less experience with the production/serving side of things, so I'm sorry if I'm being naive.
Hashing high-cardinality categoricals like ids helps to reduce the size of embedding tables in the resulting models
Yeah, that makes sense. It is probably important to establish that there are two big reasons why we may need hashing. The first is to optimize the size of embedding tables - I'm assuming this motivation kicks in even when it is still possible to comfortably fit all unique values in memory. The second is to get around the fact that a truly high-cardinality feature may be too large to fit in memory (or it is just large enough to put you at risk of OOM errors).
For the first case, it may be beneficial to take some extra time in the NVT fit to record the mapping from each hash value to the list of all known "real" values. However, this is definitely not something you would want to require Categorify to do for all cases. It is also only beneficial if you need to be able to map a hashed feature to the list of all known "real" features.
We're also reasonably likely to use the unhashed ids to look up features at serving time (e.g. from a feature store, Redis, etc)
Can you explain this a bit more? Are you just saying that you will often be working with "real" rather than encoded features at serving time? If so, makes sense to me.
Having a way to put the raw id, categorified id, and hashed id next to each other allows us to avoid running Categorify at serving time and replace it with an extra column returned by a lookup we'd already do anyway
It seems to me like hash-based encoding should be the one case where the fit stage of Categorify should not even need to read-in and use the "unique-value" file. I realize this may not be the case now, but all you need to do is map null values to 1, and then encode all other values by doing a hash + mod(num_buckets) + add(3). Therefore, it seems like running Categorify could be much faster than reading in and joining a lookup table.
Some notes to establish my current understanding of what the unique-value file should look like for various Categorify scenarios...
Assume the following raw column x:
x
0 Edith
1 Alice
2 <NA>
3 Bob
4 Charlie
5 Edith
6 Frank
7 <NA>
8 Dan
9 Edith
10 Frank
11 Gary
12 Bob
I would expect a default (direct) Categorify fit to produce:
x x_size
1 <NA> 2
2 <NA> 0
3 Edith 3
4 Bob 2
5 Frank 2
6 Alice 1
7 Charlie 1
8 Dan 1
9 Gary 1
Here, we see that the index of 0 is reserved for padding, and is therefore completely excluded from the file. Index-value 1 is included to record the null-value count, and index 2 is included to record the OOV count. However, for a default/direct encoding, there should never be any OOV values.
max_size
If the Categorify operation is modified to use max_size=4, I would assume that we are essentially taking the exact same result, but then leaving out rows > 4 and adding the dropped "x_size" sum to the OOV row:
x x_size
1 <NA> 2
2 <NA> 6
3 Edith 3
4 Bob 2
Note that a very similar logic would be used for frequency capping. That is, we start with the full mapping, and then clip the table down to the necessary size.
[EDIT: Is it correct to map the clipped values to OOV, or should we be introducing a distinct "infrequent-values" index?]
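A rough pandas sketch of that clipping idea, purely illustrative: `full` is the default table from the example above, and folding the dropped counts into index 2 reflects the assumption being questioned in the EDIT note.

```python
import pandas as pd

# Default unique-value table from the example above (index = encoded id).
full = pd.DataFrame(
    {
        "x": [None, None, "Edith", "Bob", "Frank", "Alice", "Charlie", "Dan", "Gary"],
        "x_size": [2, 0, 3, 2, 2, 1, 1, 1, 1],
    },
    index=range(1, 10),
)

def clip_to_max_size(table: pd.DataFrame, max_size: int) -> pd.DataFrame:
    kept = table.loc[:max_size].copy()                 # keep rows up to the cap
    dropped = table.loc[max_size + 1:]                 # everything past the cap
    kept.loc[2, "x_size"] += dropped["x_size"].sum()   # fold dropped counts into the OOV row
    return kept

clip_to_max_size(full, max_size=4)  # matches the max_size=4 table above
```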
num_buckets
If the user specifies num_buckets=4, then we should probably avoid writing an "x" column in the file, and only include "x_size". In this case, the unique value file should only be used to record value-count statistics, because the actual raw-to-encoded value mapping is simply add(mod(hash(raw), num_buckets), 3) for all cases.
x_size
1 2
2 0
3 6
4 3
5 3
6 1
Please note that this is the scenario that I am most uncertain of. I am making the assumption that the current options in Categorify simply allow users to combine hashing with other options that we simply shouldn't allow them to be combined with. That is, it doesn't really make sense to mix freq_threshold or max_size with num_buckets, because the reverse mapping becomes meaningless. Once the user chooses hashing, it no longer makes sense for the mapping to depend on the value-count or desired table size. Please do correct me if I have this wrong. I strongly suspect that I am missing an obvious exception to my thinking :)
For the default scenario, can not the item_id be <OOV> at the index 2 (like below)? Why does it read as <NA>? We won't have any OOVs in training but in the valid set we can; therefore, in the valid set, when we see encoded categories (e.g. item_ids) as 2, we'd know these are for OOVs.
x x_size
1 <NA> 2
2 <OOV> 0
3 Edith 3
4 Bob 2
5 Frank 2
6 Alice 1
7 Charlie 1
8 Dan 1
9 Gary 1
For the default scenario, can not the item_id be <OOV> at the index 2 (like below)? why it reads as <NA>?
Are you asking why I have <NA> in the OOV row? If so, this is because the "value" for the OOV row would need to be null. I don't think we should arbitrarily choose a string or numerical value to represent OOV.
we won't have any OOVs in training but in the valid set we can, therefore in the valid set when we see encoded categories (e.g. item_ids) as 2 we'd know these are for OOVs.
Yes it makes sense to reserve index 2 for OOV, and that is what I was attempting to show (1 is for nulls, and 2 is for OOV - However, it doesn't make sense to include a literal "value" for either of these rows).
I strongly suspect that I am missing an obvious exception to my thinking :)
It seems like one case that probably complicates my basic understanding is that of "frequency hashing". Judging from the design-doc, it seems like this is the primary case where there is a mix of direct and hash-based encoding. Therefore, the idea to leave out the "x" column in the unique-value file may be problematic. I suppose any rows corresponding to hash buckets could just include null values in the "x" column? E.g. for freq_threshold=2 and num_buckets=2:
x x_size
1 <NA> 2
2 <NA> 0
3 <NA> 1
4 <NA> 3
5 Edith 3
6 Bob 2
7 Frank 2
where indices 3 and 4 correspond to hash buckets that infrequent values are mapped to instead of OOV. I can't say I really understand why this would be more useful than mapping infrequent values to OOV, but I can understand the motivation to provide the option.
This leads me to a simpler question: Should max_size and freq_threshold always result in a distinct row (or more for num_buckets>1) for "infrequent values"? My example above assumed the infrequent values should be mapped to the OOV index, but I am now realizing the design doc is calling for slightly more complex behavior.
Perhaps the simplest way to deal with the overlap between "OOV" and "infrequent buckets" is to use num_buckets to determine the number of rows (starting with index 2) to use for OOV mapping. For example, if max_size or freq_threshold is used while num_buckets is set to 1 or None, we only use index 2 for OOV mapping:
x x_size
1 <NA> 2
2 <NA> 6
3 Edith 3
4 Bob 2
Here, index 2 corresponds to the 0th (and only) OOV "bucket".
However, if num_buckets is set greater than 1 (say num_buckets=3), we can simply use indices 2:2+num_buckets to map the "infrequent" values to OOV buckets (using hashing):
x x_size
1 <NA> 2
2 <NA> 2
3 <NA> 1
4 <NA> 3
5 Edith 3
6 Bob 2
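A tiny sketch of how that layout might be applied at transform time. This is illustrative only: Python's built-in hash stands in for whatever hash function Categorify would actually use, and `direct` is the frequent-value mapping from the table above.

```python
import pandas as pd

direct = {"Edith": 5, "Bob": 6}   # frequent values keep a direct encoding
num_buckets = 3                   # indices 2, 3, 4 are the "OOV buckets"

def encode(value):
    if pd.isna(value):
        return 1                               # reserved null index
    if value in direct:
        return direct[value]                   # literal encoding
    return 2 + hash(value) % num_buckets       # hashed into one of the OOV buckets

[encode(v) for v in ["Edith", "Gary", None]]
```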
We're also reasonably likely to use the unhashed ids to look up features at serving time (e.g. from a feature store, Redis, etc)
Can you explain this a bit more? Are you just saying that you will often be working with "real" rather than encoded features at serving time? If so, makes sense to me.
Yeah, pretty much. We'll get some raw/unprocessed entity id in a request (like a user's UUID), use that id to fetch some unprocessed entity attributes from a key/value store (which often lives on the other side of a network request), then feed those attributes through some feature processing with NVT before they get to the model. The K/V store is basically a collection of lookup tables anyway, so we may be able to do the translation from raw id to processed id as part of the K/V store request, if we produce output in a format that could be loaded there and fetched on a per-row basis instead of keeping the whole lookup table in GPU memory.
Having a way to put the raw id, categorified id, and hashed id next to each other allows us to avoid running Categorify at serving time and replace it with an extra column returned by a lookup we'd already do anyway
It seems to me like hash-based encoding should be the one case where the fit stage of Categorify should not even need to read-in and use the "unique-value" file. I realize this may not be the case now, but all you need to do is map null values to 1, and then encode all other values by doing a hash + mod(num_buckets) + add(3). Therefore, it seems like running Categorify could be much faster than reading in and joining a lookup table.
I think this may be a "different contexts" kind of thing. At serving time we don't run the fit at all, wouldn't want to read in or hold the table in memory, and definitely don't want to do any joins. If possible, we don't even want to work with dataframes at all, since we'll have a tiny number of rows. The ideal case is that we send a short list of ids to a data store and it hands us back the data associated with them. Since it's possible to store very large data with tools like Redis/PostGres/Cassandra/BigTable and we'd probably be doing that already, I'm not worried about the space an extra column might take up.
We're currently using an approach like this in our multi-stage serving example, where we use NVT to Categorify ids and then we store both the raw and processed id in the feature store. That gives us a way to map from one to the other just by doing a look-up, which also allows us to do "reverse Categorify" without loading the unique values file into memory.
I think this may be a "different contexts" kind of thing. At serving time we don't run the fit at all, wouldn't want to read in or hold the table in memory, and definitely don't want to do any joins. If possible, we don't even want to work with dataframes at all, since we'll have a tiny number of rows. The ideal case is that we send a short list of ids to a data store and it hands us back the data associated with them.
This totally makes sense, and is mostly consistent with what I was thinking. At serving time you could certainly use a feature store to encode raw data, and use simple hashing (and update the K/V store) if there is a key miss. If you use the Categorify encodings up-front to populate the k/v store, then you shouldn't have a key miss for data you have already seen in your training set.
What I'm saying we probably don't want to do is require Categorify.fit to aggregate a list of all unique values when we already know that they are being mapped by hash. The need to aggregate high-cardinality values is a known OOM pain point in NVT, and simply mapping a function to update a k/v store (by hash) would be much more reliable.
I'm not worried about the space an extra column might take up.
Just want to clarify that the concern has nothing to do with adding a column to the parquet file, but with the need to actually aggregate the extra data to write in that column (probably across multiple parquet files). For a dataset with billions of unique values, we are talking about aggregating gigabytes of extra data. My intuition is that you probably do want to provide a formal mechanism to record a mapping for all encountered unique values. However, I'm trying to discourage this mechanism from becoming a row in the unique-value parquet file that is currently used by Categorify.
I'm trying to discourage this mechanism from becoming a row in the unique-value parquet file that is currently used by Categorify.
Yeah, I think that makes sense. It seems like the unique value file is semi-implicitly taking on more functionality than it was intended to have, so I wonder if there are potentially multiple outputs/artifacts to be created from Categorify.
It seems like the unique value file is semi-implicitly taking on more functionality than it was intended to have
Right, I think that is the gist of what I am trying to communicate.
I could be missing it, but I'm not sure the current implementation of Categorify for inference time supports any functionality beyond loading the unique values file and applying the mapping from it, so that's part of what's driving the push to put everything in the unique values file.
I could be missing it, but I'm not sure the current implementation of Categorify for inference time supports any functionality beyond loading the unique values file and applying the mapping from it, so that's part of what's driving the push to put everything in the unique values file.
I haven't looked at the cpp/inference code before, but you may be right about that. I'm probably proposing that the current Categorify code should be revised if we are going to establish new official encoding conventions anyway.
For the case of simple hashing, it makes no sense to me that we are currently writing out a full unique-value file, because the index of that file does not correspond to the hash bucket for each row. I'm not sure if the inference code is using that file anyway, but the _encode function (used by Categorify.transform) is essentially reading in the file (which was memory-intensive to create) and then completely ignoring it (using the result of _hash_bucket instead).
Hey folks. Thanks for this fruitful discussion about the proposal. I'm going to start replying to some of your comments from the top. P.S. Sorry for my delayed answer (I was out for 2 weeks on vacation and was working on other things when I got back).
For example, what should this file look like if the user specifies max_size=2 or freq_threshold=2? Do we then include a row for OOV, and include the value count for values that will now be mapped to 1? E.g.
x x_size
1 <NA> 1
2 <NA> 1
3 cat 3
4 dog 2
@rjzamora No. We would always reserve ids 0-2 for padding, OOV, and nulls. We could have them in the unique mapping parquet file to make it clear (with the index starting at 0). The non-default options affect what happens starting at index 3. In your example with max_size=2 or freq_threshold=2, I think we should return
x x_size
0 <PAD> 0
1 <OOV> 0
2 <NULL> 1
3 <INFREQ> 1 (all infrequent items would be mapped to this index, in this case "fish")
4 cat 3
5 dog 2
I am also confused about cases like num_buckets=2. Will we then store hash values in the file? E.g.
x x_size
2 2654435768 1
3 110891232 3
4 1341686676 2
5 2980454984 1
@rjzamora We could just have a token like "<HASHED>" in the x column for those rows. If max_size or freq_threshold were used together with num_buckets, that will mean that those "<HASHED>" rows correspond to hash buckets.
If so, how would we include an OOV-value count when max_size or freq_threshold is also specified?
@rjzamora I think we'll only have values encoded as <OOV> for values not seen during fit; the infrequent values would be counted under the hashed/infrequent buckets instead.
[EDIT: Is it correct to map the clipped values to OOV, or should we be introducing a distinct "infrequent-values" index?]
@rjzamora Yes, we should be using separate indices for OOV (2) and hashed infrequent values (starting at index 3)
Please note that this is the scenario that I am most uncertain of. I am making the assumption that the current options in Categorify simply allow users to combine hashing with other options that we simply shouldn't allow them to be combined with. That is, it doesn't really make sense to mix freq_threshold or max_size with num_buckets, because the reverse mapping becomes meaningless. Once the user chooses hashing, it no longer makes sense for the mapping to depend on the value-count or desired table size. Please do correct me if I have this wrong. I strongly suspect that I am missing an obvious exception to my thinking :)
@rjzamora The user might be interested in combining freq_threshold or max size with num_buckets>1. The reason is that with num_buckets you can reduce the number of collisions, as not all infrequent items will be hashed to the same bucket/index. And reducing collisions, we are in theory able to train better embedding. In models that use hashing to reduce cardinality of feature interactions, typically we get better accuracy by allowing hashing to use more buckets (in the order of tens or hundreds of thousands).
Are you asking why I have <NA> in the OOV row? If so, this is because the "value" for the OOV row would need to be null. I don't think we should arbitrarily choose a string or numerical value to represent OOV.
@rjzamora I think it could be good to use literal strings for representing the special tokens, so that when someone wants to use the uniques parquet for reverse categorify, by just looking up the predicted (encoded) item ids in the uniques table, they would be able to differentiate between what is an original value and what is a special token (e.g. "<OOV>", "<INFREQ>").
This leads me to a simpler question: Should max_size and freq_threshold always result in a distinct row (or more for num_buckets>1) for "infrequent values"? My example above assumed the infrequent values should be mapped to the OOV index, but I am now realizing the design doc is calling for slightly more complex behavior.
@rjzamora Yes, having more buckets for infrequent items reduces collisions, as I explained above. I understand that is the current behavior of Categorify when you use freq_threshold/max_size>0 and num_buckets>0. I mean, that is not new functionality, right?
I think it could be good to use literal string for representing the special tokens
I'm pretty sure I will be pushing back on this pretty strongly. This will become a pretty big mess, unless you are requiring all categorical columns to be represented as a string dtype. There are plenty of cases where other dtypes (like integer or timestamp) are more useful.
I have serious doubts that we will want to have special <OOV> or <INFREQ> values in the unique-value parquet file.
Perhaps the simplest way to deal with the overlap between "OOV" and "infrequent buckets" is to use num_buckets to determine the number of rows (starting with index 2) to use for OOV mapping. For example, if max_size or freq_threshold is used while num_buckets is set to 1 or None, we only use index 2 for OOV mapping:
@rjzamora You are right. If we use num_buckets>0, then we will never have OOV, because any value (known or not) will be hashed to a bucket in workflow.transform(). So, when num_buckets>0, your proposal would be to start the hashed buckets at index 2 (the position originally reserved for OOV), right? It makes sense.
I have been thinking that we should always reserve the first 3 positions of the table for <PAD>, <OOV>, and <NULL>, and that any other encoded value would start at index 3 in all cases, just to make that standard clear and easier for users.
Yeah, that motivation made a lot of sense to me as well until I realized that num_buckets is really synonymous with "the number of OOV buckets to use," and so it is actually a bit more confusing to have a reserved OOV row in addition to hash-bucket rows.
Yeah, pretty much. We'll get some raw/unprocessed entity id in a request (like a user's UUID), use that id to fetch some unprocessed entity attributes from a key/value store (which often lives on the other side of a network request), then feed those attributes through some feature processing with NVT before they get to the model. The K/V store is basically a collection of lookup tables anyway, so we may be able to do the translation from raw id to processed id as part of the K/V store request, if we produce output in a format that could be loaded there and fetched on a per-row basis instead of keeping the whole lookup table in GPU memory.
I think this may be a "different contexts" kind of thing. At serving time we don't run the fit at all, wouldn't want to read in or hold the table in memory, and definitely don't want to do any joins. If possible, we don't even want to work with dataframes at all, since we'll have a tiny number of rows. The ideal case is that we send a short list of ids to a data store and it hands us back the data associated with them. Since it's possible to store very large data with tools like Redis/PostGres/Cassandra/BigTable and we'd probably be doing that already, I'm not worried about the space an extra column might take up. We're currently using an approach like this in our multi-stage serving example, where we use NVT to Categorify ids and then we store both the raw and processed id in the feature store. That gives us a way to map from one to the other just by doing a look-up, which also allows us to do "reverse Categorify" without loading the unique values file into memory.
@karlhigley I think that storing the values encoded by Categorify in a Feature Store, so that we don't need to perform the Categorify transform at serving time, is a valid solution and we should support it.
If I understood correctly, in case the customer uses another Feature Store than the one we integrate in Merlin Systems (Feast), I understand they would have to implement that solution themselves, i.e., taking the preprocessed data from NVTabular and saving to their Feature Store the encoded ids for all categorical features, right?
And for reverse categorify, would they either use the uniques parquet file, or create their own raw<->encoded item id mapping by computing unique values (e.g. df[["raw_id", "encoded_id"]].unique()) in the preprocessed parquet files?
I think we should support reverse categorify by just using the unique parquet file, even if the item id cardinality is large: for example, for 100M items the mapping table would occupy 1.14GB (assuming encoded ids are int32 and original ids are int64). P.s. If the original ids are strings, then it would be more, and maybe a Feature Store / database would be a better place for users doing the reverse categorify themselves.
I think we should support reverse categorify by just using the unique parquet file.
I'm not clear on what it would mean to enable reverse lookup from the parquet-file alone if the parquet file includes more than one hash bucket? Or, is that statement only for the case where there is no hashing?
I think that storing the encoded values by Categorify in a Feature Store, not requiring to perform categorify transform on serving is a valid solution and we should support that.
Pretty sure @nv-alaiacano will push back on that for reasons I find persuasive, but that's what we're currently doing in our multi-stage example notebook. My bigger concern is supporting the other case where we store untransformed values, because in order to do that successfully we'd need to be able to run Categorify online at serving time fast enough to achieve reasonable latency—which currently requires a separate C++ implementation of the operator that I'm not sure matches the main implementation, and am nearly 100% certain won't match the main implementation after these changes (which I support, to be clear.)
I think we should support reverse categorify by just using the unique parquet file, even if the item id cardinality is large: for example, for 100M items the mapping table would occupy 1.14GB (assuming encoded ids are int32 and original ids are int64).
I'm with you in general, but I think @rjzamora's point about the unique values file taking on too many responsibilities and the difficulties that creates in the hashing case are valid too. I'm wondering if we need multiple output files to serve the needs of different parts of Merlin.
@gabrielspmoreira - What about the following variation of the proposal (at least, as I understand it so far)?
We move to the following conventions for the "unique-value" Parquet file:
- The file is named encodings_<field-name>.parquet to reflect that the index directly corresponds to the expected encoding, with multiple encodings_x.*.parquet files written if/when needed.
- The data uses a RangeIndex (corresponding to the actual integer encoding), and three required columns: x, x_kind, and x_size.
- Users may opt into an additional x_list column to record observed values during the fit stage. This column will not be written by default, because it may cause significant memory pressure to aggregate these values. However, users who are okay with the performance overhead may collect these observed values before they are hashed into buckets (and therefore make reverse mapping possible from the encodings_x.parquet file alone). Another possibility would be to optionally write this information in a separate bucket_encodings_x.parquet file.
Assume the following x column (and the presented modular hash values) for all these examples:
x hash%2 hash%4
0 cat 1 3
1 fish 0 2
2 cat 1 3
3 None 1 3
4 dog 0 2
5 bird 1 1
6 dog 0 2
7 snake 1 1
For simple/direct encoding, the encodings_x.parquet file(s) will reserve the first three indices for "special" encodings (pad, null, and oov):
x x_kind x_size
0 None pad 0
1 None null 1
2 None oov 0
3 dog literal 2
4 cat literal 2
5 fish literal 1
6 bird literal 1
7 snake literal 1
For frequency capping (or table-size capping), the "oov" row will be relabelled as "infreq":
x x_kind x_size
0 None pad 0
1 None null 1
2 None infreq 3
3 dog literal 2
4 cat literal 2
There can be an arbitrary number of infrequent-value buckets:
x x_kind x_size
0 None pad 0
1 None null 1
2 None infreq 1
3 None infreq 2
4 dog literal 2
5 cat literal 2
Users may opt into the construction of an observed-value list column. This will certainly increase the computational cost of Categorify.fit, but may be worthwhile for some use cases. Note that we could also write this information out in a distinct file.
x x_kind x_size x_list(optional)
0 None pad 0 None
1 None null 1 [None]
2 None infreq 1 [fish]
3 None infreq 2 [bird, snake]
4 dog literal 2 [dog]
5 cat literal 2 [cat]
For pure hash-based encoding (num_buckets=4 in this example), all non-null values are hashed into bucket rows:
x x_kind x_size x_list(optional)
0 None pad 0 None
1 None null 1 [None]
2 None bucket 0 []
3 None bucket 2 [bird, snake]
4 None bucket 3 [fish, dog]
5 None bucket 2 [cat]
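For illustration, here is a rough sketch of how the simple/direct table above could be assembled with pandas. This is not the actual Categorify code, and tie-breaking among equal counts may order dog/cat differently; the file name is just the naming convention proposed above.

```python
import pandas as pd

raw = pd.Series(["cat", "fish", "cat", None, "dog", "bird", "dog", "snake"], name="x")

counts = raw.value_counts()  # nulls excluded, sorted by frequency
special = pd.DataFrame(
    {"x": [None, None, None],
     "x_kind": ["pad", "null", "oov"],
     "x_size": [0, int(raw.isna().sum()), 0]}
)
literal = pd.DataFrame(
    {"x": counts.index, "x_kind": "literal", "x_size": counts.values}
)
# The RangeIndex of the concatenated frame is the integer encoding itself.
encodings = pd.concat([special, literal], ignore_index=True)
encodings.to_parquet("encodings_x.parquet")
```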
@rjzamora I think your examples are great and I like your proposal in general! Just have some minor comments/questions.
1) I was thinking we could have a single x column that combines both original values and special ones like (pad, null, oov, bucket, infreq, ...). You propose creating a new x_type column to store the encoding type / special tokens separately. I think it is more organized, and also, if the original values' dtype is not string (e.g. int), storing strings with the special tokens in the same column x as the values could be an issue.
My only concern with adding an x_type column is using a bit more memory when keeping that table in memory for reverse categorify. Maybe we could use None instead of "literal" for the non-special encoded values, to save some space?
2) I don't understand the need for the x_list column. It might contain very large lists for high-cardinality items and a small number of hashing buckets. And it doesn't allow reversing hashed values back to the original ids anyway.
I think we don't need to look for a solution for reversing hashed values back to the original item id, because that should be a rare case in retrieval models that return top-k predicted item ids. Let me explain why.
Typically, we won't use Categorify(num_buckets>0) alone for item id, because that would hash all items into buckets and the model would be predicting top-k buckets instead of top-k items. In general, when we want to reduce the cardinality of item id, we use either max_size or freq_threshold to keep the most popular items "unhashed" and hash the very long tail of items that don't have a bare minimum of interactions (e.g. 5 or 10), so that meaningful embeddings can be trained. As models tend to predict the somewhat frequent items among the top-k, I believe it would be rare for the model to predict ids corresponding to the hashing buckets reserved for infrequent items (because they are infrequent targets during model training). And if that happens, the requester system should be able to know that an item is a hashed infrequent value, to either ignore it or replace it by another item recommendation using some business rule.
Does that make sense?
Good questions @gabrielspmoreira !
My only concern adding x_type column is using a bit more memory when keeping that table in memory for reverse categorify. Maybe we could replace use None instead of "literal" for the non-special encoded values, to save some space?
Internal code should not need to read in the "x_kind" column, as it would only be for users who want to inspect the encoding_x.parquet file. However, you are correct that this would take up more memory when writing the file, and would require a bit more space on disk. I was assuming that we would use a categorical dtype in memory (not strings/objects), and that the column would be dictionary-encoded in parquet (meaning that the file would only store the unique strings ("null", "oov", etc.) and would use a light-weight integer array for the encoding of each element). With that said, it's always less memory/storage to use nan in places where a literal value is not necessary (e.g. "literal").
Regarding x_list: I would be very happy not to support this if it is not needed :)
Makes sense @rjzamora .
Just to explain what I see as a reverse categorify op we might have in Systems (or that our customers might want to build themselves) ...
The reverse categorify op would load the encoding_x.parquet file and keep it in memory in some dataframe or dictionary data structure. Then, when it receives as input some top-k (hundreds or thousands) predicted encoded item ids, that op would just perform a simple lookup in the encoding mapping to obtain the original categorical values, which might be either ints or strings (e.g. product SKUs). If some of the top-k ids are special encoded values (e.g. oov, null, hashed), that op should have some standard token (e.g. "<OOV>", "<NULL>", "<HASHED>") to return to the requester system (as there is no original id to return in such cases). Does that make sense?
If some of the top-k ids are special encoded values (e.g. oov, null, hashed), that op should have some standard token (e.g. "<OOV>", "<NULL>", "<HASHED>") to return to the requester system (as there is no original id to return in such cases). Does that make sense?
Yes, I think this makes sense to me. If you use encoding_x.parquet to directly look up the original value for some encoding, you would get back either a literal value or nan. If the result is a literal value, the reverse-encoding logic would just return that value. If the result is nan, the reverse-encoding logic could return some "special" string value to distinguish between "null" and "oov", "infreq", or "bucket". Is that what you are saying?
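In other words, the reverse lookup could be as simple as the following sketch. This is hypothetical: it assumes the x/x_kind/x_size layout proposed above, and the angle-bracket token format for special rows is just a placeholder.

```python
import pandas as pd

encodings = pd.read_parquet("encodings_x.parquet")

def reverse_lookup(encoded_ids):
    rows = encodings.loc[encoded_ids]
    return [
        # Literal rows return their value; special rows return a placeholder token.
        row["x"] if pd.notna(row["x"]) else f"<{row['x_kind'].upper()}>"
        for _, row in rows.iterrows()
    ]

reverse_lookup([3, 1, 2])  # e.g. a literal value, "<NULL>", "<OOV>"
```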
My only concern adding x_type column is using a bit more memory when keeping that table in memory for reverse categorify.
Just to come back to this for a moment - Including this extra column is definitely not something I really want to do, because it does add approximately one byte per row (assuming we use a categorical dtype). I guess the only reason I proposed it is that I feel even less confident about using special tokens in the x column. In my mind, the most efficient approach is to simply leave out the x_type column, and use null values in the x column for anything that is not a direct/literal encoding.
This shouldn't hinder reverse mapping in any way, because 0 will always be for padding, 1 will always be nulls, and then everything up until the first non-null x row will be for oov/infreq/bucket encoding (the specific meaning depends on the encoding options, but the meaning can also be inferred by the starting index and the existence of non-null values in x).
We don't necessarily have to hold the data for reverse categorify in memory in the same format it's stored on disk (or at all, if we store the majority of it in a data store we're going to query anyway.)
Okay, given the useful comments and information from both @gabrielspmoreira and @karlhigley, I'm thinking the best solution here is to keep the encodings.x.parquet standard as simple as possible. That is:
- Use a RangeIndex to specify the encoded integer values.
- The encodings.x.parquet data must include an x column (where x is the name of the column or column group), and an x_size column:
  - The x column shall use a null value for any category that does not correspond to a direct unique-value mapping.
  - The x_size column should include the observed-value counts collected during the original Categorify.fit operation.
- The 0 encoding will be reserved for padding.
- The 1 encoding will be reserved for null values.
- Indices 2 : 2 + num_buckets will be used for OOV, infrequent values, or hash buckets. Note that the default and minimum value for num_buckets shall be 1.
- Indices 2 + num_buckets and greater will be used for literal unique-value encodings.
As far as I understand, this standard makes the encoding rules and results pretty clear. Embeddings 0 and 1 will always mean the same thing (padding and nulls), and so the value of num_buckets can be easily inferred from the number of null values in the x column (num_buckets = null_count - 2 in all cases). We can also infer the number of unique-value encodings from the number of non-null values in the x column. This all means that a simple Merlin utility could easily use this file to provide a reverse mapping (as long as they don't expect us to specify a list of all possible values observed for a specific "bucket" encoding - where we would probably want the utility to return either "infreq" or "hashed", depending on the value of num_buckets).
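Under this convention, a consumer could infer the layout from the file alone. A minimal sketch, assuming a column named x and the rules listed above:

```python
import pandas as pd

enc = pd.read_parquet("encodings.x.parquet")

num_buckets = int(enc["x"].isna().sum()) - 2   # null rows beyond the pad and null rows
num_unique = int(enc["x"].notna().sum())       # literal unique-value encodings
first_unique_id = 2 + num_buckets              # where the direct encodings start
```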
@rjzamora I am a bit confused by "2 + num_buckets and greater will be used for literal unique-value encodings." Are we making num_buckets a required arg to be set, or is it optional? What will be the default value for it? None? This arg should not be used/set in a native/plain Categorify(); it should be None.
Do I understand correctly?
and what happens to the max_size, freq_threshold args?
Good questions @rnyak - I am suggesting that it should be optional for the user to pass in num_buckets, but that the default value should be 1. However, the other Categorify kwargs/defaults may need to be tweaked a bit to better clarify the fact that setting num_buckets does not mean that we will forgo unique-value collection, but that we will use num_buckets (with hashing when num_buckets>1) to split up any categories that do not satisfy max_size or freq_threshold.
To this end, I suppose max_size and freq_threshold should probably correspond to direct encodings only. This way the user would be required to set something like max_size=0 to ensure that all encodings (besides pad/null) are hash buckets.
Does this make sense?
Just to clarify: when you say num_buckets=1, does that mean we are not gonna do any hashing/bucketing, right? We will do vanilla encoding, meaning every category will be encoded to a unique value?
This way the user would be required to set something like max_size=0 to ensure that all encodings (besides pad/null) are hash buckets.
max_size=0 - that is interesting and a bit confusing. max_size means that the max_size most frequent values for a column will be encoded regularly, and the rest will go to a bucket / be mapped to a single value?
Considering that, if we set max_size=0, what does that mean? Does it mean all categories are mapped to one single value? And what's the point of doing that?
just to clarify. when you say num_buckets=1 does that mean we are not gonna do any hashing/bucketing right? we will do vanilla encoding, meaning every category will be encoded to a unique value?
Short answer is yes.
Longer answer: I'm suggesting that what we are currently calling num_buckets should probably mean something a bit different than it does right now, and that it should only indicate the number of indices we are reserving for "out-of-vocabulary" encodings (which includes "infrequent" values and hash buckets, since both infrequent values and hash buckets are not a part of the reversible vocabulary).
The picture I have in my mind is that there are really only four types of encodings to consider: padding, nulls, "out-of-vocabulary" values (whether unseen, infrequent, or hashed), and literal unique-value encodings.
considering that, if we set max_size =0, what's that mean? does it mean all categories are mapped to one single value? and what's the point of doing that?
Yes, I agree. I don't think anyone would want to set max_size=0 with the default num_buckets=1. I'm suggesting that people who don't want/need reversible encoding for a particular column may want to combine max_size=0 with num_buckets>>1 (to hash everything).
Update: I have started to use https://github.com/NVIDIA-Merlin/NVTabular/pull/1692 as a space to experiment with the ideas discussed here. I ultimately found that the simplest approach is to move away from the idea that we should include more information in "unique.x.parquet". Instead, I now feel that we should simplify that file even further, and include a simple summary of embedding information in a distinct "meta.x.parquet" file.
For example, when using freq_threshold=3 and num_buckets=10 together, we would write something like the following for a hypothetical "Author" column:
Contents of "unique.Author.parquet":
Author Author_size
12 User_B 3
13 User_C 3
Here, we see that only two unique values satisfied the freq_threshold=3 limit in the "Author" column, and since we are reserving index 0 for padding, index 1 for nulls, and indices [2,12) for oov hashing, the RangeIndex offset is 12. Note that this "unique.Author.parquet" file only includes information about in-vocabulary unique values. It does not include information about nulls or indistinct categories. If the user wants more information about null or indistinct categories, they would need to inspect the corresponding "meta.Author.parquet" file.
Contents of "meta.Author.parquet":
kind offset num_indices num_observed
0 pad 0 1 0
1 null 1 1 0
2 oov 2 10 4
3 unique 12 2 6
Here, we find a summary of the high-level encoding metadata. The "kind" column specifies the kind of category that row corresponds to.
The "offset" column specifies the index offset for that kind of category. For example, "pad" will always have an offset of 0, "null" will always have an offset of 1, and "oov" will always have an offset of 2. The "unique"-row offset should always match the RangeIndex offset observed in the "unique.x.parquet" file (which is 12 in this case).
The "num_indices"
column specifies how many encoding indices have been reserved for each kind of category. Note that the "oov" row should always have a "num_indices" value that matches the num_buckets
setting (or 1
), while the value in the "unique" row will match the length of the "unique.x.parquet"
file (2
in this case).
The "num_observed"
column specifies the total number of occurrences observed for each "kind" of category during the Categorify.fit
stage. This means that the sum of all values in the "num_observed"
column must match the total number of rows in the dataset used to perform the fit. This also means that we can use this row to clearly distinguish the null-value count from that of the infrequent categories (which fall into the "oov" category since we do not record them in our "unique.x.parquet"
-file vocabulary).
@gabrielspmoreira @rnyak @karlhigley - Let me know if you think a solution like this makes sense to you. Note that I tried recording all of this information in the same "unique.x.parquet" file, and I'm pretty confident that we want to avoid that solution.
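As a sanity check on how the two files would be consumed together, here is a rough sketch. It is illustrative only: the "<PAD>"/"<NULL>"/"<OOV>" strings are placeholders rather than part of the proposal.

```python
import pandas as pd

uniques = pd.read_parquet("unique.Author.parquet")                 # in-vocabulary values only
meta = pd.read_parquet("meta.Author.parquet").set_index("kind")
unique_offset = int(meta.loc["unique", "offset"])                  # 12 in the example above

def reverse(encoded_id):
    if encoded_id == 0:
        return "<PAD>"
    if encoded_id == 1:
        return "<NULL>"
    if encoded_id < unique_offset:
        return "<OOV>"                                             # one of the oov/hash buckets
    return uniques.loc[encoded_id, "Author"]

[reverse(i) for i in (13, 1, 5)]   # -> ["User_C", "<NULL>", "<OOV>"]
```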
I think this makes sense, but we're definitely going to need to provide functionality elsewhere in Merlin to do an explicit reverse mapping. From experience, I can promise that some user somewhere out there is going to look at the unique values files, try to write their own mapping, ignore the index, and complain to us that Categorify isn't producing the right values because it doesn't line up with the rows in the unique values file (if one assumes the first row is row zero, which it clearly isn't in your example.)
I think we're in a pretty good place now to get the reverse mapping done as a generic Merlin array/tensor operator that can run almost anywhere (except in NVT, at least for now.)
we're definitely going to need to provide functionality elsewhere in Merlin to do an explicit reverse mapping.
Agree - Though the logic would not need to be very complex with this latest design.
From experience, I can promise that some user somewhere out there is going to look at the unique values files, try to write their own mapping, ignore the index, and complain to us that Categorify isn't producing the right values because it doesn't line up with the rows in the unique values file (if one assumes the first row is row zero, which it clearly isn't in your example.)
Yeah, I totally agree that this solution is still likely to confuse some users. With that said, this last design is certainly less confusing than what we have now, and it is much better for the specific user-provided vocab example you are describing (since the unique.*.parquet mapping is not expected to include null or oov information). I don't think it would be too difficult to dramatically improve the docstring for providing a user-provided vocabulary, and make it completely clear that we will always start the unique-value index after the oov/hash buckets.
It sounds like you think it is problematic that we are always reserving 0, 1, and 2+ for specific purposes - even when the user provides a specific unique-value mapping? Would it be more intuitive to start the null and oov indices after the unique-value encodings? I don't have any strong opinions about this at all - Just happened to run with the original suggestion to make the pad/null/oov-encodings consistent for all cases.
Oh, I don't have a problem with any of this or any suggestion for making it better! I'm just saying that we (me, in all likelihood) need to cut users off at the pass by providing reverse mapping functionality before they try to implement reverse mapping themselves again, because I've already seen that go weirdly once or twice. These changes entail more work to do, but it'll happen outside of Categorify itself and is stuff we had reasons to do anyway. We just can't let it languish in the backlog much longer, especially with this overhaul happening.
I'm just saying that we (me, in all likelihood) need to cut users off at the pass by providing reverse mapping functionality before they try to implement reverse mapping themselves again
Yes, I agree that it will be best to provide a simple reverse mapping utility.
Context
Categorify encodes categorical columns into contiguous integer ids. It offers functionalities to deal with high-cardinality features, such as simple hashing and frequency capping / hashing. When Categorify runs, it creates a mapping between original and encoded values.
Problem
There are currently some issues in the encoding of Categorify:
- Mismatches between the unique-value mapping table and the actual encoded ids when using start_index or max_size (#1736). There are more examples of these mismatches in this doc (Nvidia internal).
It is important for Merlin that we make it more straightforward to map the encoded ids back to the original values by using just a mapping table, without the need to be aware of the complex logic inside Categorify to cover the hashing options available. In RecSys, reverse mapping is critical for item id, as that is what models will be predicting (encoded ids) and they need to be presented to the user (original ids).
Proposed solution
This task proposes some simplifications of the Categorify encoding and is based on the discussions from this doc (Nvidia internal). Check it for more details.
Save a consistent mapping table to parquet
Fixing the first 3 values of the encoding table: <PADDING>, <OOV>, <NULL>
- Remove the start_index argument. It was created to allow users to reserve a range of control ids (the first N encoded ids) so that no original value is mapped to them by Categorify. Then, the user could do some post-processing after Categorify to set some values in that range. The only use case we found so far for that was reserving 0 for padding sequence features, which is common for sequential recommendation. So we remove start_index and reserve just the id 0 for padding (or for other purposes the user might use it). That simplifies a lot of the logic within Categorify, as start_index shifts all other values.
- Remove na_sentinel, as nulls will always be mapped to a single id (2).
Hashing / Frequency hashing
- Rename max_size (used for frequency capping and frequency hashing) to top_frequent_values, so that users don't assume that the value will be the maximum cardinality (as that one will also include the special ids and the num_buckets).
Pre-defined vocabulary
- If the user provides the vocabs argument with a pre-defined mapping (from original values to encoded ids), our encoding standard does not apply and we just use that mapping. We should reserve an extra position at the end of their mapping table to assign values that are potentially not found (including nulls) in the vocabs mapping.
Proposed encoding strategy