Closed bogdan-radu-nechita closed 1 year ago
Starting from release 23.06, the Categorify op is designed to encode the categories of a given column as follows:

- `0` is reserved for padding (you won't see 0s in your encoded dataset).
- `1` is for nulls (if your dataset has nulls, you will see them encoded as `1`).
- `2` is for out-of-vocabulary values (if your validation or test set has out-of-vocabulary categories, they will be encoded as `2`).
- `3` and up are for the actual categories: the most frequent category in a categorical column is encoded as `3`, the second most frequent as `4`, and so on.

Hope that answers your question.
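To make the scheme concrete, here is a minimal pure-Python sketch of the encoding described above. It is not NVTabular's actual implementation: `fit_vocab` and `encode_column` are hypothetical helpers, and the tie-breaking rule for equally frequent categories is an assumption made only to keep the example deterministic.

```python
from collections import Counter

# Reserved ids, as described for release 23.06.
PAD_ID, NULL_ID, OOV_ID = 0, 1, 2
FIRST_CATEGORY_ID = 3


def fit_vocab(train_values):
    """Map each non-null category to an id, most frequent first, starting at 3."""
    counts = Counter(v for v in train_values if v is not None)
    # Descending frequency; ties broken by value (an assumption, for determinism).
    ordered = sorted(counts, key=lambda v: (-counts[v], str(v)))
    return {v: i + FIRST_CATEGORY_ID for i, v in enumerate(ordered)}


def encode_column(values, vocab):
    """Encode values: nulls become 1, unseen categories become 2."""
    return [NULL_ID if v is None else vocab.get(v, OOV_ID) for v in values]


train = ["a", "b", "a", "c", "a", "b", None]
vocab = fit_vocab(train)

encoded_train = encode_column(train, vocab)
encoded_valid = encode_column(["b", "z", None], vocab)  # "z" was never seen in training

# Three unique categories plus the three reserved ids.
cardinality = len(vocab) + FIRST_CATEGORY_ID

print(vocab)
print(encoded_train)
print(encoded_valid)
print(cardinality)
```

Under this scheme the cardinality exceeds the number of unique categories by 3, which is consistent with the constant offset and the 3 extra embeddings reported in the question.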
My sincerest apologies... I was looking at outdated docs and code. Thank you for your time and again, please forgive my silly mistake!
I'm fairly new to the field, so thank you in advance for your patience and time!
When using Categorify with a merlin tensorflow container, something within that function causes (in my opinion) erroneous metadata to be saved to the schema of the dataset. More specifically, the reported properties.embedding_sizes.cardinality for the columns is off by a constant. This causes the processed columns to start with the number 3 instead of 0 or 1, and MatrixFactorizationModelV2 to generate an extra set of 3 embeddings. Is this the expected behavior?
Containers used for testing this and the constants:
As environments:
I'll be more than happy to provide whatever information you need. I tried digging into it myself, but I am still green enough to not be able to figure it out.
Merlin-tensorflow 23.06
Merlin-tensorflow 23.04/23.02/22.12
Thank you again for your time and patience!