Open radekosmulski opened 1 year ago
If I understand the example, the behavior of `Categorify` doesn't seem to be the issue. If the workflow stopped after the `Categorify`, categorical would be the correct label (since the output of `Categorify` is always categorical by definition).
But since the workflow continues after that, the subsequent operators need to adjust the schemas of the columns they operate on in order to reflect the transforms they implement. It looks like `Groupby` isn't correctly applying `Tags.CONTINUOUS` when the aggregation is a count.
@jperez999, could you point someone who isn't you (@radekosmulski, @oliverholworthy, @nv-alaiacano, or @edknv maybe, if they're willing) at how to fix this in the `Groupby` op?
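To make the suggestion above concrete, here is a minimal, self-contained sketch of an aggregation step that retags its output column when the aggregation is a count. It is not the library's implementation: `Tag`, `ColSchema`, `groupby_output_schema`, and the `<column>_<agg>` output name are all hypothetical stand-ins chosen for illustration, and whether a count should be continuous at all is still an open question in this thread.

```python
from dataclasses import dataclass
from enum import Enum


class Tag(Enum):
    # Hypothetical stand-in for the library's Tags enum.
    CATEGORICAL = "categorical"
    CONTINUOUS = "continuous"


@dataclass(frozen=True)
class ColSchema:
    # Hypothetical, minimal column schema: a name plus a set of tags.
    name: str
    tags: frozenset = frozenset()


def groupby_output_schema(col: ColSchema, agg: str) -> ColSchema:
    """Toy schema rule for a column produced by an aggregation."""
    # Assumed naming convention for the output column.
    name = f"{col.name}_{agg}"
    if agg == "count":
        # A count no longer holds category ids, so drop the inherited
        # CATEGORICAL tag and mark the result as continuous.
        tags = (col.tags - {Tag.CATEGORICAL}) | {Tag.CONTINUOUS}
    else:
        # Other aggregations keep the input tags in this toy example.
        tags = col.tags
    return ColSchema(name, frozenset(tags))


item = ColSchema("item_id", frozenset({Tag.CATEGORICAL}))
counted = groupby_output_schema(item, "count")
print(counted.name, sorted(t.value for t in counted.tags))
```

The point of the sketch is only that the retagging lives inside the aggregation step itself, so the stale categorical tag from the earlier step does not leak through.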
@karlhigley @radekosmulski tagging the `count` column as continuous might be problematic as well, since an integer value (a count is an int) can be seen as a categorical value by a user. It does not have to be continuous. Therefore, I'd say the Groupby op should tag the `count` agg as neither continuous nor categorical.
If a count column is normalized afterwards, it will take the continuous tag; if it is going to be bucketed or hashed via the Categorify op, it can take the categorical tag.
I think you're right about how things should ultimately be tagged, but I want to re-iterate that it's each op's responsibility to apply the appropriate tags based only on the transformation that op applies (not future or previous ops, who can take care of that for themselves.)
If the result of a `count` aggregation is continuous, then `Groupby` should tag it as such. If subsequent operations like `Normalize` or hashing/bucketing operations change whether it's continuous, categorical, or an embedding, then those ops should apply whatever tag becomes appropriate after their transformations.
We don't want to be guessing what ops a user might apply subsequently in order to apply tags; we should just apply the currently correct tags and rely on subsequent ops to change them as needed.
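As a rough illustration of that principle, the sketch below models each op as a function from incoming tags to outgoing tags that looks only at its own transformation. Everything here is hypothetical (the `Tag` enum, the four op functions, and `run` are toy stand-ins, not the library's operators), and the choice to tag a count as continuous is just one of the options discussed in this thread.

```python
from enum import Enum


class Tag(Enum):
    # Hypothetical stand-in for the library's Tags enum.
    CATEGORICAL = "categorical"
    CONTINUOUS = "continuous"


# Each toy op decides its output tags from its own transformation only.
def categorify(tags):
    return (tags - {Tag.CONTINUOUS}) | {Tag.CATEGORICAL}


def groupby_count(tags):
    # Assumption for this sketch: a count is treated as continuous.
    return (tags - {Tag.CATEGORICAL}) | {Tag.CONTINUOUS}


def normalize(tags):
    return (tags - {Tag.CATEGORICAL}) | {Tag.CONTINUOUS}


def bucketize(tags):
    return (tags - {Tag.CONTINUOUS}) | {Tag.CATEGORICAL}


def run(ops, tags=frozenset()):
    """Apply ops in order and record the tags after each step."""
    history = []
    for op in ops:
        tags = frozenset(op(tags))
        history.append((op.__name__, sorted(t.value for t in tags)))
    return history


history = run([categorify, groupby_count, bucketize])
for step in history:
    print(step)
```

No op needs to know what comes before or after it: the count step overwrites the categorical tag it inherited, and the later bucketing step overwrites the continuous tag in turn.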
@karlhigley I agree. What I wanted to ask is: should `count` be considered continuous or categorical? I say it could be both :)
I don't know either 😄
Describe the bug
The output of the `Categorify` operator is tagged as `Tags.CATEGORICAL` even if those might just be counts.

Steps/Code to reproduce bug
Here is a full reproducer:

Expected behavior
`count` columns are not tagged as `Tags.CATEGORICAL`.

Environment details (please complete the following information):
current main