Closed reuster986 closed 2 years ago
@reuster986, I've been thinking about how to approach this and it's a bit tough because my opinion changes for different cases.
For the slice/gather case like the reproducer, having the constructor eliminate unused categories feels right. I think we should do this either way.
It's really `.from_codes` which raises concerns with this approach. Because if we're telling the user they can specify the categories and then we go and remove some of them... that seems bad to me? Maybe that's fine, but if we take that approach we should update the documentation to say that categories not present in the Categorical will be removed.
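To make the trade-off concrete, here is a rough sketch of what pruning unused categories in a constructor might involve. It uses plain Python lists rather than arkouda arrays, and `prune_unused` is a hypothetical helper name, not an existing arkouda function.

```python
def prune_unused(codes, categories):
    """Drop categories that no code refers to and renumber the codes.

    Hypothetical helper for illustration only; not part of arkouda.
    """
    used = sorted(set(codes))                           # category indices that occur
    remap = {old: new for new, old in enumerate(used)}  # old code -> compact code
    new_codes = [remap[c] for c in codes]
    new_categories = [categories[c] for c in used]
    return new_codes, new_categories


# A slice that only touches "b" and "c" out of four categories.
codes, categories = prune_unused([1, 2], ["a", "b", "c", "d"])
print(codes, categories)  # [0, 1] ['b', 'c']
```

This also shows the concern with a `.from_codes`-style constructor: the caller passed four categories and gets back an object with two.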
I think the safest thing is the first option of updating methods that assume all categories are present. It's not ideal that we'd have to explicitly find the categories every time one of those methods is called. I don't have a good sense of how much of a performance hit that would be. If it's significant, maybe we add the `.exhaustive` flag and only explicitly calculate the categories when it's false.
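A minimal sketch of how the flag idea could work, again using plain Python lists instead of arkouda arrays. The class `FlaggedCategorical`, its `exhaustive` argument, and `present_categories` are all assumptions made up for this illustration; nothing here is arkouda's actual API.

```python
class FlaggedCategorical:
    """Toy categorical with a flag saying whether every category occurs."""

    def __init__(self, codes, categories, exhaustive=False):
        self.codes = list(codes)            # integer indices into categories
        self.categories = list(categories)  # may contain unused labels
        self.exhaustive = exhaustive        # True only if all categories occur

    def present_categories(self):
        # Trust the category list when the flag is set; otherwise pay the
        # cost of scanning the codes to see which categories really occur.
        if self.exhaustive:
            return set(self.categories)
        return {self.categories[c] for c in set(self.codes)}

    def in1d(self, test):
        present = test.present_categories()
        keep = [label in present for label in self.categories]
        return [keep[code] for code in self.codes]


full = FlaggedCategorical([0, 1, 2, 3], ["a", "b", "c", "d"], exhaustive=True)
part = FlaggedCategorical([1, 2], ["a", "b", "c", "d"])  # slice-like, not exhaustive
print(full.in1d(part))  # [False, True, True, False]
```

The scan only happens for objects whose flag is false, so the performance hit is limited to categoricals built from a slice/gather or from explicit codes.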
`ak.in1d(a, b)` gives the wrong answer when `a` and `b` are categoricals and `b` was constructed as a slice/gather. Reproducer:
This assertion fails because `test` is all `True`. The reason is that the `Categorical.in1d` method was written with the assumption that every member of `.categories` is present in a Categorical array, but this is not the case when a Categorical is constructed from a slice/gather from another Categorical, or using the `.from_codes` constructor.
Specifically, when `cat12` is created from the slice `cat[1:3]`, it inherits the full `cat.categories`, even though only two of them are present in the array. Then, when execution reaches the code block below, `test`/`cat12` has the same categories as `self`/`cat`, so the code infers that all the elements of `self`/`cat` are in `test`/`cat12`.
https://github.com/Bears-R-Us/arkouda/blob/20906e9fdc9918e6e7533176d1b3b48caf9a0754/arkouda/categorical.py#L453-L457
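The failure mode described above can be modeled without a running arkouda server. The sketch below is a toy stand-in built on plain Python lists: `ToyCategorical`, `in1d_by_category`, and `in1d_by_value` are hypothetical names, and the category-level lookup is an assumption about the general shape of the logic, not a copy of the linked lines.

```python
class ToyCategorical:
    """Toy stand-in for a categorical array; not arkouda's Categorical."""

    def __init__(self, codes, categories):
        self.codes = list(codes)            # integer indices into categories
        self.categories = list(categories)  # may contain unused labels

    def __getitem__(self, s):
        # Slicing keeps the full category list, as described in the issue.
        return ToyCategorical(self.codes[s], self.categories)

    def values(self):
        return [self.categories[c] for c in self.codes]

    def in1d_by_category(self, test):
        # Assumes every label in test.categories occurs in test.
        keep = [label in test.categories for label in self.categories]
        return [keep[code] for code in self.codes]

    def in1d_by_value(self, test):
        # Reference answer: compare against labels that actually occur.
        present = set(test.values())
        return [v in present for v in self.values()]


cat = ToyCategorical([0, 1, 2, 3], ["a", "b", "c", "d"])
cat12 = cat[1:3]  # holds only "b" and "c" but keeps all four categories

print(cat.in1d_by_category(cat12))  # [True, True, True, True]
print(cat.in1d_by_value(cat12))     # [False, True, True, False]
```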
Two possible solutions that aren't mutually exclusive:
1. Find methods like `.in1d` that assume all categories are present in the array and fix them to explicitly find the set of categories
2. Add a flag like `.exhaustive` that tells whether all categories are actually present

@pierce314159 and @glitch, I'm interested in your thoughts on how we should approach this.