Bears-R-Us / arkouda

Arkouda (αρκούδα): Interactive Data Analytics at Supercomputing Scale :bear:
Other
236 stars 87 forks source link

Logic error in `Categorical.in1d` and other methods #990

Closed reuster986 closed 2 years ago

reuster986 commented 2 years ago

ak.in1d(a, b) gives the wrong answer when a and b are categoricals and b was constructed as a slice/gather. Reproducer:

s = ak.array([str(i) for i in range(10)])
s12 = s[1:3]
answer = ak.in1d(s, s12)
cat = ak.Categorical(s)
cat12 = cat[1:3]
res = ak.in1d(cat, cat12)
assert (answer == res).all()

This assertion fails because test is all True. The reason is because the Categorical.in1d method was written with the assumption that every member of .categories is present in a Categorical array, but this is not the case when a Categorical is constructed from a slice/gather from another Categorical, or using the .from_codes constructor.

Specificially, when cat12 is created from the slice cat[1:3], it inherits the full cat.categories, even though only two of them are present in the array. Then, when execution reaches the code block below, test/cat12 has the same categories as self/cat, so the code infers that all the elements of self/cat are in test/cat12.

https://github.com/Bears-R-Us/arkouda/blob/20906e9fdc9918e6e7533176d1b3b48caf9a0754/arkouda/categorical.py#L453-L457

Two possible solutions that aren't mutually exclusive:

@pierce314159 and @glitch , I'm interested in your thoughts on how we should approach this.

stress-tess commented 2 years ago

@reuster986, I've been thinking about how to approach this and it's a bit tough because my opinion changes for different cases.

For slice/gather case like the reproducer, the constructor eliminating unused categories feels right. I think we should do this either way

It's really .from_codes which raises concerns with this approach. Because if we're telling the user they can specify the categories and then we go and remove some of them... that seems bad to me? Maybe that's fine, but if we take that approach we should update documentation to say that categories not present in the Categorical will be removed

I think the safest thing is the first option of updating methods that assume all categories are present. It's not ideal that we'd have to do explicitly find the categories every time one of those methods is called. I've not got a good sense of how much of a performance hit that would be. If it's significant, maybe we do the .exhaustive flag and only explicitly calculate the categories when it's false