apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0
14.48k stars 3.52k forks source link

Relax pyarrow.compute.is_in type requirement #34982

Open rohanjain101 opened 1 year ago

rohanjain101 commented 1 year ago

Describe the enhancement requested

In the following example:

>>> a = pa.array([1,2,3], type=pa.int8())
>>> pa.compute.is_in(a, pa.array([255], type=pa.int64()))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\ps_0310_is\lib\site-packages\pyarrow\compute.py", line 256, in wrapper
    return func.call(args, options, memory_pool)
  File "pyarrow\_compute.pyx", line 355, in pyarrow._compute.Function.call
  File "pyarrow\error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow\error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Integer value 255 not in range: -128 to 127
>>>

It would be nice if this could return false instead of raising an error, given that its impossible for 255 to be in an int8 array. This would be more in line with isin in Pandas as well:

>>> a = pd.Series([1,2,3], dtype="int8")
>>> a.isin([255])
0    False
1    False
2    False
dtype: bool
>>>

Component(s)

Python

westonpace commented 1 year ago

It's a good idea but there isn't a good spot for this kind of optimization today so it'll be challenging to add. It might be possible to do this with implicit casting. Type coercion (implicit casting) happens today when expressions move from unbound (not tied to any input schema) to bound (tied to a specific input schema).

This type coercion is enabled per-function so maybe SetLookupFunction could have a custom DispatchBest implementation that first attempts to dispatch exact and, if that fails, changes to a custom "always return false" function :shrug:

Either way, we don't have much precedent for this kind of thing and I think one could argue that this should be solved higher up than Arrow compute in some kind of expression rewrite pass (this doesn't exist today either).