apache / datafusion-python

Apache DataFusion Python Bindings
https://datafusion.apache.org/python
Apache License 2.0
321 stars 64 forks source link

Regression in `substr` performance from 37.1.0 to 38.0.0 #712

Closed timsaucer closed 1 month ago

timsaucer commented 1 month ago

Describe the bug

In 37.1.0 and earlier the substr function in python called the built in scalar function Substr, which allowed for being called with either two or three arguments (string, start position, length). In 38.0.0 this now uses a macro that generates a single form that only takes two arguments and the user needs to call substring if they want to specify length.

To Reproduce

import datafusion
from datafusion import lit, functions as F
ctx = datafusion.SessionContext()
df = ctx.from_pydict({"a": [1]})
df.select(F.substr(lit("test text"), lit(0), lit(3)).alias("subst")).show()

In 37.1.0 produces:

DataFrame()
+-------+
| subst |
+-------+
| te    |
+-------+

In 38.0.0 produces:

Traceback (most recent call last):
  File "/Users/tsaucer/src/personal/datafusion-python/examples/test_substr.py", line 22, in <module>
    df.select(F.substr(lit("test text"), lit(0), lit(3)).alias("subst")).show()
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: substr() takes 2 positional arguments but 3 were given

Expected behavior

No change in performance, but it is probably better to have users update to calling substring. Currently this function is not exposed, but it is included as part of PR https://github.com/apache/datafusion-python/pull/711

Additional context

This two line commit will resolve the issue:

https://github.com/apache/datafusion-python/pull/711/commits/259f79cc3619047bd4150ca26cd30be1feabace1