apache / datafusion-python

Apache DataFusion Python Bindings
https://datafusion.apache.org/python
Apache License 2.0

tsaucer/run TPC-H examples in CI #711

Closed timsaucer closed 1 month ago

timsaucer commented 1 month ago

Which issue does this PR close?

Closes #696 Closes #712

Rationale for this change

This PR sets up a workflow that generates the TPC-H 1 GB data set in CI, runs the 22 examples, and compares their results to the known answer file. Adding this check improves the robustness of our test suite.
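As a rough illustration of the comparison step, here is a minimal sketch in plain Python. It is not the code from this PR: the function names (`parse_answer_text`, `cells_match`, `results_match`), the pipe-delimited layout with a header row, and the relative tolerance are all assumptions made for the example, and the sample values are made up.

```python
import math


def parse_answer_text(text: str) -> list[list[str]]:
    """Parse pipe-delimited answer text; the first non-empty line is treated as a header.

    The layout is an assumption for this sketch, not the format used by the PR.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    return [[cell.strip() for cell in ln.split("|")] for ln in lines[1:]]


def cells_match(actual, expected: str, rel_tol: float = 1e-6) -> bool:
    """Compare numerically when both sides parse as floats, otherwise as strings."""
    try:
        return math.isclose(float(actual), float(expected), rel_tol=rel_tol)
    except (TypeError, ValueError):
        return str(actual).strip() == expected


def results_match(actual_rows, expected_rows, rel_tol: float = 1e-6) -> bool:
    """Return True only if both tables have the same shape and every cell matches."""
    if len(actual_rows) != len(expected_rows):
        return False
    for actual, expected in zip(actual_rows, expected_rows):
        if len(actual) != len(expected):
            return False
        if not all(cells_match(a, e, rel_tol) for a, e in zip(actual, expected)):
            return False
    return True


# Hypothetical answer text and query output (made-up values).
expected = parse_answer_text("flag|total\nA|100.50\nN|200.25\n")
print(results_match([["A", 100.5], ["N", 200.25]], expected))  # True
print(results_match([["A", 101.5], ["N", 200.25]], expected))  # False
```

Comparing numeric cells with a tolerance rather than by exact string equality avoids spurious failures from formatting differences such as trailing zeros.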

What changes are included in this PR?

This PR adds the following changes:

Are there any user-facing changes?

The substring function is now exposed in Python.

Additional context

This PR replaces https://github.com/apache/datafusion-python/pull/710, which contains a lot of intermediate testing steps. This PR should be cleaner to review.

timsaucer commented 1 month ago

@Michael-J-Ward It looks like we have a potential regression between 37.1.0 and 38.0.0. On 37.1.0, substr accepted both a start and a length, which are the parameters that should apply to substring. I was using substr incorrectly, but it worked by accident on 37.1.0 because substr and substring called the same underlying function. In 38.0.0 this was changed, and substr now fails if you pass it two parameters.

This PR also exposes substring, which was missing on the Python side.
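To make the substr versus substring distinction concrete, here is a small plain-Python model of the two call shapes. It is only an illustration of the usual 1-based SQL semantics, not the binding's implementation, and the function bodies are assumptions written for this example.

```python
def substr(text: str, start: int) -> str:
    """Return everything from the 1-based position `start` to the end of `text`."""
    return text[max(start - 1, 0):]


def substring(text: str, start: int, length: int) -> str:
    """Return `length` characters of `text` beginning at the 1-based position `start`."""
    if length < 0:
        raise ValueError("length must be non-negative")
    begin = start - 1
    end = begin + length
    # Clamp to zero so a start position before the string does not wrap around.
    return text[max(begin, 0):max(end, 0)]


print(substr("datafusion", 5))        # fusion
print(substring("datafusion", 1, 4))  # data

# Passing a start and a length to the two-argument form fails,
# which mirrors the behaviour change described above.
try:
    substr("datafusion", 1, 4)
except TypeError as err:
    print("substr rejected the extra argument:", type(err).__name__)
```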