apache / datafusion-ballista

Apache DataFusion Ballista Distributed Query Engine
https://datafusion.apache.org/ballista
Apache License 2.0
1.56k stars 197 forks source link

PyBallista - Python SQL client for Ballista #970

Closed andygrove closed 10 months ago

andygrove commented 10 months ago

Which issue does this PR close?

N/A

Rationale for this change

The Python bindings in https://github.com/apache/arrow-ballista-python were created by cloning the DataFusion Python bindings and then making some modifications. This project has been unmaintained for around one year now.

This PR adds new Python bindings which depend on the datafusion-python project rather than copying all of the code. I propose that we archive the old repo.

This project will be versioned and released independently from the main project and is not part of the default Cargo workspace, so will not get in the way of Rust development work.

>>> from pyballista import SessionContext
>>> ctx = SessionContext("localhost", 50050)
>>> ctx.sql("create external table t stored as parquet location '/mnt/bigdata/tpch/sf10-parquet/lineitem.parquet'")
>>> df = ctx.sql("select * from t limit 5")
>>> df.collect()

Output:

[pyarrow.RecordBatch
l_orderkey: int64 not null
l_partkey: int64 not null
l_suppkey: int64 not null
l_linenumber: int32 not null
l_quantity: decimal128(11, 2) not null
l_extendedprice: decimal128(11, 2) not null
l_discount: decimal128(11, 2) not null
l_tax: decimal128(11, 2) not null
l_returnflag: string not null
l_linestatus: string not null
l_shipdate: date32[day] not null
l_commitdate: date32[day] not null
l_receiptdate: date32[day] not null
l_shipinstruct: string not null
l_shipmode: string not null
l_comment: string not null
_ignore: string not null
----
l_orderkey: [17500001,17500001,17500001,17500001,17500001]
l_partkey: [1661786,1635206,907114,1849041,820472]
l_suppkey: [36835,60223,32124,24096,20473]
l_linenumber: [2,3,4,5,6]
l_quantity: [50.00,26.00,25.00,14.00,33.00]
l_extendedprice: [87385.00,29669.12,28026.75,13859.30,45950.19]
l_discount: [0.09,0.09,0.08,0.08,0.06]
l_tax: [0.00,0.04,0.04,0.04,0.07]
l_returnflag: ["N","N","N","N","N"]
l_linestatus: ["O","O","O","O","O"]
...]

What changes are included in this PR?

New pyballista folder containing the new Python bindings.

Are there any user-facing changes?

No

honeyAndSw commented 6 months ago

Hi @andygrove, after this change, the datafusion functions submodule seems not included in PyBallista.

If I try to find functions within PyBallista

from pyballista import SessionContext, functions as f

filename = "sample.parquet"
ctx = SessionContext("localhost", 50050)

df = ctx.read_parquet(path=filename)
uri = df.select(f.col("stream_id"))
df.limit(10).show()

I'm getting the following error:

ImportError: cannot import name 'functions' from 'pyballista' (/Users/.../datafusion-ballista/python/pyballista/__init__.py)

Or if I change my imports to:

from pyballista import SessionContext
from datafusion import functions as f

Then it says:

  File "examples/test.py", line 11, in <module>
    uri = df.select(f.col("stream_id"))
TypeError: argument 'args': 'Expr' object cannot be converted to 'Expr'