apache / datafusion-python

Apache DataFusion Python Bindings
https://datafusion.apache.org/python
Apache License 2.0
321 stars 63 forks source link

Weird behaviour with explain in SQL #589

Open Jefffrey opened 4 months ago

Jefffrey commented 4 months ago

Describe the bug

Original issue from datafusion repository: https://github.com/apache/arrow-datafusion/issues/9319

Error when doing explain directly in SQL.

To Reproduce

Using latest available version on Pypi (35.0.0)

>>> import datafusion
>>> ctx = datafusion.SessionContext()
>>> ctx.sql("explain select 1")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
Exception: Internal error: Unsupported logical plan: Explain must be root of the plan.
This was likely caused by a bug in DataFusion's code and we would welcome that you file an bug report in our issue tracker

Expected behavior

The ctx.sql("explain select 1") should ideally succeed, as it works in regular DataFusion:

    let ctx = SessionContext::new();
    let df = ctx.sql("explain select 1").await?;
    df.show().await?;

Will output:

+---------------+--------------------------------------+
| plan_type     | plan                                 |
+---------------+--------------------------------------+
| logical_plan  | Projection: Int64(1)                 |
|               |   EmptyRelation                      |
| physical_plan | ProjectionExec: expr=[1 as Int64(1)] |
|               |   PlaceholderRowExec                 |
|               |                                      |
+---------------+--------------------------------------+

Additional context

Although the more correct way would be to do ctx.sql("select 1").explain(), ctx.sql("explain select 1") itself is technically still valid, so maybe something weird is going on here?

Michael-J-Ward commented 2 weeks ago

ctx.sql(..) returns a PyDataFrame, and then you are observing the python shell call PyDataFrame::__repr__ to print it out, which applies a LIMIT and collects the first 10 rows.

If you were to assign it and then collect, you'd see the proper result.

>>> df = ctx.sql("explain select 1")
>>> df.collect()
[pyarrow.RecordBatch
plan_type: string not null
plan: string not null
----
plan_type: ["logical_plan","physical_plan"]
plan: ["Projection: Int64(1)
  EmptyRelation","ProjectionExec: expr=[1 as Int64(1)]
  PlaceholderRowExec
"]]

And then here's the error reproduced with a little more detail.

>>> df = ctx.sql("explain select 1")
>>> # we can print the (unsupported) logical plan - notice `EXPLAIN` is not the top row of the plan
>>> df.limit(count=10, offset=0).explain()
DataFrame()
+--------------+--------------------------+
| plan_type    | plan                     |
+--------------+--------------------------+
| logical_plan | Limit: skip=0, fetch=10  |
|              |   Explain                |
|              |     Projection: Int64(1) |
|              |       EmptyRelation      |
+--------------+--------------------------+
>>> # but calling collect reproduces the error
>>> df.limit(count=10, offset=0).collect()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
Exception: Internal error: Unsupported logical plan: Explain must be root of the plan.
This was likely caused by a bug in DataFusion's code and we would welcome that you file an bug report in our issue tracker