apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0
5.67k stars 1.06k forks source link

suggest add synax `select from generate_series()` #10069

Open l1t1 opened 3 months ago

l1t1 commented 3 months ago

Is your feature request related to a problem or challenge?

it is complex and slowly to use unnest().

Describe the solution you'd like

it can run select from generate_series() as duckdb does.

Describe alternatives you've considered

No response

Additional context

D:\duckdb>datafusion-cli
DataFusion CLI v37.0.0

> select sum(a) from unnest(generate_series(1,1000000)) as t(a);
+--------------+
| SUM(t.a)     |
+--------------+
| 500000500000 |
+--------------+
1 row(s) fetched.
Elapsed 3.648 seconds.
> select sum(a) from generate_series(1,1000000) as t(a);
Error during planning: table function 'generate_series' not found
>
D:\duckdb>duckdb011
v0.10.1 4a89d97db8
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.

D .timer on
D select sum(a) from unnest(generate_series(1,1000000)) as t(a);
┌──────────────┐
│    sum(a)    │
│    int128    │
├──────────────┤
│ 500000500000 │
└──────────────┘
Run Time (s): real 0.854 user 0.859375 sys 0.078125
D select sum(a) from generate_series(1,1000000) as t(a);
┌──────────────┐
│    sum(a)    │
│    int128    │
├──────────────┤
│ 500000500000 │
└──────────────┘
Run Time (s): real 0.033 user 0.015625 sys 0.000000
alamb commented 3 months ago

I think we could add this as a user defined table function in datafusion-cli quite easily

For example we could follow the model of parquet_metadata (which also follows duckdb):

SELECT path_in_schema, row_group_id, row_group_num_rows, stats_min, stats_max, total_compressed_size
FROM parquet_metadata('hits.parquet')
WHERE path_in_schema = '"WatchID"'
LIMIT 3;

+----------------+--------------+--------------------+---------------------+---------------------+-----------------------+
| path_in_schema | row_group_id | row_group_num_rows | stats_min           | stats_max           | total_compressed_size |
+----------------+--------------+--------------------+---------------------+---------------------+-----------------------+
| "WatchID"      | 0            | 450560             | 4611687214012840539 | 9223369186199968220 | 3883759               |
| "WatchID"      | 1            | 612174             | 4611689135232456464 | 9223371478009085789 | 5176803               |
| "WatchID"      | 2            | 344064             | 4611692774829951781 | 9223363791697310021 | 3031680               |
+----------------+--------------+--------------------+---------------------+---------------------+-----------------------+
3 rows in set. Query took 0.053 seconds.

It is documented here https://arrow.apache.org/datafusion/user-guide/cli.html

The code for it is here; https://github.com/apache/arrow-datafusion/blob/637293580db0634a4efbd3f52e4700992ee3080d/datafusion-cli/src/functions.rs#L215-L442

Lordworms commented 3 months ago

related https://github.com/apache/arrow-datafusion/issues/9354