Eventual-Inc / Daft

Distributed data engine for Python/SQL designed for the cloud, powered by Rust
https://getdaft.io
Apache License 2.0
2.34k stars, 164 forks

[EPIC] Fully Support SQL Syntax used in TPC-H #3254

Open austin362667 opened 1 week ago

austin362667 commented 1 week ago

Is your feature request related to a problem?

The repository at https://github.com/Eventual-Inc/distributed-query-benchmarking currently works only with the DataFrame API. For benchmarking TPC-H and similar workloads, we may want to enable direct SQL support, since the original queries are written in SQL. However, not all SQL capabilities are available yet; for example, selecting from more than one table is not supported[^1]. This epic tracks progress toward closing those gaps.

[^1]: Slack Discussion https://dist-data.slack.com/archives/C052CA6Q9N1/p1729692929866809?thread_ts=1729664580.862499&cid=C052CA6Q9N1
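
For readers unfamiliar with the TPC-H style, here is a minimal sketch of what "selecting from more than one table" looks like in plain SQL. It uses Python's built-in `sqlite3` purely as a stand-in engine, not Daft, and the two tiny tables and their rows are made up for illustration; only the column names echo the TPC-H schema.

```python
import sqlite3

# In-memory SQLite database used only as a stand-in SQL engine (not Daft).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customer (c_custkey INTEGER, c_name TEXT);
    CREATE TABLE orders (o_orderkey INTEGER, o_custkey INTEGER);
    INSERT INTO customer VALUES (1, 'alice'), (2, 'bob');
    INSERT INTO orders VALUES (10, 1), (11, 1), (12, 2);
    """
)

# TPC-H queries typically list several tables in FROM and join them in WHERE.
rows = conn.execute(
    """
    SELECT c_name, COUNT(*) AS order_count
    FROM customer, orders
    WHERE c_custkey = o_custkey
    GROUP BY c_name
    ORDER BY c_name
    """
).fetchall()

print(rows)
```

Being able to run queries in this comma-separated `FROM` form directly would let the benchmark use the original query text instead of a hand-translated DataFrame version.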

Describe the solution you'd like

Fully support the SQL syntax used in the TPC-H queries.

Describe alternatives you've considered

No response

Additional Context

No response

Would you like to implement a fix?

Yes

universalmind303 commented 1 week ago

Hi @austin362667. This is something we have been actively prioritizing. Here's an up-to-date table showing the missing functionality by query:

shape: (10, 3)
┌─────────────────────────────────────────────────────────────────────────────┬─────┬──────────────┐
│ message                                                                     ┆ len ┆ q            │
│ ---                                                                         ┆ --- ┆ ---          │
│ str                                                                         ┆ u32 ┆ list[i64]    │
╞═════════════════════════════════════════════════════════════════════════════╪═════╪══════════════╡
│ Unsupported SQL: 'SUBQUERY'                                                 ┆ 3   ┆ [2, 15, 17]  │
│ Unsupported SQL: 'IN subquery'                                              ┆ 3   ┆ [16, 18, 20] │
│ Unsupported SQL: 'EXISTS'                                                   ┆ 2   ┆ [4, 21]      │
│ Daft error: DaftError::TypeError Expected if_true and if_false arguments    ┆ 1   ┆ [14]         │
│ for if_else to be castable to the same supertype, but received              ┆     ┆              │
│ l_extendedprice#Decimal(precision=38, scale=4) and literal#Int64            ┆     ┆              │
│ Daft error: DaftError::FieldNotFound Column "c_count" not found in schema:  ┆ 1   ┆ [13]         │
│ ["c_custkey", "o_orderkey"]                                                 ┆     ┆              │
│ Daft error: DaftError::TypeError Cannot infer supertypes for subtraction on ┆ 1   ┆ [9]          │
│ types: Decimal(precision=38, scale=4), Decimal(precision=30, scale=4)       ┆     ┆              │
│ result precision: 39 exceed bounds of [1, 38]                               ┆     ┆              │
│ Daft error: DaftError::TypeError Expected if_true and if_false arguments    ┆ 1   ┆ [8]          │
│ for if_else to be castable to the same supertype, but received              ┆     ┆              │
│ volume#Decimal(precision=38, scale=4) and literal#Int64                     ┆     ┆              │
│ Daft error: DaftError::TypeError Cannot infer supertypes for multiply on    ┆ 1   ┆ [1]          │
│ types: Decimal(precision=38, scale=4), Decimal(precision=23, scale=2)       ┆     ┆              │
│ result precision: 61 exceed bounds of [1, 38]                               ┆     ┆              │
│ Unsupported SQL: '`SUBSTRING(expr [FROM start] [FOR len])` syntax'          ┆ 1   ┆ [22]         │
│ Unsupported SQL: 'HAVING'                                                   ┆ 1   ┆ [11]         │
└─────────────────────────────────────────────────────────────────────────────┴─────┴──────────────┘
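
The two "result precision ... exceed bounds of [1, 38]" rows (queries 1 and 9) are consistent with the usual SQL-style rules for decimal result types. The sketch below reproduces the numbers 61 and 39 from the table under that assumption; `DecimalType`, `multiply_result`, `subtract_result`, and `fits` are hypothetical helper names, and the rules shown are an assumption about how the precision is derived, not Daft's confirmed implementation.

```python
from dataclasses import dataclass

MAX_PRECISION = 38  # upper bound quoted in the error messages


@dataclass(frozen=True)
class DecimalType:
    precision: int  # total number of digits
    scale: int      # digits after the decimal point


def multiply_result(a: DecimalType, b: DecimalType) -> DecimalType:
    # Assumed rule: precisions add and scales add.
    return DecimalType(a.precision + b.precision, a.scale + b.scale)


def subtract_result(a: DecimalType, b: DecimalType) -> DecimalType:
    # Assumed rule: keep the larger scale and the larger integer part,
    # plus one extra digit for a possible carry/borrow.
    scale = max(a.scale, b.scale)
    integer_digits = max(a.precision - a.scale, b.precision - b.scale)
    return DecimalType(integer_digits + scale + 1, scale)


def fits(t: DecimalType) -> bool:
    return 1 <= t.precision <= MAX_PRECISION


# Operand types taken from the error messages for queries 1 and 9.
q1 = multiply_result(DecimalType(38, 4), DecimalType(23, 2))
q9 = subtract_result(DecimalType(38, 4), DecimalType(30, 4))

print(q1, fits(q1))  # precision 61, does not fit
print(q9, fits(q9))  # precision 39, does not fit
```

Under these assumed rules, any arithmetic on a `Decimal(precision=38, ...)` operand overflows the bound, which is why a cap or a narrower input precision would be needed for these queries to type-check.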
austin362667 commented 1 week ago

Oh nice! Thanks for providing this table.