Open austin362667 opened 1 week ago
hi @austin362667. This is something that we have been actively prioritizing. Here's an up-to-date table demonstrating missing functionality by query:
shape: (10, 3)
┌─────────────────────────────────────────────────────────────────────────────┬─────┬──────────────┐
│ message ┆ len ┆ q │
│ --- ┆ --- ┆ --- │
│ str ┆ u32 ┆ list[i64] │
╞═════════════════════════════════════════════════════════════════════════════╪═════╪══════════════╡
│ Unsupported SQL: 'SUBQUERY' ┆ 3 ┆ [2, 15, 17] │
│ Unsupported SQL: 'IN subquery' ┆ 3 ┆ [16, 18, 20] │
│ Unsupported SQL: 'EXISTS' ┆ 2 ┆ [4, 21] │
│ Daft error: DaftError::TypeError Expected if_true and if_false arguments ┆ 1 ┆ [14] │
│ for if_else to be castable to the same supertype, but received ┆ ┆ │
│ l_extendedprice#Decimal(precision=38, scale=4) and literal#Int64 ┆ ┆ │
│ Daft error: DaftError::FieldNotFound Column "c_count" not found in schema: ┆ 1 ┆ [13] │
│ ["c_custkey", "o_orderkey"] ┆ ┆ │
│ Daft error: DaftError::TypeError Cannot infer supertypes for subtraction on ┆ 1 ┆ [9] │
│ types: Decimal(precision=38, scale=4), Decimal(precision=30, scale=4) ┆ ┆ │
│ result precision: 39 exceed bounds of [1, 38] ┆ ┆ │
│ Daft error: DaftError::TypeError Expected if_true and if_false arguments ┆ 1 ┆ [8] │
│ for if_else to be castable to the same supertype, but received ┆ ┆ │
│ volume#Decimal(precision=38, scale=4) and literal#Int64 ┆ ┆ │
│ Daft error: DaftError::TypeError Cannot infer supertypes for multiply on ┆ 1 ┆ [1] │
│ types: Decimal(precision=38, scale=4), Decimal(precision=23, scale=2) ┆ ┆ │
│ result precision: 61 exceed bounds of [1, 38] ┆ ┆ │
│ Unsupported SQL: '`SUBSTRING(expr [FROM start] [FOR len])` syntax' ┆ 1 ┆ [22] │
│ Unsupported SQL: 'HAVING' ┆ 1 ┆ [11] │
└─────────────────────────────────────────────────────────────────────────────┴─────┴──────────────┘
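For what it's worth, the two precision errors (q1 and q9) line up with the usual SQL-style decimal rules: multiplication adds the precisions, and addition/subtraction takes the widest integer part plus the widest scale plus one carry digit. Below is a minimal sketch under that assumption. The rules and the function names are illustrative and have not been checked against Daft's actual implementation; only the input types and the 38-digit bound come from the error messages in the table.

```python
# Assumed SQL-style decimal result-type rules (illustrative, not Daft's code).
MAX_PRECISION = 38  # upper bound quoted in the error messages


def multiply_result(p1: int, s1: int, p2: int, s2: int) -> tuple[int, int]:
    """Assumed rule for multiplication: precisions add, scales add."""
    return p1 + p2, s1 + s2


def add_sub_result(p1: int, s1: int, p2: int, s2: int) -> tuple[int, int]:
    """Assumed rule for +/-: widest integer part + widest scale + 1 carry digit."""
    scale = max(s1, s2)
    return max(p1 - s1, p2 - s2) + scale + 1, scale


def fits(precision: int) -> bool:
    """True when the precision is within the [1, 38] bounds from the errors."""
    return 1 <= precision <= MAX_PRECISION


# q1: Decimal(38, 4) * Decimal(23, 2)
q1_precision, q1_scale = multiply_result(38, 4, 23, 2)
print(q1_precision, q1_scale, fits(q1_precision))  # 61 6 False

# q9: Decimal(38, 4) - Decimal(30, 4)
q9_precision, q9_scale = add_sub_result(38, 4, 30, 4)
print(q9_precision, q9_scale, fits(q9_precision))  # 39 4 False
```

Under these assumed rules the computed precisions (61 and 39) match the "result precision" values reported for q1 and q9, which suggests both failures come from the inputs already sitting at or near the maximum precision rather than from the queries themselves.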
oh nice! Thanks for providing this table
Is your feature request related to a problem?
The repository at https://github.com/Eventual-Inc/distributed-query-benchmarking currently works only with the DataFrame API. For benchmarking TPC-H and similar workloads, we may want to enable direct SQL support, as the original queries are written in SQL. However, not all SQL capabilities are supported yet, such as selecting from more than one table[^1]. This epic issue tracks the progress.
[^1]: Slack Discussion https://dist-data.slack.com/archives/C052CA6Q9N1/p1729692929866809?thread_ts=1729664580.862499&cid=C052CA6Q9N1
Describe the solution you'd like
To fully support SQL syntax in TPC-H, the following need to be addressed:
- fromstr for Interval; this might need a refactor of the shared errors: https://github.com/Eventual-Inc/Daft/pull/3146#discussion_r1821955549
- daft.exceptions.InvalidSQLException: Unsupported SQL: 'Only exactly one table is supported'
Describe alternatives you've considered
No response
Additional Context
No response
Would you like to implement a fix?
Yes