dask-contrib / dask-sql

Distributed SQL Engine in Python using Dask
https://dask-sql.readthedocs.io/
MIT License
397 stars 72 forks source link

[BUG] Case sensitive=True config doesn't work as intended #1092

Open ayushdg opened 1 year ago

ayushdg commented 1 year ago

What happened: When setting sql.identifier.case_sensitive=True dask-sql still ends up converting identifiers to lowercase during the planning stage.

What you expected to happen: Case sensitivity being honored when set to True which is the default. This needs https://github.com/apache/arrow-datafusion/issues/5626 to go in after which the planner can be instantiated with the proper enable_ident_normalization value. https://github.com/dask-contrib/dask-sql/blob/11fda749d4dc4c729350a1a1e87c0ea631f11bce/dask_planner/src/sql.rs#L548 Minimal Complete Verifiable Example:

df = pd.DataFrame({"id": [0, 1], "VAL": [1, 2]})
c.create_table("test", df)
c.sql("select ID from test") # works even though ID should be treated differently to `id`
c.sql("select val from test") # fails since the schema has `VAL` but the planner is looking for `val`.
ayushdg commented 1 year ago

Another instance of this reported by @goodwanghan in https://github.com/dask-contrib/dask-sql/issues/1108#issue-1653305345

import pandas as pd
import dask.dataframe as dd
from dask_sql import Context

df = pd.read_parquet("https://d37ci6vzurychx.cloudfront.net/trip-data/green_tripdata_2023-01.parquet")

ctx = Context()
res = ctx.sql(
    "SELECT PULocationID, COUNT(*) AS ct FROM df GROUP BY PULocationID ORDER BY ct DESC LIMIT 5",
    dataframes={"df":dd.from_pandas(df,npartitions=2)},
    config_options={"sql.identifier.case_sensitive":True}
)