eakmanrq / sqlframe

Turning PySpark Into a Universal DataFrame API
https://sqlframe.readthedocs.io/en/stable/
MIT License

Cannot execute the generated BQ SQL from the example #117

Closed tvalentyn closed 5 days ago

tvalentyn commented 1 week ago

I tried out the example on: https://github.com/eakmanrq/sqlframe using the latest installation of sqlframe[bigquery], generated a SQL statement with print(df.sql()), and then tried to feed the output to bq cli client.

bq query --use_legacy_sql=False "$(cat sqlframe_output.sql)"

I am getting this error:

Error in query string: Error processing job
'<myproject>:bqjob_r3a33e90d7c450233_000001907092c8d6_1': Unrecognized name:
percent_change; Did you mean percent change? at [35:7]

If I tweak the DataFrame code and change the alias to "percent_change" (via .alias("percent_change")), the query is updated and I get this error instead:

- query: No matching signature for function ABS for argument types:
STRING. Supported signatures: ABS(INT64); ABS(FLOAT64);
ABS(NUMERIC); ABS(BIGNUMERIC) at [35:3]

I wonder if I am missing something obvious. Is the output of df.sql() the same query that is executed by the bigquery client during df.show()? Thanks!

eakmanrq commented 1 week ago

Well, that is interesting. To get the SQL that SQLFrame actually executes, you will want to set optimize=False. By default, when printing SQL, SQLFrame prints the SQLGlot-optimized version, which is much more readable but can produce SQL that is incorrect. Can you try optimize=False and let me know if that works for you?

tvalentyn commented 6 days ago

Setting optimize=False produces a query that executes without issues. Is it a bug in the optimizer then?

eakmanrq commented 6 days ago

Most likely but I would need to check the details to make sure.

From SQLFrame's perspective, its contract is to produce accurate non-optimized SQL. As you can probably see, though, that SQL isn't very readable, so if you are sharing the SQL with others to help them understand what the pipeline does, the optimized SQL is better. Just keep in mind that the optimized SQL may not be completely accurate, but if the goal is to share a common understanding, it should be sufficient for that (and likely requires only minor fixes to run, if it doesn't already).

If your goal is to execute the SQL, you can let SQLFrame do that for you, or do what you are doing: output the non-optimized SQL to a file and execute it with whatever tool you want.