Closed tvalentyn closed 5 days ago
Well that is interesting. So to get the SQL that SQLFrame actually executes you will want to set optimize=False
. By default when printing SQL it prints the SQLGlot optimized version which is much more readable but can produce SQL that is incorrect. Can you try optimize=False
and let me know if that works for you?
Setting optimize=False
produces a query that executes without issues. Is it a bug in the optimizer then?
Most likely but I would need to check the details to make sure.
From SQLFrame's perspective it's contract is to produce accurate non-optimized SQL. As you can probably see though that SQL isn't very readable so therefore if you are sharing the SQL with others to help them understand what the pipeline does then the optimized SQL is better. Just need to keep in mind that it is possible that SQL is not completely accurate but if the goal is to share a common understanding then it should be sufficient for that (and likely just requires some minor fixes to work if it doesn't already).
If your goal is to execute the SQL then you can let SQLFrame do that for you or do what you are doing and output non-optimized SQL to a file and execute that using whatever tool you want.
I tried out the example on: https://github.com/eakmanrq/sqlframe using the latest installation of sqlframe[bigquery], generated a SQL statement with
print(df.sql())
, and then tried to feed the output to bq cli client.bq query --use_legacy_sql=False "$(cat sqlframe_output.sql)"
I am getting this error:
If I tweak the dataframe code and change the alias to "percent_change" (via
.alias("percent_change")
), query is updated and I get this error instead:I wonder if I am missing something obvious. Is the output of df.sql() the same query that is executed by the bigquery client during
df.show()
? Thanks!