Closed · wendy-aw closed this 2 months ago
Woot! Thank you! Generally looks good – but just one point.
Wouldn't it be better for us to upload already-translated question files for the different dialects? That way, users would face much less complexity when using this for non-Postgres/Redshift/Snowflake dialects!
So essentially, maybe we can use the script and upload the files once (saving everyone else compute and LLM tokens :D), while leaving these scripts as they are, so that users who want to modify the queries or add their own can still use them?
Thanks for the extra additions!
- Added `translate_sql_dialect.py`, which takes in a csv file from the `data/` folder and translates it into BigQuery, MySQL, T-SQL or SQLite. This will add one .sql file per data file per dialect into the same folder. The table metadata used for translation comes from the `defog-data` repo.
- As the `questions_gen_postgres` file has multiple correct SQL options in the `query` column, the script accommodates this by translating multiple SQLs per row. However, the translated query is not guaranteed to produce the same dataframe result as in the original dialect.
- The actual .sql files for these different dialects will come in the next PR, after manual verification. Once these are in, evals can be performed in these various dialects.
- Minor change to the original `questions_gen_postgres.csv`: removed the schema prefix.
- Minor change in `eval/eval.py`: fill NA values in the dataframe with -99999 to allow for comparison. Previously it would error out.
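The sentinel fill matters because NaN never compares equal to NaN, so an element-wise comparison of two otherwise identical dataframes fails on NA cells. A minimal illustration (the -99999 sentinel matches the one mentioned above; the sample data is made up for the example):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"a": [1.0, np.nan]})
df2 = pd.DataFrame({"a": [1.0, np.nan]})

# Element-wise equality treats NaN != NaN, so the NA row "fails"
naive_match = bool((df1 == df2).all().all())   # False

# Filling NA with a sentinel makes the NA cells comparable
sentinel = -99999
filled_match = bool(
    (df1.fillna(sentinel) == df2.fillna(sentinel)).all().all()
)  # True
print(naive_match, filled_match)
```

One caveat of the sentinel approach: a dataframe that legitimately contains -99999 would collide with a filled NA, which is presumably acceptable for these eval queries.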