datacontract / datacontract-cli

CLI to manage your datacontract.yaml files
https://cli.datacontract.com
Other
468 stars 88 forks source link

Idea: use sqlglot to support different sql dialects #160

Open jverhoeks opened 6 months ago

jverhoeks commented 6 months ago

SQLglot is a great python library to read, parse , translate and write 21 different sql dialects.

https://github.com/tobymao/sqlglot

It supports 21 dialects. This make it able to create the schema in 1 sql language and then translate to the others. Where sqlglot takes care of the syntax. It can also read any create table sql statement and convert into a standard AST for import parsing.

from sqlglot import transpile

ducksql = """CREATE TABLE t1 (
    i INTEGER NOT NULL,
    decimalnr DOUBLE,
    date DATE UNIQUE,
    time TIMESTAMP
);
"""

print("orig duckdb")
print(ducksql)

for dialect in ["bigquery", "spark", "athena", "snowflake"]:
    print(dialect)
    print(transpile(ducksql, read="duckdb", write=dialect, pretty=True)[0])

Output

orig duckdb
CREATE TABLE t1 (
    i INTEGER NOT NULL,
    decimalnr DOUBLE,
    date DATE UNIQUE,
    time TIMESTAMP
);

bigquery
CREATE TABLE t1 (
  i INT64 NOT NULL,
  decimalnr FLOAT64,
  date DATE UNIQUE,
  time DATETIME
)
spark
CREATE TABLE t1 (
  i INT NOT NULL,
  decimalnr DOUBLE,
  date DATE,
  time TIMESTAMP
)
athena
CREATE TABLE t1 (
  i INTEGER NOT NULL,
  decimalnr DOUBLE,
  date DATE UNIQUE,
  time TIMESTAMP
)
snowflake
CREATE TABLE t1 (
  i INT NOT NULL,
  decimalnr DOUBLE,
  date DATE UNIQUE,
  time TIMESTAMPNTZ
)
jochenchrist commented 6 months ago

Awesome! Would greatly fit into https://github.com/datacontract/datacontract-cli/blob/main/datacontract/export/sql_converter.py for different (all supported server types)

Do you have some time to prepare a PR?

jverhoeks commented 6 months ago

What kind of SQL data types you use in the datacontract format? Is that duckdb based?

jochenchrist commented 6 months ago

The Data Contract Specification defines these data types: https://datacontract.com/#data-types

jverhoeks commented 6 months ago

I have finished my other PR https://github.com/datacontract/datacontract-cli/pull/166 I'll work on this now for both import and export

simonharrer commented 6 months ago

Can you elaborate how you would implement this here? We do have a lot of needs for the type mappings for various exports, and sometimes we even export sql. I am not totally convinced how this helps here to be honest.

As most tapes are very similar to snowflake types, one could use those and convert to others. But there are exceptions where we have special type mappings for other technologies. If we map to one sql dialect and then from there, we no longer have access to the original information.

jverhoeks commented 6 months ago

I'm working on a small demo where i use the AST from sqlglot to build a generic import & export (https://github.com/tobymao/sqlglot?tab=readme-ov-file#ast-introspection)

sqlfile + dialect -> AST -> datacontract datacontract -> AST -> sqlfile+dialect

jverhoeks commented 5 months ago

Starting to work on some test code: https://github.com/jverhoeks/test_sqlparse

Datatype mapping is in progress

jochenchrist commented 3 months ago

The challenge with sqlglot is to overwrite the type (or other properties) from the config object https://datacontract.com/#config-object for the target sql dialect.

@jverhoeks's approach sounds interesting, if we find a way to specify the type for the generation process.