GoogleCloudPlatform / professional-services-data-validator

Utility to compare data between homogeneous or heterogeneous environments to ensure source and target tables match
Apache License 2.0
405 stars 119 forks source link

`generate-table-partitions` fails for Spanner with Unknown dialect spanner #1059

Closed sundar-mudupalli-work closed 11 months ago

sundar-mudupalli-work commented 11 months ago

Hi

A typical generate-table-partitions command fails because we are not able to convert the IBIS where clause (i.e. an IBIS table expression) into SQL text. This works for 8 other databases BigQuery, Hive, MySQL, Oracle, Postgres, Snowflake, SQL Server and Teradata (we have test cases for all those databases).

The command is

data-validation generate-table-partitions --secret-manager-type gcp --secret-manager-project-id span-cloud-ck-testing-external --source-conn my_mysql -tc my_spanner --tables-list person.person1=test-db.person1 --hash 'first_name2' --primary-keys ID -cdir gs://dvt-ck-cloud-run-testing/mudupalli/MySp8GPart16 -pn 16

Replacing my_spanner (which points to a Spanner database) with my_mysql - i.e. validate against same database succeeds.

The specific error is below:

  File "/home/mudupalli/professional-services-data-validator/env/lib/python3.9/site-packages/data_validation/partition_builder.py", line 73, in partition_configs
    partition_filters = self._get_partition_key_filters()
  File "/home/mudupalli/professional-services-data-validator/env/lib/python3.9/site-packages/data_validation/partition_builder.py", line 229, in _get_partition_key_filters
    self._extract_where(target_table.filter(filter_target_clause))
  File "/home/mudupalli/professional-services-data-validator/env/lib/python3.9/site-packages/data_validation/partition_builder.py", line 85, in _extract_where
    return re.sub(r"\s\s+", " ", ibis.to_sql(x).sql.split("WHERE")[1]).replace(
  File "/home/mudupalli/professional-services-data-validator/env/lib/python3.9/site-packages/ibis/expr/sql.py", line 381, in to_sql
    (pretty,) = sg.transpile(sql, read=read, write=write, pretty=True)
  File "/home/mudupalli/professional-services-data-validator/env/lib/python3.9/site-packages/sqlglot/__init__.py", line 187, in transpile
    for expression in parse(sql, read, error_level=error_level)
  File "/home/mudupalli/professional-services-data-validator/env/lib/python3.9/site-packages/sqlglot/__init__.py", line 74, in parse
    dialect = Dialect.get_or_raise(read)()
  File "/home/mudupalli/professional-services-data-validator/env/lib/python3.9/site-packages/sqlglot/dialects/dialect.py", line 136, in get_or_raise
    raise ValueError(f"Unknown dialect '{dialect}'")
ValueError: Unknown dialect 'spanner'
sundar-mudupalli-work commented 11 months ago

Hi,

The problem occurs because the Spanner backend behaves differently from the BigQuery backend - perhaps the Spanner backend is not fully implemented. I can work around this issue - however it might be a very simple fix in the Spanner backend. Here is a Python interpreter session demonstrating the issue. BigQuery works fine with the to_sql function returning the sql string. When Spanner is presented with a table expression that execute() correctly, it still fails to render with the to_sql function.

(env) mudupalli@dvt-mudupalli-test001:~/gcsdir$ python3
Python 3.9.2 (default, Feb 28 2021, 17:03:44) 
[GCC 10.2.1 20210110] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import ibis
>>> from data_validation.clients import get_data_client
>>> bq_c = get_data_client({"source_type": "BigQuery", "project_id": "span-cloud-ck-testing-external"})
>>> result_tbl = bq_c.table('span-cloud-ck-testing-external.pso_data_validator.results')
>>> ibis.to_sql(result_tbl)
SQLString(sql='SELECT\n  t0.*\nFROM `span-cloud-ck-testing-external.pso_data_validator.results` AS t0')
>>> span_c = get_data_client({"source_type": "Spanner", "instance_id": "df-test", "database_id": "test-db"})
>>> person_tbl = span_c.table('person1')
>>> person_100 = person_tbl.filter(person_tbl.ID == 100)
>>> person_100.execute()
                                         first_name1  ...   ID
0  dshlfiuadhsfkudashgfkuadshgkaudgshkuagykfuydag...  ...  100

[1 rows x 7 columns]
>>> ibis.to_sql(person_100)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/mudupalli/professional-services-data-validator/env/lib/python3.9/site-packages/ibis/expr/sql.py", line 381, in to_sql
    (pretty,) = sg.transpile(sql, read=read, write=write, pretty=True)
  File "/home/mudupalli/professional-services-data-validator/env/lib/python3.9/site-packages/sqlglot/__init__.py", line 187, in transpile
    for expression in parse(sql, read, error_level=error_level)
  File "/home/mudupalli/professional-services-data-validator/env/lib/python3.9/site-packages/sqlglot/__init__.py", line 74, in parse
    dialect = Dialect.get_or_raise(read)()
  File "/home/mudupalli/professional-services-data-validator/env/lib/python3.9/site-packages/sqlglot/dialects/dialect.py", line 136, in get_or_raise
    raise ValueError(f"Unknown dialect '{dialect}'")
ValueError: Unknown dialect 'spanner'

The issue seems to be that the dialect spanner is not registered with sqlglot. Is there a way to fix this in the backend?

Thanks.

Sundar Mudupalli

nehanene15 commented 11 months ago

The easiest way to do this is to pass in the BQ dialect for Spanner clients in the to_sql() call we make in partition_builder. BQ and Spanner have the same SQL syntax so this should be interchangeable.

This is a separate issue, but Spanner doesn't support ROW_NUMBER() as shown in this error when Spanner is the source system: Command: data-validation -v generate-table-partitions -sc spanner -tc spanner -tbls pso_data_validator.dvt_core_types -pk id -hash '*' -pn 2 -cdir /home/user/professional-services-data-validator/cdir

Error: google.api_core.exceptions.MethodNotImplemented: 501 Unsupported built-in function: row_number.