GoogleCloudPlatform / professional-services-data-validator

Utility to compare data between homogeneous or heterogeneous environments to ensure source and target tables match
Apache License 2.0
394 stars 110 forks source link

custom query row validation has inconsistent column case to table validation #992

Closed nj1973 closed 11 months ago

nj1973 commented 11 months ago

This example came up on an Oracle to PostgreSQL validation. Default identifier case in Oracle is uppercase. Default identifier case in PostgreSQL is lowercase.

Oracle test table:

create table dvt_test.tab_case (id number(6), col_data varchar2(20));
insert into dvt_test.tab_case values (1,'Hello');
commit;

PostgreSQL test table:

create table dvt_test.tab_case (id numeric(6), col_data varchar(20));
insert into dvt_test.tab_case values (1,'Hello');

Table level row validation

data-validation validate row \
 -sc ora_local -tc pg_local \
 -tbls=dvt_test.tab_case \
 --primary-keys=id \
 --comparison-fields=col_data

This works fine and I can see from the debugger that the table objects in query_builder.compile() are:

AlchemyTable: tab_case
  id       decimal(6, 0)
  col_data string

AlchemyTable: tab_case
  id       decimal(6, 0)
  col_data string

Notice the upper case Oracle columns have been converted to lower case.

Custom query row validation

data-validation validate custom-query row \
 -sc ora_local -tc pg_local \
 --source-query='select * from dvt_test.tab_case' \
 --target-query="select * from dvt_test.tab_case" \
 --primary-keys=id \
 --comparison-fields=col_data
...
    raise com.IbisTypeError(
ibis.common.exceptions.IbisTypeError: Column 'col_data' is not found in table. Existing columns: 'ID', 'COL_DATA'.

I can see from the debugger that the table object in query_builder.pt.compile() is:

SQLQueryResult
  query: 'select * from dvt_test.tab_case'
  schema:
    ID       decimal(6, 0)
    COL_DATA string

The columns remain upper case.

If I change --comparison-fields to upper case we then fail on the PostgreSQL side.

data-validation validate custom-query row \
 -sc ora_local -tc pg_local \
 --source-query='select * from dvt_test.tab_case' \
 --target-query="select * from dvt_test.tab_case" \
 --primary-keys=id \
 --comparison-fields=COL_DATA
...
    raise com.IbisTypeError(
ibis.common.exceptions.IbisTypeError: Column 'COL_DATA' is not found in table. Existing columns: 'id', 'col_data'.

The workaround is to force the Oracle query to result in a lower case column.

I think we are inconsistent somewhere in the code. For table validations column names are lower cased, for custom query validations they are not.

nj1973 commented 11 months ago

I found that column names in an Ibis table object are lower cased, in sqla_table in the line below from ibis/backends/base/sql/alchemy/__init__.py

519             sqla_table = self._get_sqla_table(name, database=database, schema=schema)

I could not get to where exactly they are being lower cased.

I've made custom query column names also lower case via a change in DVT's _metadata method override of the Oracle client. I'm not particularly happy with it (feels grubby) but I'm not sure where to look for a better solution. Any thoughts @nehanene15 ?

nehanene15 commented 11 months ago

I wonder if this has to do with us building configs? For column/row validations we use casefold() to lowercase column names so the user can provide case insensitive names. Here is an example for column aggs/PKs.

I wonder if the config generated for regular validation and custom query is different since custom query uses the SQLQueryResult instead of the SQLAlchemyTable.

I'm also curious if other DB clients have this issue with custom query?

nj1973 commented 11 months ago

I can get the same exception if I pass upper case column names to SQL Server:

data-validation -v validate custom-query row -sc sqlserver -tc bq \
 --source-query="select ID,COL_INT8,COL_INT16,COL_INT32 from pso_data_validator.dvt_core_types" \
 --target-query="select id,col_int8,col_int16,col_int32 from pso_data_validator.dvt_core_types" \
 --primary-keys=id --comparison-fields="col_int16,col_int32"
...
  File "/home/user/professional-services-data-validator/env/lib/python3.10/site-packages/ibis/expr/operations/generic.py", line 39, in __init__
    raise com.IbisTypeError(
ibis.common.exceptions.IbisTypeError: Column 'col_int16' is not found in table. Existing columns: 'ID', 'COL_INT8', 'COL_INT16', 'COL_INT32'.

I guess the difference is that Oracle upper cases columns by default and most other systems lower case them. I didn't expect SQL Server to honour the case I used in the source query.

It seems this is a wider problem than just Oracle so, for now, I'll back out my previous change.

nj1973 commented 11 months ago

Note to self.

When validating an Oracle table, not custom query, we pass through

professional-services-data-validator/env/lib/python3.10/site-packages/sqlalchemy/dialects/oracle/base.py

    @reflection.cache
    def get_columns(self, connection, table_name, schema=None, **kw):
...
        c = connection.execute(sql.text(text), params)

        for row in c:
            colname = self.normalize_name(row[0])
            orig_colname = row[0]

Notice the use of self.normalize_name(row[0]). row[0] matches the data dictionary and is upper case by default in Oracle. self.normalize_name(row[0]) converts it to lower case.

The code to normalize_name() is here: sqlalchemy/engine/default.py(769)normalize_name()

I think we need to build normalize_name() into the flow for custom-query somewhere.