GoogleCloudPlatform / professional-services-data-validator

Utility to compare data between homogeneous or heterogeneous environments to ensure source and target tables match
Apache License 2.0
399 stars 114 forks source link

generate-table-partitions between Teradata and BQ does not work with uppercase primary key names #1136

Closed sundar-mudupalli-work closed 4 months ago

sundar-mudupalli-work commented 4 months ago

Hi,

It appears that generate-table-partitions does not work with primary key names in Teradata that are in upper case. Teradata's column names are not case sensitive.

The DDL that was used was

CREATE SET TABLE udfs.issue1136, FALLBACK (
    N_NATIONKEY INTEGER NOT NULL,
    N_NAME CHAR(25) CHARACTER SET LATIN CASESPECIFIC NOT NULL,
    N_REGIONKEY INTEGER NOT NULL,
    N_COMMENT CHAR(152) CHARACTER SET LATIN CASESPECIFIC NOT NULL )
UNIQUE PRIMARY INDEX ( N_NATIONKEY);

When we try the following command, it works fine data-validation generate-table-partitions -sc teradata -tc teradata -tbls=udfs.issue1136 -pk=N_NATIONKEY -hash '*' -pn 3 -cdir .

The real purpose is to compare two tables from different databases - so we try to compare teradata with BigQuery table defined as follows:

CREATE TABLE pso_data_validator.issue1136 (
  n_nationkey INT64,
  n_name STRING,
  n_regionkey INT64,
  n_comment STRING
);

If we use the command data-validation generate-table-partitions -sc teradata -tc bq -tbls=udfs.issue1136=pso_data_validator.issue1136 -pk=N_NATIONKEY -hash '*' -pn 3 -cdir . you get the error:

    target_count = target_partition_row_builder.get_count()
  File "/home/user/professional-services-data-validator/env/lib/python3.10/site-packages/data_validation/query_builder/partition_row_builder.py", line 65, in get_count
    return self.query[self.primary_keys].count().execute()
...
ibis.common.exceptions.IbisTypeError: Column 'N_NATIONKEY' is not found in table. Existing columns: 'n_nationkey', 'n_name', 'n_regionkey', 'n_comment'.

However if you invert the case and try data-validation generate-table-partitions -sc teradata -tc bq -tbls=udfs.issue1136=pso_data_validator.issue1136 -pk=n_nationkey -hash '*' -pn 3 -cdir . you get the error:

    source_count = source_partition_row_builder.get_count()
  File "/home/user/professional-services-data-validator/env/lib/python3.10/site-packages/data_validation/query_builder/partition_row_builder.py", line 65, in get_count
    return self.query[self.primary_keys].count().execute()
....
ibis.common.exceptions.IbisTypeError: Column 'n_nationkey' is not found in table. Existing columns: 'N_NATIONKEY', 'N_NAME', 'N_REGIONKEY', 'N_COMMENT'.

In the first case, ibis is not able to execute the count operation since the BQ column names are all in lower case. In the second case, ibis is not able to execute the count operation since the Teradata column names are all in upper case. So either case you use - it does not work. We try to get the count to make sure the table sizes are within 10% of each other - if not the partition split is not going to be very even.

I checked with a table defined with lower case column names in Oracle and it return ibis column names in lower case. A bit more investigation is needed.

Suggestions on how to fix ?