GoogleCloudPlatform / professional-services-data-validator

Utility to compare data between homogeneous or heterogeneous environments to ensure source and target tables match
Apache License 2.0
385 stars 108 forks source link

custom-query validations broken on Hive backend #1162

Closed nehanene15 closed 1 month ago

nehanene15 commented 1 month ago

When running custom-query column or row validations, the query generated is invalid. It appends a "t0" prefix to the column names which creates an invalid query.

Command to reproduce:

data-validation -v validate custom-query column -sc hive -tc hive -sq "select id, name as count_name from default.mascot" -tq "select id, name as count_name from default.mascot"

Query generated:

SELECT count(1) AS `count`
FROM (
  SELECT t1.`t0.id`, t1.`t0.name`
  FROM (
    select id, name from default.mascot
  ) t1
) t0

Error: impala.error.HiveServer2Error: Error while compiling statement: FAILED: SemanticException [Error 10002]: Line 3:12 Invalid column reference 't0.id'

We should add a custom query integration test for Hive as part of this PR.

nehanene15 commented 1 month ago

Issue stems here: https://github.com/GoogleCloudPlatform/professional-services-data-validator/blob/develop/data_validation/clients.py#L146

If I print (iq.columns), it returns ['t0.id', 't0.name'] when it should be ['id', 'name']

Raniksingh commented 1 month ago

Possible fix is to remove t0 by adding below line. prefix which is getting generate at this function

cur.description = [(x[0].replace('t0.', '', 1), *x[1:]) for x in cur.description]

Initial test worked fine. But need thorough testing.
cc - @piyushsarraf