GoogleCloudPlatform / professional-services-data-validator

Utility to compare data between homogeneous or heterogeneous environments to ensure source and target tables match
Apache License 2.0
405 stars 117 forks source link

Group By validation for Hive uses unsupported syntax #319

Closed nehanene15 closed 2 years ago

nehanene15 commented 3 years ago

Ibis generates a group_by query like so:

SELECT `bikeid`, count(*) AS `count`
FROM `bigquery-public-data.new_york_citibike.citibike_trips`
GROUP BY 1

which is unsupported in Hive. Group_By requires the column name instead of the integer reference.

This will require overwritting the fomrat_group_by() function in third_party/ibis/ibis_impala/compiler.py so that we can reference the column name.

ghost commented 2 years ago

I was able to solve this issue by setting below property in Hive hive.groupby.orderby.position.alias=true

There are two approaches to set this property:

  1. Pass this property as part of cluster creation
  2. Run this command when you create hive connection in the source code.

I followed approach 1 and it fixed the issue.