GoogleCloudPlatform / professional-services-data-validator

Utility to compare data between homogeneous or heterogeneous environments to ensure source and target tables match
Apache License 2.0
405 stars 117 forks source link

Support Ignoring Case for DB columns supplied in grouped or aggregated config #91

Closed dhercher closed 4 years ago

dhercher commented 4 years ago

If a DB has columns in non-lowercase and they get converted to lowercase during ETL (which often happens), then the current CLI makes it diffuclt to match them.

The case below should work even if the source column is named StartTime for example.

python -m data_validation run -t GroupedColumn -sc my_bq_conn -tc my_bq_conn -tbls '[{"schema_name":"bigquery-public-data.new_york_citibike","table_name":"citibike_trips"}]' --grouped-columns '["starttime"]' --sum '["tripduration"]' --count '["tripduration"]'

renzokuken commented 4 years ago

Clarifying question: this is specifically for when there is a delta between the casing of source and target, not improving the usability of the CLI?

renzokuken commented 4 years ago

I've relaxed the case sensitivity on the config manager, however it seems that the internal ibis client also performs a case sensitive validation when the query object is compiled. Attempting to pass an insensitive column results in the following stack trace.

 File "/usr/local/google/home/emceehilton/professional-services-data-validator/venv/bin/data-validation", line 8, in <module>
    sys.exit(main())
  File "/usr/local/google/home/emceehilton/professional-services-data-validator/venv/lib/python3.8/site-packages/data_validation/__main__.py", line 227, in main
    run(args)
  File "/usr/local/google/home/emceehilton/professional-services-data-validator/venv/lib/python3.8/site-packages/data_validation/__main__.py", line 206, in run
    run_validations(args, config_managers)
  File "/usr/local/google/home/emceehilton/professional-services-data-validator/venv/lib/python3.8/site-packages/data_validation/__main__.py", line 182, in run_validations
    run_validation(config_manager, verbose=args.verbose)
  File "/usr/local/google/home/emceehilton/professional-services-data-validator/venv/lib/python3.8/site-packages/data_validation/__main__.py", line 171, in run_validation
    validator.execute()
  File "/usr/local/google/home/emceehilton/professional-services-data-validator/venv/lib/python3.8/site-packages/data_validation/data_validation.py", line 84, in execute
    self.validation_builder.get_source_query()
  File "/usr/local/google/home/emceehilton/professional-services-data-validator/venv/lib/python3.8/site-packages/data_validation/validation_builder.py", line 173, in get_source_query
    query = self.source_builder.compile(**source_config)
  File "/usr/local/google/home/emceehilton/professional-services-data-validator/venv/lib/python3.8/site-packages/data_validation/query_builder/query_builder.py", line 217, in compile
    groups = self.compile_group_fields(table)
  File "/usr/local/google/home/emceehilton/professional-services-data-validator/venv/lib/python3.8/site-packages/data_validation/query_builder/query_builder.py", line 202, in compile_group_fields
    return [field.compile(table) for field in self.grouped_fields]
  File "/usr/local/google/home/emceehilton/professional-services-data-validator/venv/lib/python3.8/site-packages/data_validation/query_builder/query_builder.py", line 202, in <listcomp>
    return [field.compile(table) for field in self.grouped_fields]
  File "/usr/local/google/home/emceehilton/professional-services-data-validator/venv/lib/python3.8/site-packages/data_validation/query_builder/query_builder.py", line 145, in compile
    group_field = ibis_table[self.field_name]
  File "/usr/local/google/home/emceehilton/professional-services-data-validator/venv/lib/python3.8/site-packages/ibis/expr/types.py", line 409, in __getitem__
    return self.get_column(what)
  File "/usr/local/google/home/emceehilton/professional-services-data-validator/venv/lib/python3.8/site-packages/ibis/expr/types.py", line 523, in get_column
    ref = ops.TableColumn(name, self)
  File "/usr/local/google/home/emceehilton/professional-services-data-validator/venv/lib/python3.8/site-packages/ibis/expr/operations.py", line 224, in __init__
    super().__init__(name, table)
  File "/usr/local/google/home/emceehilton/professional-services-data-validator/venv/lib/python3.8/site-packages/ibis/expr/signature.py", line 183, in __init__
    self._validate()
  File "/usr/local/google/home/emceehilton/professional-services-data-validator/venv/lib/python3.8/site-packages/ibis/expr/operations.py", line 228, in _validate
    raise com.IbisTypeError(
ibis.common.exceptions.IbisTypeError: 'biRTh_yEAr' is not a field in ['tripduration', 'starttime', 'stoptime', 'start_station_id', 'start_station_name', 'start_station_latitude', 'start_station_longitude', 'end_station_id', 'end_station_name', 'end_station_latitude', 'end_station_longitude', 'bikeid', 'usertype', 'birth_year', 'gender', 'customer_plan']

If I'm understanding correctly, the Ibis repo will also need to be updated. @dhercher or @tswast can you please check if my assumption is correct?

renzokuken commented 4 years ago

talked it over with Dylan and found a workaround. PR submitted.