Closed dhercher closed 4 years ago
Clarifying question: this is specifically for when there is a delta between the casing of source and target, not improving the usability of the CLI?
I've relaxed the case sensitivity on the config manager, however it seems that the internal ibis client also performs a case sensitive validation when the query object is compiled. Attempting to pass an insensitive column results in the following stack trace.
File "/usr/local/google/home/emceehilton/professional-services-data-validator/venv/bin/data-validation", line 8, in <module>
sys.exit(main())
File "/usr/local/google/home/emceehilton/professional-services-data-validator/venv/lib/python3.8/site-packages/data_validation/__main__.py", line 227, in main
run(args)
File "/usr/local/google/home/emceehilton/professional-services-data-validator/venv/lib/python3.8/site-packages/data_validation/__main__.py", line 206, in run
run_validations(args, config_managers)
File "/usr/local/google/home/emceehilton/professional-services-data-validator/venv/lib/python3.8/site-packages/data_validation/__main__.py", line 182, in run_validations
run_validation(config_manager, verbose=args.verbose)
File "/usr/local/google/home/emceehilton/professional-services-data-validator/venv/lib/python3.8/site-packages/data_validation/__main__.py", line 171, in run_validation
validator.execute()
File "/usr/local/google/home/emceehilton/professional-services-data-validator/venv/lib/python3.8/site-packages/data_validation/data_validation.py", line 84, in execute
self.validation_builder.get_source_query()
File "/usr/local/google/home/emceehilton/professional-services-data-validator/venv/lib/python3.8/site-packages/data_validation/validation_builder.py", line 173, in get_source_query
query = self.source_builder.compile(**source_config)
File "/usr/local/google/home/emceehilton/professional-services-data-validator/venv/lib/python3.8/site-packages/data_validation/query_builder/query_builder.py", line 217, in compile
groups = self.compile_group_fields(table)
File "/usr/local/google/home/emceehilton/professional-services-data-validator/venv/lib/python3.8/site-packages/data_validation/query_builder/query_builder.py", line 202, in compile_group_fields
return [field.compile(table) for field in self.grouped_fields]
File "/usr/local/google/home/emceehilton/professional-services-data-validator/venv/lib/python3.8/site-packages/data_validation/query_builder/query_builder.py", line 202, in <listcomp>
return [field.compile(table) for field in self.grouped_fields]
File "/usr/local/google/home/emceehilton/professional-services-data-validator/venv/lib/python3.8/site-packages/data_validation/query_builder/query_builder.py", line 145, in compile
group_field = ibis_table[self.field_name]
File "/usr/local/google/home/emceehilton/professional-services-data-validator/venv/lib/python3.8/site-packages/ibis/expr/types.py", line 409, in __getitem__
return self.get_column(what)
File "/usr/local/google/home/emceehilton/professional-services-data-validator/venv/lib/python3.8/site-packages/ibis/expr/types.py", line 523, in get_column
ref = ops.TableColumn(name, self)
File "/usr/local/google/home/emceehilton/professional-services-data-validator/venv/lib/python3.8/site-packages/ibis/expr/operations.py", line 224, in __init__
super().__init__(name, table)
File "/usr/local/google/home/emceehilton/professional-services-data-validator/venv/lib/python3.8/site-packages/ibis/expr/signature.py", line 183, in __init__
self._validate()
File "/usr/local/google/home/emceehilton/professional-services-data-validator/venv/lib/python3.8/site-packages/ibis/expr/operations.py", line 228, in _validate
raise com.IbisTypeError(
ibis.common.exceptions.IbisTypeError: 'biRTh_yEAr' is not a field in ['tripduration', 'starttime', 'stoptime', 'start_station_id', 'start_station_name', 'start_station_latitude', 'start_station_longitude', 'end_station_id', 'end_station_name', 'end_station_latitude', 'end_station_longitude', 'bikeid', 'usertype', 'birth_year', 'gender', 'customer_plan']
If I'm understanding correctly, the Ibis repo will also need to be updated. @dhercher or @tswast can you please check if my assumption is correct?
talked it over with Dylan and found a workaround. PR submitted.
If a DB has columns in non-lowercase and they get converted to lowercase during ETL (which often happens), then the current CLI makes it diffuclt to match them.
The case below should work even if the source column is named
StartTime
for example.python -m data_validation run -t GroupedColumn -sc my_bq_conn -tc my_bq_conn -tbls '[{"schema_name":"bigquery-public-data.new_york_citibike","table_name":"citibike_trips"}]' --grouped-columns '["starttime"]' --sum '["tripduration"]' --count '["tripduration"]'