Nike-Inc / koheesio

Python framework for building efficient data pipelines. It promotes modularity and collaboration, enabling the creation of complex pipelines from simple, reusable components.
https://engineering.nike.com/koheesio/
Apache License 2.0

[BUG] Incorrect default for ColumnConfig.run_for_all_data_type and ColumnConfig.limit_data_type #85

Open mikita-sakalouski opened 3 weeks ago

mikita-sakalouski commented 3 weeks ago

Describe the bug

Currently we are using the following default values for ColumnConfig.run_for_all_data_type and ColumnConfig.limit_data_type:

run_for_all_data_type: Optional[List[SparkDatatype]] = [None]
limit_data_type: Optional[List[SparkDatatype]] = [None]
data_type_strict_mode: bool = False

and we have a check that validates that run_for_all_data_type is set:

if columns[0] == "*" and not run_for_all_data_type:
    raise ValueError("Cannot use '*' as a column name when no run_for_all_data_type is set")

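but [None] is always truthy, because it is a list with one element (which happens to be None). With the default value, `not run_for_all_data_type` is therefore always False and the check never raises.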
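
The truthiness problem can be reproduced in plain Python, without Koheesio or Spark. The following is a minimal sketch: `check_columns` is a hypothetical stand-in that copies only the condition quoted above, not the library's actual function.

```python
from typing import List, Optional


def check_columns(columns: List[str], run_for_all_data_type: Optional[list]) -> None:
    """Hypothetical stand-in for the validation quoted in this issue."""
    if columns[0] == "*" and not run_for_all_data_type:
        raise ValueError("Cannot use '*' as a column name when no run_for_all_data_type is set")


# A one-element list is truthy even when that element is None.
assert bool([None]) is True
assert bool([]) is False
assert bool(None) is False

# With the current default ([None]) the check is silently skipped.
check_columns(["*"], [None])  # no error raised

# With None as the default the check fires as intended.
try:
    check_columns(["*"], None)
except ValueError as exc:
    print(exc)
```

Only the last call raises, which is the behavior the check was presumably meant to have for an unset value.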

Steps to Reproduce

  1. Instantiate the ColumnConfig class
  2. Access the run_for_all_data_type attribute
  3. Check whether the attribute is set

Expected behavior

The default value should be changed to None, and the code should be fixed so that the check behaves correctly.
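
One possible shape of the fix is sketched below. It is an illustration under assumptions, not the actual Koheesio implementation: `ColumnConfigSketch` and `validate_columns` are hypothetical names, a plain dataclass stands in for the real class, and strings stand in for `SparkDatatype`.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ColumnConfigSketch:
    """Hypothetical, simplified stand-in for ColumnConfig."""

    # None (not [None]) means "not set", so a truthiness check works.
    run_for_all_data_type: Optional[List[str]] = None
    limit_data_type: Optional[List[str]] = None
    data_type_strict_mode: bool = False


def validate_columns(columns: List[str], config: ColumnConfigSketch) -> List[str]:
    """Reject '*' unless run_for_all_data_type is actually set."""
    if columns and columns[0] == "*" and not config.run_for_all_data_type:
        raise ValueError("Cannot use '*' as a column name when no run_for_all_data_type is set")
    return columns


# '*' is accepted once a data type is configured (the value "string" is a placeholder).
validate_columns(["*"], ColumnConfigSketch(run_for_all_data_type=["string"]))

# '*' is rejected with the new default.
try:
    validate_columns(["*"], ColumnConfigSketch())
except ValueError as exc:
    print(exc)
```

Any code that iterates over these attributes would also need to handle None after the change, since the old default was always a list.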