edanalytics / earthmover

CLI tool for transforming collections of tabular source data into a variety of text-based data formats via YAML configuration and Jinja templates.
Apache License 2.0

Better wildcard select on more column Operations #116

Open jayckaiser opened 1 month ago

jayckaiser commented 1 month ago

PR #110 added a select-all wildcard to ModifyColumnsOperation: using "*" as the column name applies the same column transformation to every column in the dataframe.

There are two additions I'd like to make to this feature:

  1. Allow asterisks to act as wildcards for partial column-name matching, not just for selecting all columns. The matching could be done with Python's standard-library fnmatch module. For example (thanks Tom):

    # given a $sources.X with columns `a_1`, `a_2`, `b_1`, `b_2`, `a_1_b_2`, `b_1_a_2`
    transformations:
      my_test:
        source: $sources.X
        operations:
          - operation: modify_columns
            columns:
              a_*: "{%raw%}{{value|trim}}{%endraw%}" # modifies `a_1`, `a_2`, `a_1_b_2`
              # keys that begin with `*` are quoted so YAML does not parse them as aliases
              "*_1": "{%raw%}{{value|trim}}{%endraw%}" # modifies `a_1`, `b_1`
              "*_1_*": "{%raw%}{{value|trim}}{%endraw%}" # modifies `a_1_b_2`, `b_1_a_2`
              "*": "{%raw%}{{value|trim}}{%endraw%}" # modifies `a_1`, `a_2`, `b_1`, `b_2`, `a_1_b_2`, `b_1_a_2` (all columns)
  2. Extend wildcard matching to more column operations:

    • ModifyColumnsOperation
    • DropColumnsOperation
    • KeepColumnsOperation
    • MapValuesOperation
    • DateFormatOperation
    • SnakeCaseColumnsOperation
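
As a rough illustration of both points, here is a minimal sketch of how a pattern could be expanded into concrete column names with fnmatch. The helper name `expand_column_patterns` is hypothetical and not part of earthmover; the idea is only that each operation listed above could call one shared helper before doing its usual work. It uses `fnmatch.fnmatchcase` rather than `fnmatch.fnmatch` so that matching stays case-sensitive on every platform.

```python
from fnmatch import fnmatchcase


def expand_column_patterns(patterns, columns):
    """Map each pattern to the columns it matches, in dataframe column order.

    Hypothetical helper for illustration only; not an earthmover API.
    """
    return {
        pattern: [col for col in columns if fnmatchcase(col, pattern)]
        for pattern in patterns
    }


# Same columns and patterns as the YAML example above.
columns = ["a_1", "a_2", "b_1", "b_2", "a_1_b_2", "b_1_a_2"]
matches = expand_column_patterns(["a_*", "*_1", "*_1_*", "*"], columns)

for pattern, matched in matches.items():
    print(f"{pattern!r} -> {matched}")
```

Two things this sketch leaves open: a column such as `a_1` is matched by several patterns at once, so the order in which overlapping transformations apply would need to be defined; and fnmatch also treats `?` and `[...]` as special, so column names containing those characters would be read as patterns rather than literals.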