Sanitize columns - Githubissues

radbrt commented 2 years ago

Using the tap-sftp in combination with a loader such at target-snowflake can trigger errors due to invalid column names. The header in the CSV files can contain characters that are not valid as a snowflake column name.

I personally think this should be handled in the loader, because what constitutes a valid column name varies from target to target.

@pnadolny13 suggested looking at mappings to solve this, and a regex-based schema mapper looks like a good generic solution. Regardless of what we do with this tap, I think such a mapper would be a useful generic solution when there are issues between taps and targets.

It seems to me that the argument for creating this sanitizer in the tap is mostly pragmatic convenience. Special characters in headers are common in CSV files. Adding an optional keyword in the table config that could solve most issues that arise from column names, would mean a lot of users won't have to add a mapper.

In order to solve our immediate problem, I already created a basic sanitizing option. The main part: https://github.com/radbrt/tap-sftp/blob/a8ce9cc8e9598a64266024d456b30aa6d1b85bb7/tap_sftp/singer_encodings/csv_handler.py#L42

An example config:

      tables:
      - table_name: inc_data
        search_prefix: ''
        search_pattern: inc_.*
        key_properties: [key1]
        sanitize_headers: true
        delimiter: ','
        encoding: utf-8

pnadolny13 commented 2 years ago

@radbrt yeah I think it should be on the target to sanitize the column names but having an optional setting in the tap isn't a problem. I like the idea of having the default behavior stay as-is sending the unaltered column names but if a user has a need for it they can use the sanitize_headers config option!

@edgarrmondragon @aaronsteers in case you have thoughts too?

Also as an FYI, we have a new target-snowflake in the works that will be supported by the Meltano team and something like column name sanitization will definitely be handled but its still in development right now. So hopefully over time the need for this feature will grow smaller and smaller, but as of today I see it as a good addition to the tap!

edgarrmondragon commented 2 years ago

a regex-based schema mapper looks like a good generic solution

That would definitely be useful, and it's something that could be added to the SDK mapper class. Though I'm not sure we have a way for functions to operate on column names, only on field values 🤔.

aaronsteers commented 2 years ago

I love this idea, and it would be great to have something like this in the sdk. As of today, you can alias columns and streams, but rule-based renames and exclusions don't exist yet.

Would something like this work in the stream map config:

config:
  stream_maps:
    users:
      # Normal stream config here.
      # ...
      # New stuff:
      __rename_properties__:
        # One or more regex replacement strings here
      __exclude_properties__:
        # One or more regex match strings here
    __global__:
      __rename_streams__:
        # One or more regex replacement strings here
      __exclude_streams__:
        # One or more regex match strings here
      __rename_properties__:
        # One or more regex replacement strings here
      __exclude_properties__:
        # One or more regex match strings here

The __global__ option sets would be completely new, and would operate on all matching streams. So, within a stream's named config, property renames would only operate on that stream's priorities, while at the top level, property renaming and exclusion rules would apply to all streams' properties.

The only gotcha I can think of is the order of operations for renames, as related to other transformations that would apply to an explicitly named stream or property. My intuition is that regex-based renames and exclusions should be the last step (logically speaking, I mean), but I'd be open to other ideas as well.

What do you all think of this general approach?

aaronsteers commented 2 years ago

A nice thing about putting it into the SDK is that these operations would be available in both the tap and target layer (if one or both are built on the SDK), or in between them (with a standalone mapper plugin) if neither is.

MeltanoLabs / tap-sftp

Sanitize columns #4