dbt-labs / dbt-external-tables

dbt macros to stage external sources
https://hub.getdbt.com/dbt-labs/dbt_external_tables/latest/
Apache License 2.0
297 stars 119 forks source link

Snowflake External Tables - Support for multiple string values for SQL NULLs #118

Closed VasuBalakrishnan closed 1 year ago

VasuBalakrishnan commented 2 years ago

This is an extension to issue #63.

I would like to propose extending this feature similar to CREATE FILE FORMAT's NULL_IF parameter so that we can treat multiple values to be NULLs instead of just single 'null' in the input.

Proposed Solution:

Extend the external yaml section with a null_if parameter

external:
   null_if: ('null', '', 'n/a')

Update the snowflake__create_external_table macro to read the above setting and modify the case statement to include these values

  {%- set null_if = external.null_if if external.null_if else ('null', '') %}
  ...
 (case when is_null_value({{col_id}}) or lower({{col_id}}) in {{null_if}} then null else {{col_id}} end)
  ....

This would translate the above input null_if values to

ColName float as ((case when is_null_value(value:c1) or lower(value:c1) in ('null', '', 'n/a') then null else value:c1 end)::float)
jtcohen6 commented 2 years ago

@VasuBalakrishnan Thanks for opening, and for bringing my attention to the NULL_IF format parameter.

Instead of reinventing the wheel with our own SQL logic, it sounds like you could achieve the desired behavior via:

external:
   file_format: "(
     type = csv
     null_if = (null', '', 'n/a')
   )"

That feels better than trying to write the perfect SQL column expression on our own, with all the same configurability as Snowflake's format type options.

As an unrelated note: It does seem worthwhile to refactor file_format to support dictionary specification, in addition to strings, so that you could write this a little more nicely, e.g.:

external:
   file_format:
     type: csv
     null_if: "('null', '', 'n/a')"
jeremyyeo commented 2 years ago

Linking https://github.com/dbt-labs/dbt-external-tables/issues/132 here too.

I think adding null_if to the end of the create or replace external table ... file format ... statement would be a good idea to avoid expanding case statements as you pointed out @jtcohen6.

With your suggested pattern / refactor to support dict, I think we should be able to just change these bits:

https://github.com/dbt-labs/dbt-external-tables/blob/4ea5203aa22ccf55222619c784de5c3b10cbeac3/macros/plugins/snowflake/create_external_table.sql#L34-L36

To:

    {% if external.integration -%} integration = '{{external.integration}}' {%- endif %}
    file_format = {{external.file_format.type}}
    {% if external.file_format.null_if -%} null_if = {{external.file_format.null_if}} {%- endif %}
{% endmacro %}
jtcohen6 commented 2 years ago

To consolidate our findings, and my current feelings:

github-actions[bot] commented 1 year ago

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

github-actions[bot] commented 1 year ago

Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment to notify the maintainers.