[CT-1216] [Bug] {{ concat }} macro not working on Spark / Databricks

dbt-labs / dbt-core

dbt enables data analysts and engineers to transform their data using the same practices that software engineers use to build applications.

https://getdbt.com

Apache License 2.0

9.65k stars 1.6k forks source link

[CT-1216] [Bug] {{ concat }} macro not working on Spark / Databricks #5888

Closed jelstongreen closed 2 years ago

jelstongreen commented 2 years ago

Is this a new bug in dbt-core?

[X] I believe this is a new bug in dbt-core
[X] I have searched the existing issues, and I could not find an existing issue for this bug

Current Behavior

hash(concat(['a','b','c']))

compiles to:

md5(cast(a, b, c as string))

Expected Behavior

The behaviour should mimic the previously used :

dbt_utils.hash(dbt_utils.concat(['a','b','c']))

which compiles too:

md5(cast(concat(a, b, c) as string))

Steps To Reproduce

Create an example model containing:

{% set native_query = hash(concat(['a','b','c'])) %}

{{ log(native_query, info=True) }}

{% set dbt_utils_query = dbt_utils.hash(dbt_utils.concat(['a','b','c'])) %}

{{ log(dbt_utils_query, info=True) }}

SELECT 1

Relevant log output

17:00:51  1 of 1 START view model global_update_metrics.test ............................. [RUN]
17:00:51  md5(cast(a, b, c as string))
17:00:51  Warning: the `hash` macro is now provided in dbt Core. It is no longer available in dbt_utils and backwards compatibility will be removed in a future version of the package. Use `hash` (no prefix) instead. The datalake_models.test model triggered this warning.
17:00:51  md5(cast(concat(a, b, c) as string))
17:00:51  1 of 1 OK created view model global_update_metrics.test ........................ [OK in 0.87s]


### Environment

```markdown
- OS: Mac 12.6
- Python: 3.8.12
- dbt: 1.2.1

Which database adapter are you using with dbt?

spark

Additional Context

Databricks

jtcohen6 commented 2 years ago

@jelstongreen Thanks for opening, and providing the clear reproduction case. Given the log message you're seeing, it looks like the concat() call gets skipped over entirely. I'm not sure why that would be...

Which version of dbt_utils do you have installed? I'm not able to reproduce this locally using the latest version. I see dbt correctly calling spark__concat, and including it in the rendered code:

# packages
packages:
  - package: dbt-labs/dbt_utils
    version: 0.9.2

{% set native_query = hash(concat(['a','b','c'])) %}

{{ log("Without namespace: " ~ native_query, info=True) }}

{% set dbt_utils_query = dbt_utils.hash(dbt_utils.concat(['a','b','c'])) %}

{{ log("With dbt_utils namespace: " ~ dbt_utils_query, info=True) }}

SELECT 1

$ dbt parse
17:37:38  Running with dbt=1.2.1
17:37:38  Start parsing.
17:37:38  Dependencies loaded
17:37:38  ManifestLoader created
17:37:38  Without namespace: md5(cast(concat(a, b, c) as string))
17:37:38  Warning: the `concat` macro is now provided in dbt Core. It is no longer available in dbt_utils and backwards compatibility will be removed in a future version of the package. Use `concat` (no prefix) instead. The test.my_model model triggered this warning.
17:37:38  Warning: the `hash` macro is now provided in dbt Core. It is no longer available in dbt_utils and backwards compatibility will be removed in a future version of the package. Use `hash` (no prefix) instead. The test.my_model model triggered this warning.
17:37:38  With dbt_utils namespace: md5(cast(concat(a, b, c) as string))
17:37:38  Manifest loaded
17:37:38  Manifest checked
17:37:38  Flat graph built
17:37:38  Manifest loaded
17:37:38  Performance info: target/perf_info.json
17:37:38  Done.
...

jelstongreen commented 2 years ago

Hey @jtcohen6 thanks for looking into this! These are my packages:

packages:
  - package: dbt-labs/dbt_utils
    version: 0.9.2
  - package: dbt-labs/spark_utils
    version: 0.3.0
  - package: dbt-labs/codegen
    version: 0.8.0
  - package: brooklyn-data/dbt_artifacts
    version: 2.0.0

I'm thinking this could be do with my dispatch config which I've not really updated since first starting out with dbt - does this need to be amended?

dispatch:
  - macro_namespace: dbt_utils
    search_order:
      - my_project
      - spark_utils
      - dbt_utils

jtcohen6 commented 2 years ago

@jelstongreen Yeah! What happens if you remove that dispatch config entirely? The migration of utility macros into dbt-core should mean that "shim" packages like spark_utils are no longer needed here

jelstongreen commented 2 years ago

@jtcohen6 Unfortunately no change with that section removed:

23:47:14  Running with dbt=1.2.1
23:47:14  Start parsing.
23:47:14  Dependencies loaded
23:47:14  ManifestLoader created
23:47:17  Without namespace md5(cast(a, b, c as string))
23:47:17  Warning: the `concat` macro is now provided in dbt Core. It is no longer available in dbt_utils and backwards compatibility will be removed in a future version of the package. Use `concat` (no prefix) instead. The datalake_models.test model triggered this warning.
23:47:17  Warning: the `hash` macro is now provided in dbt Core. It is no longer available in dbt_utils and backwards compatibility will be removed in a future version of the package. Use `hash` (no prefix) instead. The datalake_models.test model triggered this warning.
23:47:17  With dbt_utils namespace: md5(cast(concat(a, b, c) as string))
23:47:31  Manifest loaded
23:47:31  Manifest checked
23:47:32  Flat graph built
23:47:32  Manifest loaded
23:47:32  Performance info: target/perf_info.json
23:47:32  Done.

jelstongreen commented 2 years ago

I've worked this one out:

We had an internal macro called concat which when we were using dbt_utils.concat would not conflict because of the explicit namespace. This was the same for all packages which prefixed their use of concat with dbt_utils. However, as these packages now use concat with no package prefix it was conflicting with our internal macro. I'm not sure why our internal macro would just not provide any output at all in this case but I have renamed the macro to something more unique to our use case (concatenation with affixes).

Probably no need for any resolution here although it would be handy to know if an internal macro is conflicting with a built in macro of the same name, particularly after a version change as can be difficult to determine otherwise. Thanks for your help @jtcohen6 !

jtcohen6 commented 2 years ago

@jelstongreen Ah, thanks for the update!

Definitely open to thinking about how it could be easier to debug these ones in the future. It's true that there's more possibility for namespace collision when calling macros from the global namespace, though with the benefit of making it simpler for folks to override specific macro behavior when they see the need.

Unfortunately, because this override happens during macro resolution, rather than dispatch resolution (subtle distinction), there's no way to identify at parse time / via depends_on.macros that dbt_utils.surrogate_key will be calling my_project.concat instead of dbt_utils.concat. This is what appears in manifest.json:

    "macro.dbt_utils.default__surrogate_key": {
      "unique_id": "macro.dbt_utils.default__surrogate_key",
      "package_name": "dbt_utils",
      "root_path": "/Users/jerco/dev/scratch/testy/dbt_packages/dbt_utils",
      "path": "macros/sql/surrogate_key.sql",
      "original_file_path": "macros/sql/surrogate_key.sql",
      "name": "default__surrogate_key",
      "macro_sql": "\n\n{%- macro default__surrogate_key(field_list) -%}\n\n{%- if varargs|length >= 1 or field_list is string %}\n\n{%- set error_message = '\nWarning: the `surrogate_key` macro now takes a single list argument instead of \\\nmultiple string arguments. Support for multiple string arguments will be \\\ndeprecated in a future release of dbt-utils. The {}.{} model triggered this warning. \\\n'.format(model.package_name, model.name) -%}\n\n{%- do exceptions.warn(error_message) -%}\n\n{# first argument is not included in varargs, so add first element to field_list_xf #}\n{%- set field_list_xf = [field_list] -%}\n\n{%- for field in varargs %}\n{%- set _ = field_list_xf.append(field) -%}\n{%- endfor -%}\n\n{%- else -%}\n\n{# if using list, just set field_list_xf as field_list #}\n{%- set field_list_xf = field_list -%}\n\n{%- endif -%}\n\n\n{%- set fields = [] -%}\n\n{%- for field in field_list_xf -%}\n\n    {%- set _ = fields.append(\n        \"coalesce(cast(\" ~ field ~ \" as \" ~ type_string() ~ \"), '')\"\n    ) -%}\n\n    {%- if not loop.last %}\n        {%- set _ = fields.append(\"'-'\") -%}\n    {%- endif -%}\n\n{%- endfor -%}\n\n{{ hash(concat(fields)) }}\n\n{%- endmacro -%}",
      "resource_type": "macro",
      "tags": [],
      "depends_on": {
        "macros": [
          "macro.dbt_utils.type_string",
          "macro.dbt_utils.hash",
          "macro.dbt_utils.concat"
        ]
      },

FYI @dbeatty10 @joellabes - this is an issue that would have been prevented by calling the macro explicitly as dbt.concat, instead of just concat, within dbt_utils.surrogate_key. I know we've seen other issues with the former approach, though: #5720.

I'm going to close this specific issue as resolved for now.

joellabes commented 1 year ago

Oh boy, I had no idea that dbt.something() and something() would give different results! I thought it was purely down to clarity or not.

Personally I prefer {{ dbt.something() }} because it makes it more obvious where the magic is coming from (and I just had an issue where someone else was confused by this: https://github.com/dbt-labs/dbt-utils/issues/666). I think you and Doug decided the opposite in my absence and I was OK with that when I thought it didn't change anything.

Now that it does... 🤔 which failure case is more common/difficult to work around? This or the one raised in 5720?

jtcohen6 commented 1 year ago

Oh boy, I had no idea that dbt.something() and something() would give different results! I thought it was purely down to clarity or not.

@joellabes To be clear, they yield different results only if you've defined a macro in your own project named {{ concat() }}, which then wins the race for global macro resolution. Whereas, if you've defined a macro in your project named {{ default__concat() }}, that wins the race for dispatch in the dbt namespace. I think the latter is preferable, because it's a bit harder to find yourself there accidentally, and the dependency shows up in depends_on.macros.

So, I also find myself leaning toward explicit namespaces... I don't remember exactly why @dbeatty10 and I opted for prefixless, but knowing us, there's definitely a thread somewhere.

The bug in https://github.com/dbt-labs/dbt-core/issues/5720 is decidedly a bug, not just a point of user confusion — it's something we can try to fix. It's a gnarly bug, though. We can invest more time in trying to resolve it, which would clear the way toward us being able to recommend explicit namespaces unambiguously, but it won't be fixed by dbt-core v1.3 + dbt-utils v1.0. At the same time, the issue will just go away when we remove the deprecated back-compat macros from dbt_utils.

dbeatty10 commented 1 year ago

Wholeheartedly agree with using explicit namespaces IFF we can't avoid bugs otherwise! i.e., {{ dbt.type_string() }} rather than {{ type_string() }}

@jtcohen6 our original discussion was basically this:

Include dbt. prefix for macros? Or drop?

The result feels more readable without the prefix to me.

I prefer without the prefix, too.

An underlying assumption for me was that with/without prefix was equivalent. We should be explicit if we will have bugs otherwise. We will need to add dbt. prefixes to all the code examples here.

To fully unpack my ordered preferences for syntax (if we could create our ideal world):

select concat(['d', 'b', 't'])
select {{ concat(['d', 'b', 't']) }}
select {{ dbt.concat(['d', 'b', 't']) }}

In the first instance, all the Jinja has melted away as-if concat were a SQL standard function implemented by all databases (or if there existed some kind of black magic that made it appear that way 😉 ).

jtcohen6 commented 1 year ago

@dbeatty10 I think I may have a fix for the bug that crops up when users explicitly specify dbt.*: https://github.com/dbt-labs/dbt-core/pull/5907

Given that, I'm inclined to encourage explicit dbt.* in user projects and packages, to avoid unintended namespace collisions. (We're not the only people who have this problem: https://stackoverflow.com/questions/70458086/how-to-correctly-import-pyspark-sql-functions)

We will need to add dbt. prefixes to all the code examples here.

✅ We can open an issue for this.

@joellabes I imagine we'd also want to update the references in dbt-utils, which might be a much bigger lift, and warrants opening a separate issue.

dbeatty10 commented 1 year ago

We will need to add dbt. prefixes to all the code examples here.

✅ We can open an issue for this.

I already have a PR open for this page in our documentation, so I'll add it there:

https://github.com/dbt-labs/docs.getdbt.com/pull/2050

joellabes commented 1 year ago

To fully unpack my ordered preferences for syntax (if we could create our ideal world):

My preferences are actually the exact inverse of yours! If we are being clever, then I want to see where the magic is coming from. Mostly because:

as-if concat were a SQL standard function implemented by all databases

is a great greenfield stance to take, but it's already a SQL standard function implemented by a lot of databases, so this would open us up to wild amounts of clashes - it would really only be possible with the dbt proxy server rewriting everything before the warehouse saw it. Which isn't impossible I guess! But certainly a surprise for the new employee when they discover that something is swizzling their functions from under them. Maybe I'm thinking too small? I'm open to the scales falling from my eyes

I think I may have a fix for the bug that crops up when users explicitly specify dbt.*: #5907

:infinitypartyparrot: