dbt-labs / dbt-project-evaluator

This package contains macros and models to find DAG issues automatically
https://dbt-labs.github.io/dbt-project-evaluator/latest/
Apache License 2.0

redshift__type_string overwrite is causing all existing models to be marked as modified #469

Closed mweso-softserve closed 1 month ago

mweso-softserve commented 3 months ago

Describe the bug

I configured dbt_project_evaluator with the dispatch config required for Redshift. My intention is to use it in a slim CI. I have a job in dbt Cloud that defers to the production environment. After configuring dispatch, all models are recognised as modified.

Steps to reproduce

  1. Configure the dbt_project_evaluator package with dispatch in an existing Redshift project
  2. Run
    dbt ls --select state:modified+ --exclude package:dbt_project_evaluator
  3. See that all models are reported as modified

Expected results

After package configuration, no models should be marked as modified. The package should be implemented in a way that does not overwrite core macros for all existing models.

Actual results

All existing models are marked as modified.

Screenshots and log output

System information

The contents of your packages.yml file:


dispatch:
  - macro_namespace: dbt
    search_order: ['dbt_project_evaluator','dbt']

Which database are you using dbt with?

The output of dbt --version:

dbt Cloud CLI - 0.38.4 (0e2097036f895b2f615c53e144e625f88c8f9012 2024-06-21T16:58:57Z)

dbt version:

12:26:46 Running with dbt=1.6.16
12:26:47 Registered adapter: redshift=1.6.7

Additional context

https://github.com/dbt-labs/dbt-project-evaluator/blob/main/macros/cross_db_shim/redshift_shims.sql
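
For reference, the shim in that file (as captured in the manifest output later in this thread) overrides the string type globally:

```sql
{%- macro redshift__type_string() -%}
  {{ "VARCHAR(600)" }}
{%- endmacro %}
```

Because the dispatch config lists dbt_project_evaluator before dbt, every call to dbt.type_string() in the project presumably resolves to this macro, which changes the compiled SQL of all models and makes them show up as state:modified.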

Are you interested in contributing the fix?

I can contribute, however I'm not sure how to fix it at this point.

b-per commented 3 months ago

Could you try without dispatch as well to see what you get?

I know that redshift__type_string() was required at some point but I don't know if it is still the case.

mweso-softserve commented 3 months ago

I tried that:

Database Error in model base_node_relationships (models/staging/graph/base/base_node_relationships.sql)
  value too long for type character varying(1)

I'm not sure why that is, though. Maybe it's some Redshift optimisation when casting nulls as TEXT, such as:

select 
    cast(null as TEXT) as resource_id,
    cast(null as TEXT) as direct_parent_id,
    cast(True as boolean) as is_primary_relationship

from dummy_cte
where false
  );

This is supposedly VARCHAR(256): https://docs.aws.amazon.com/redshift/latest/dg/r_Character_types.html
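
One way to check what type Redshift actually assigns (assuming a Redshift session and a scratch table; the table name is hypothetical) is:

```sql
-- Create a table from the null-cast select and inspect the resulting
-- column type in the system catalog.
create table tmp_type_check as
select cast(null as TEXT) as resource_id;

select "column", type
from pg_table_def
where tablename = 'tmp_type_check';
-- Per the behaviour described above, this reportedly shows
-- character varying(1) rather than the documented varchar(256).
```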

b-per commented 3 months ago

This might be slightly related to the other issue/PR #467 on Redshift.

Do you have jq installed? If not, could you install it and run the following (I'm on a Mac, but we could do something similar on Windows)

cat target/manifest.json | jq '.macros | .[] | select(.name | contains("type_string"))'

and copy the results here?

My results look like this:

{
  "name": "type_string",
  "resource_type": "macro",
  "package_name": "dbt",
  "path": "macros/utils/data_types.sql",
  "original_file_path": "macros/utils/data_types.sql",
  "unique_id": "macro.dbt.type_string",
  "macro_sql": "\n\n{%- macro type_string() -%}\n  {{ return(adapter.dispatch('type_string', 'dbt')()) }}\n{%- endmacro -%}\n\n",
  "depends_on": {
    "macros": [
      "macro.dbt.default__type_string"
    ]
  },
  "description": "",
  "meta": {},
  "docs": {
    "show": true,
    "node_color": null
  },
  "patch_path": null,
  "arguments": [],
  "created_at": 1719397302.694219,
  "supported_languages": null
}
{
  "name": "default__type_string",
  "resource_type": "macro",
  "package_name": "dbt",
  "path": "macros/utils/data_types.sql",
  "original_file_path": "macros/utils/data_types.sql",
  "unique_id": "macro.dbt.default__type_string",
  "macro_sql": "{% macro default__type_string() %}\n    {{ return(api.Column.translate_type(\"string\")) }}\n{% endmacro %}",
  "depends_on": {
    "macros": []
  },
  "description": "",
  "meta": {},
  "docs": {
    "show": true,
    "node_color": null
  },
  "patch_path": null,
  "arguments": [],
  "created_at": 1719397302.694326,
  "supported_languages": null
}
{
  "name": "redshift__type_string",
  "resource_type": "macro",
  "package_name": "dbt_project_evaluator",
  "path": "macros/cross_db_shim/redshift_shims.sql",
  "original_file_path": "macros/cross_db_shim/redshift_shims.sql",
  "unique_id": "macro.dbt_project_evaluator.redshift__type_string",
  "macro_sql": "{%- macro redshift__type_string() -%}\n  {{ \"VARCHAR(600)\" }}\n{%- endmacro %}",
  "depends_on": {
    "macros": []
  },
  "description": "",
  "meta": {},
  "docs": {
    "show": true,
    "node_color": null
  },
  "patch_path": null,
  "arguments": [],
  "created_at": 1719397302.810622,
  "supported_languages": null
}
mweso-softserve commented 3 months ago

Here's the result with custom dispatch defined:

{
  "name": "type_string",
  "resource_type": "macro",
  "package_name": "dbt",
  "path": "macros/utils/data_types.sql",
  "original_file_path": "macros/utils/data_types.sql",
  "unique_id": "macro.dbt.type_string",
  "macro_sql": "\n\n{%- macro type_string() -%}\n  {{ return(adapter.dispatch('type_string', 'dbt')()) }}\n{%- endmacro -%}\n\n",
  "depends_on": {
    "macros": [
      "macro.dbt_project_evaluator.redshift__type_string"
    ]
  },
  "description": "",
  "meta": {},
  "docs": {
    "show": true,
    "node_color": null
  },
  "patch_path": null,
  "arguments": [],
  "created_at": 1719921400.9395182,
  "supported_languages": null
}
{
  "name": "default__type_string",
  "resource_type": "macro",
  "package_name": "dbt",
  "path": "macros/utils/data_types.sql",
  "original_file_path": "macros/utils/data_types.sql",
  "unique_id": "macro.dbt.default__type_string",
  "macro_sql": "{% macro default__type_string() %}\n    {{ return(api.Column.translate_type(\"string\")) }}\n{% endmacro %}",
  "depends_on": {
    "macros": []
  },
  "description": "",
  "meta": {},
  "docs": {
    "show": true,
    "node_color": null
  },
  "patch_path": null,
  "arguments": [],
  "created_at": 1719921400.9397166,
  "supported_languages": null
}
{
  "name": "redshift__type_string",
  "resource_type": "macro",
  "package_name": "dbt_project_evaluator",
  "path": "macros/cross_db_shim/redshift_shims.sql",
  "original_file_path": "macros/cross_db_shim/redshift_shims.sql",
  "unique_id": "macro.dbt_project_evaluator.redshift__type_string",
  "macro_sql": "{%- macro redshift__type_string() -%}\n  {{ \"VARCHAR(600)\" }}\n{%- endmacro %}",
  "depends_on": {
    "macros": []
  },
  "description": "",
  "meta": {},
  "docs": {
    "show": true,
    "node_color": null
  },
  "patch_path": null,
  "arguments": [],
  "created_at": 1719921401.334528,
  "supported_languages": null
}

Here's how it looks to me: without the macro that redefines the Redshift string type for all models, the build fails because Redshift apparently optimises a cast like select cast(null as TEXT) as resource_id and internally represents the column as VARCHAR(1). This is a problem since the dummy select is used to define a table schema, and after the table is created another macro inserts data into it.

I don't think the right solution is to redefine what the Redshift string type is for all models. Unfortunately, it's currently impossible to scope dispatch to a subset of models or a single package, unless I missed something.
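
The failure mode can be sketched like this (table name and inserted value are hypothetical; assumes a Redshift session):

```sql
-- The schema is defined by an empty select; Redshift types the
-- null-cast column as varchar(1) rather than the default varchar(256).
create table base_node_relationships as
select cast(null as TEXT) as resource_id
where false;

-- A later insert of any string longer than one character then fails:
-- "value too long for type character varying(1)"
insert into base_node_relationships values ('model.my_project.my_model');
```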

The other issues you pointed to do indeed seem similar. The error thrown mentions varying(1), so it's strange that increasing the value in type_large_string.sql fixed it; I'd expect the problem there to be either a missing dispatch config, or an error mentioning values larger than 5000.

BTW, why are some macros in macros/unpack and some in macros/cross_db_shim?

mweso-softserve commented 3 months ago

I created PR https://github.com/dbt-labs/dbt-project-evaluator/pull/473. Integration tests fail, and it looks like the macro definition I proposed can cause infinite recursion if someone runs it with dispatch defined, using type_string(), on a database other than Redshift. I stumbled upon this thread: https://github.com/dbt-labs/dbt-core/issues/7103. Following that suggestion, I think the solution here could be to abandon the use of type_string() in the dbt_project_evaluator models in favour of calling api.Column.string_type(600) directly as the type definition.
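
A sketch of the proposed change (the model snippet is hypothetical; api.Column.string_type is dbt's adapter Column helper):

```sql
-- Before: dispatched macro, which the package shims for Redshift via
-- redshift__type_string, affecting every model in the project
select cast(null as {{ dbt.type_string() }}) as resource_id

-- After: ask the adapter's Column class for a sized string type
-- directly, avoiding the global override
select cast(null as {{ api.Column.string_type(600) }}) as resource_id
```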

I tested that against my local Redshift instance and it worked, but I'm not sure whether there are any unintended consequences of this approach. For example, the cons section for it mentions:

These are not suitable for direct comparison with column data types from metadata from the information_schema.columns view that many databases support.

What do you think?

mweso-softserve commented 3 months ago

I created a second PR, #475, with the call to api.Column.string_type(600) and special handling for Databricks and BigQuery. Known issues: https://github.com/dbt-labs/dbt-bigquery/issues/665 (BigQuery) and https://github.com/dbt-labs/dbt-spark/issues/715 (Databricks).

mweso-softserve commented 2 months ago

@b-per, please let me know when I can expect someone to evaluate this. My proposed solution changes how the string type is defined for the package's internal models without affecting any existing models, which I think is the way to go.