dbt-labs / dbt-core

dbt enables data analysts and engineers to transform their data using the same practices that software engineers use to build applications.
https://getdbt.com
Apache License 2.0
9.83k stars 1.62k forks source link

[CT-634] Node configs: tech debt #5236

Open jtcohen6 opened 2 years ago

jtcohen6 commented 2 years ago

Context

Problem

We need to better organize our tightly bundled NodeConfig. Many of these configs are only relevant to "executable" nodes (models, snapshots, seeds, and tests IFF store_failures is enabled). Some of them are only relevant to incremental models in particular. Analyses have no business with database/schema/alias.

Existing proposals

Incremental models. Many many configs are really just relevant to the incremental model materialization. As discussed in #5089: It feels a little icky that configs like unique_key and on_schema_change appear in the config for every node, when they're really just relevant to incremental models in particular. Could we isolate these to appear on incremental models only? This could be:

Metadata interface. Every time we add a new attribute to NodeConfig, it changes the manifest.json artifact schema. As discussed in https://github.com/dbt-labs/dbt-core/issues/4617: Given that users can supply whatever configs they want, to any node type that accepts config, should we even consider this a contracted metadata interface? I think the external contract for node.config should really be Dict[str, Any], and we should always be in the habit of "promoting" configs that are relevant and reliable to downstream metadata consumers as top-level node keys. Similar to what we did with node_info on logging events, which "promotes" config.materialized into node_info.materialized.

Adapter-specific configs (sort of discussed in https://github.com/dbt-labs/dbt-core/issues/4622): I'm actually not sure if AdapterConfig works at all today. My sense is that their keys don't show up in the manifest unless set, their types aren't really validated until used (at runtime), and their default values don't get passed through. The code here is kinda jank, to put it nicely.

leahwicz commented 2 years ago

@emmyoop @jtcohen6 could you work together to flesh this out some more? It can be a spike if we need some digging in and designing on what we want to do here