[CT-531] [Bug] More robust error messaging when keys are misspelled in yml files

saraleon1 commented 2 years ago

Is there an existing issue for this?

[X] I have searched the existing issues

Current Behavior

A PR jobs started failing with no clear error message available to the end user. The only error surfaced was:

Error message:
Running with dbt=0.21.0
* Deprecation Warning: The “packages” argument of adapter.dispatch() has been
deprecated. Use the “macro_namespace” argument instead.
Raised during dispatch for: type_string
For more information, see:
https://docs.getdbt.com/reference/dbt-jinja-functions/dispatch
Found 347 models, 448 tests, 0 snapshots, 0 analyses, 701 macros, 2 operations, 21 seed files, 302 sources, 0 exposures
Found a seed (<removed>) >1MB in size at the same path, dbt cannot tell if it has changed: assuming they are the same

upon inspection of debug logs, the following log line was identified:

2021-11-04 08:50:01.342766 (MainThread): None

and upon inspection of more robust logging available to dbt cloud support, this stack trace was identified:

Traceback (most recent call last):
  File "/usr/src/app/sinter/clients/dbt.py", line 1147, in call
    result, success = dbt_main.handle_and_check(
  File "/usr/local/lib/python3.8/dist-packages/dbt/main.py", line 205, in handle_and_check
    task, res = run_from_args(parsed)
  File "/usr/local/lib/python3.8/dist-packages/dbt/main.py", line 258, in run_from_args
    results = task.run()
  File "/usr/local/lib/python3.8/dist-packages/dbt/task/runnable.py", line 433, in run
    self._runtime_initialize()
  File "/usr/local/lib/python3.8/dist-packages/dbt/task/runnable.py", line 155, in _runtime_initialize
    self.job_queue = self.get_graph_queue()
  File "/usr/local/lib/python3.8/dist-packages/dbt/task/runnable.py", line 146, in get_graph_queue
    return selector.get_graph_queue(spec)
  File "/usr/local/lib/python3.8/dist-packages/dbt/graph/selector.py", line 297, in get_graph_queue
    selected_nodes = self.get_selected(spec)
  File "/usr/local/lib/python3.8/dist-packages/dbt/graph/selector.py", line 278, in get_selected
    selected_nodes, indirect_only = self.select_nodes(spec)
  File "/usr/local/lib/python3.8/dist-packages/dbt/graph/selector.py", line 178, in select_nodes
    direct_nodes, indirect_nodes = self.select_nodes_recursively(spec)
  File "/usr/local/lib/python3.8/dist-packages/dbt/graph/selector.py", line 149, in select_nodes_recursively
    bundles = [
  File "/usr/local/lib/python3.8/dist-packages/dbt/graph/selector.py", line 150, in <listcomp>
    self.select_nodes_recursively(component)
  File "/usr/local/lib/python3.8/dist-packages/dbt/graph/selector.py", line 149, in select_nodes_recursively
    bundles = [
  File "/usr/local/lib/python3.8/dist-packages/dbt/graph/selector.py", line 150, in <listcomp>
    self.select_nodes_recursively(component)
  File "/usr/local/lib/python3.8/dist-packages/dbt/graph/selector.py", line 149, in select_nodes_recursively
    bundles = [
  File "/usr/local/lib/python3.8/dist-packages/dbt/graph/selector.py", line 150, in <listcomp>
    self.select_nodes_recursively(component)
  File "/usr/local/lib/python3.8/dist-packages/dbt/graph/selector.py", line 147, in select_nodes_recursively
    direct_nodes, indirect_nodes = self.get_nodes_from_criteria(spec)
  File "/usr/local/lib/python3.8/dist-packages/dbt/graph/selector.py", line 104, in get_nodes_from_criteria
    collected = self.select_included(nodes, spec)
  File "/usr/local/lib/python3.8/dist-packages/dbt/graph/selector.py", line 89, in select_included
    return set(method.search(included_nodes, spec.value))
  File "/usr/local/lib/python3.8/dist-packages/dbt/graph/selector_methods.py", line 514, in search
    if checker(previous_node, real_node):
  File "/usr/local/lib/python3.8/dist-packages/dbt/graph/selector_methods.py", line 439, in check_modified
    upstream_macro_change = self.check_macros_modified(new)
  File "/usr/local/lib/python3.8/dist-packages/dbt/graph/selector_methods.py", line 435, in check_macros_modified
    return self.recursively_check_macros_modified(node, previous_macros)
  File "/usr/local/lib/python3.8/dist-packages/dbt/graph/selector_methods.py", line 419, in recursively_check_macros_modified
    macro_node = self.manifest.macros[macro_uid]
KeyError: None

From here, we compared git changes between this PR job and the previously successful PR job, and were able to identify 2 new sources, 3 new models, 3 new tests.

From inspection of these changes, it was identified that the source of the issue was misspellings in the new schema.yml files as follows:

not_null tests had the value “not null”
description was named “desciption"
columns was named “colums"

Which are misspelled invocations of a built in test, and 2 common yml keys. dbt Core should be able to surface misspellings like this more obviously to the end user

Expected Behavior

Helpful error messaging is surfaced to the end user when there are misspelling keys in the yml file

Steps To Reproduce

Misspell a key in a schema.yml file such as:
- not_null tests had the value “not null”
- description was named “desciption"
- columns was named “colums"
save changes and do a 'dbt run'
observe the opaque error messaging available

Relevant log output

No response

Environment

dbt Cloud
- dbt:0.21.0

What database are you using dbt with?

bigquery

Additional Context

No response

jtcohen6 commented 2 years ago

@saraleon1 Thanks for opening!

If the only error had been defining the test as not null instead of not_null, we would have raised a more helpful error message (still not amazing, but at least better):

Compilation Error in test not_model_a_id (models/schema.yml)
  'test_not' is undefined. This can happen when calling a macro that does not exist. Check for typos and/or install package dependencies with "dbt deps".

The misspelled desciption and colums are more insidious. It's pretty surprising that this doesn't raise a validation error:

version: 2
models:
  - name: model_a
    random_key: does nothing

@gshank I know we do basic validation for the models block when reading files, to make sure we're not setting parsing up for horrible failure later on. Do you have a sense of where we could validate to make sure that the keys in a node update block match our expected/supported keys? It's not like random_key is doing anything above; it doesn't appear as a node property in the manifest.

I haven't looked into this code extensively, but my sense is that we do one of the following:

Change a low-level class to no longer be extensible / supporting additional properties

Since unrecognized fields get popped into the _extra dict, raise a parsing error somewhere here? if there are unrecognized top-level keys:

    if block.target._extra:
        raise ParsingException(f"Extra keys not allowed! {block.target._extra}")

$ dbt --no-partial-parse test
16:35:29 | [ info  ] | Running with dbt=1.0.0-rc1
16:35:29 | [ error ] | Encountered an error:
Parsing Error
Extra keys not allowed! {'random_key': 'does nothing'}

I do think this could be a potential breaking change, in the sense of raising an error for existing projects that have cruft or misspelled properties hanging around. This feels like a great thing to sneak in for v1.0.0 if it would be simple, or a later minor version if it would be a much larger lift.

gshank commented 2 years ago

There's some minimal validation done in 'get_key_dicts', but we don't actually validate that there are only valid keys in the dictionaries anywhere; we basically set the values into the attributes when constructing objects, so we just ignore invalid keys. Better validation seems like a good idea. I'm not exactly sure what the best place and way to do it is. If we validate when everything is a dictionary, then we have to update the valid keys when a config or attribute is added, and it would have to be specific to the patch type. But validating as an object isn't going to work because we can't set attributes that don't exist. I think the best approach would be to have validation code and separate validation data for each entry type, and validate somewhere around the get_key_dicts, but with separate validation methods for each schema subparser.

jeremyyeo commented 2 years ago

Hey team, I've got a relevant one for you... if you have a malformed dbt_project.yml:

name: "my_dbt_project"
version: "1.0.0"
config-version: 2

profile: "snowflake"

model-paths: ["models"]
macro-paths: ["macros"]
snapshot-paths: ["snapshots"]

target-path: "target"
clean-targets:
  - "target"
  - "dbt_modules"
  - "dbt_packages"

# prettier-ignore
models:
  +materialized:table
    my_dbt_project:

A run in dbt CLI provides some helpful hints:

$ dbt run
10:00:23  Encountered an error:
Runtime Error
  Syntax error near line 20
  ------------------------------
  17 | # prettier-ignore
  18 | models:
  19 |   +materialized:table
  20 |     my_dbt_project:

  Raw Error:
  ------------------------------
  mapping values are not allowed in this context
    in "<unicode string>", line 20, column 19

But because no dbt logs are generated from the Runtime Error - we get this when looking at a run history in dbt Cloud:

You have no idea what's going on.

I suppose my questions are:

Is this pertinent to this ticket (misspellings in yaml) or is this its own issue (malformed yaml causing runtime errors generating no logs)?
Is this a dbt Cloud thing (because currently run history logs are effectively "dbt generated logs" only and not things outside of that realm - Python stack trace) or a dbt-core thing (because runtime errors generate no "dbt generated logs")? At some level I suppose it's both.

Thanks :)

jtcohen6 commented 2 years ago

@jeremyyeo I think the resolution for that discrepancy would be the proposal in #4836 — i.e. every exception that dbt-core throws should always also be a log message.

(@nathaniel-may Does that sound right to you, or am I misunderstanding that proposal?)

nathaniel-may commented 2 years ago

@jtcohen6 Yes, I imagine we would want to log errors before they are thrown whenever possible. I'm imagining having a single exception event type that wraps the parent exception to achieve this with minimal code.

nathaniel-may commented 2 years ago

estimation

option 1: we add some simple validation. ~3
option 2: we investigate making a more holistic design choice that is more future proof for expected keys. ~2 day spike to start.

will-sargent-dbtlabs commented 2 years ago

I would also like to add on the Key Validation front, that optional keys like "description" quietly fail completely if they are misspelled. No error or warning, just no results in the docs.

I spent about 20 minutes trying to see that I was missing the i in tion (descripton) and that's why it wasn't appearing in the docs, but running just fine. Maybe a warning for keys that are optional but not recognized?

github-actions[bot] commented 1 year ago

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

jtcohen6 commented 1 year ago

I was about to remove the stale label and add paper_cut, but I see that we already have a newer "paper cut" issue that's basically the same as this one: https://github.com/dbt-labs/dbt-core/issues/5605

I'm going to close this older one as a duplicate of that newer one. This is still very much a UX improvement that's on my mind!

mazinesy commented 11 months ago

:wave: I created a PR with tests to demonstrate the issue.

dbt-labs / dbt-core