dbt-labs / dbt-core

dbt enables data analysts and engineers to transform their data using the same practices that software engineers use to build applications.
https://getdbt.com
Apache License 2.0

[CT-520] [Enhancement] Schema files with repeated top level keys (e.g. `models`) result in only the last key being parsed #5114

Open · jeremyyeo opened this issue 2 years ago

jeremyyeo commented 2 years ago

Is there an existing issue for this?

Current Behavior

If you have a schema file with repeated top level keys:

# models/schema.yml
version: 2
models:
  - name: my_model_a
    description: "This is a"
    columns:
      - name: user_name
        tests:
          - not_null
models:
  - name: my_model_b
    description: "This is b"
    columns:
      - name: user_name
        tests:
          - not_null

Only the last models key will apply to the project when you execute dbt tasks that depend on those properties (like test or docs generate). We should raise an exception for such cases.
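This is stock PyYAML behavior rather than anything dbt-specific: with SafeLoader, a repeated mapping key silently overwrites the earlier value. A quick standalone check (plain pyyaml, outside of dbt):

import yaml

doc = """
version: 2
models:
  - name: my_model_a
models:
  - name: my_model_b
"""

# SafeLoader keeps only the last value for the repeated 'models' key,
# so my_model_a's entry is silently dropped.
print(yaml.safe_load(doc))
# {'version': 2, 'models': [{'name': 'my_model_b'}]}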

Expected Behavior

Perhaps we should raise an error in this case instead of silently including only the last models key of the schema file in our project's manifest.

Steps To Reproduce

  1. Add the above schema.yml file to your project along with 2 simple models:
-- models/my_model_a.sql
select 1 as user_name
-- models/my_model_b.sql
select 1 as user_name
  2. Do a dbt run to build our models.
  3. Do a dbt test and observe that we only tested my_model_b.
  4. Do a dbt docs generate && dbt docs serve and observe that only my_model_b has a description (my_model_a has no description).

Relevant log output

$ dbt test
04:02:08  Running with dbt=1.0.4
04:02:09  Found 2 models, 1 test, 0 snapshots, 0 analyses, 165 macros, 0 operations, 0 seed files, 0 sources, 0 exposures, 0 metrics
04:02:09  
04:02:09  Concurrency: 1 threads (target='dev')
04:02:09  
04:02:09  1 of 1 START test not_null_my_model_b_user_name................................. [RUN]
04:02:09  1 of 1 PASS not_null_my_model_b_user_name....................................... [PASS in 0.07s]
04:02:09  
04:02:09  Finished running 1 test in 0.24s.
04:02:09  
04:02:09  Completed successfully
04:02:09  
04:02:09  Done. PASS=1 WARN=0 ERROR=0 SKIP=0 TOTAL=1

Environment

- OS: macOS
- Python: 3.9.7
- dbt: 1.0.4

What database are you using dbt with?

postgres

Additional Context


jeremyyeo commented 2 years ago

Here's the relevant issue and a fix we could try which extends SafeLoader.

Add the extended class to our yaml_helper client, then add ValueError to the following exception tuple:

https://github.com/dbt-labs/dbt-core/blob/37b8b65aadb811c6f56c81537da48a463857b0d9/core/dbt/clients/yaml_helper.py#L57

Like so:

    except (yaml.scanner.ScannerError, yaml.YAMLError, ValueError) as e:
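For illustration, the other half of the change is passing the extended loader at the load call. A minimal sketch (load_schema_yaml and loader_cls are placeholder names, not the actual yaml_helper layout):

import yaml

def load_schema_yaml(contents, loader_cls):
    # Sketch only: loader_cls would be the SafeLoader subclass described above,
    # and ValueError joins the exceptions the helper already handles.
    try:
        return yaml.load(contents, Loader=loader_cls)
    except (yaml.scanner.ScannerError, yaml.YAMLError, ValueError) as exc:
        # dbt would wrap this into its usual parsing error; re-raising keeps the sketch short
        raise RuntimeError(f"Error reading schema.yml: {exc}") from exc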

Then our dbt test / dbt docs generate should result in:

$ dbt test
04:10:34  Running with dbt=1.0.4
04:10:35  Encountered an error:
Parsing Error
  Error reading postgres: schema.yml - Runtime Error
    Duplicate 'models' key found in YAML.

jtcohen6 commented 2 years ago

(Maybe a warning instead of an error, for starters? Then we could switch it to error-level in a future minor version.)

emmyoop commented 2 years ago

@jeremyyeo thanks for this amazing write up and suggested solution!

I agree with @jtcohen6 that a warning would be a good place to start here. It would need to be caught separately from the yaml.scanner.ScannerError and yaml.YAMLError, since we want to handle it differently. warn_or_error could be leveraged with a custom exception raised in the fix you mention above, as opposed to a plain ValueError.

import yaml

# DuplicateYamlKeyException is the new exception to be defined in exceptions.py (see below)

class UniqueKeyLoader(yaml.SafeLoader):
    def construct_mapping(self, node, deep=False):
        # Track every key in this mapping and raise as soon as one repeats,
        # instead of letting the last value silently win.
        mapping = set()
        for key_node, value_node in node.value:
            key = self.construct_object(key_node, deep=deep)
            if key in mapping:
                raise DuplicateYamlKeyException(f"Duplicate {key!r} key found in YAML.")
            mapping.add(key)
        return super().construct_mapping(node, deep=deep)

Then define that new exception in exceptions.py.
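For concreteness, a minimal sketch of those two pieces together (the CompilationException base class and the warn_or_error signature are assumptions about dbt-core 1.x internals, not the final implementation; UniqueKeyLoader is the class from the snippet above):

import yaml
from dbt.exceptions import CompilationException, warn_or_error  # names assumed from dbt-core 1.x

class DuplicateYamlKeyException(CompilationException):
    """Raised by the duplicate-key-aware loader so it can be downgraded to a warning."""

def load_yaml_text(contents):
    # Catch the new exception separately from ScannerError/YAMLError so that,
    # for now, a duplicate key warns and then falls back to today's
    # last-key-wins behavior instead of failing the parse.
    try:
        return yaml.load(contents, Loader=UniqueKeyLoader)
    except DuplicateYamlKeyException as exc:
        warn_or_error(f"{exc} Only the last entry will take effect.")
        return yaml.load(contents, Loader=yaml.SafeLoader)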

Additionally, it would be useful to add a test that checks we trigger the warning as expected. If/when we convert to an error, the test would need to be updated. All of our duplication tests live in 025_duplicate_model_tests, but we are currently in the process of converting all of our tests to a more intuitive and consistent framework. The new test would live in /dbt/tests/functional in a new folder, possibly named duplication, and would need to be written in our new testing framework.
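To make that more concrete, here is a rough sketch of what such a functional test could look like (the models/project fixtures and run_dbt come from dbt's new pytest framework; the assertion on the warning itself is left as a comment, since it depends on how the warning ends up being emitted):

import pytest
from dbt.tests.util import run_dbt

duplicate_key_schema_yml = """
version: 2
models:
  - name: my_model_a
    description: "This is a"
models:
  - name: my_model_b
    description: "This is b"
"""

class TestDuplicateTopLevelKey:
    @pytest.fixture(scope="class")
    def models(self):
        return {
            "schema.yml": duplicate_key_schema_yml,
            "my_model_a.sql": "select 1 as user_name",
            "my_model_b.sql": "select 1 as user_name",
        }

    def test_duplicate_key_warns(self, project):
        # Today this parses silently and only my_model_b's properties apply;
        # once the warning exists, assert here that it was emitted.
        results = run_dbt(["run"])
        assert len(results) == 2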

If you or another community member want to work on this I would be happy to help give some direction on writing the new test and any other questions there may be.

jeremyyeo commented 2 years ago

Thanks for the guidance @emmyoop + @jtcohen6. Will take a crack at raising a new warning instead. Possibly will need help adding the test later but will circle back on that.

gshank commented 2 years ago

At this point we've reverted the original pull request because we didn't have time to figure out a solution for anchor overrides, which were causing error messages for duplicate keys, even though anchor overrides are legal syntax and should not be considered duplicate keys.

It's still possible that we could come up with a solution, but it requires more research. Some possibilities are:

  1. Fixing this in pyyaml and submitting a pull request.
  2. Supporting a different yaml library that correctly handles duplicate keys (ruamel has the same issue with spurious errors for anchor overrides).
  3. Expanding the code that looked for duplicate keys to handle anchor overrides correctly.

One possibility is yaml.parse, which does return information on anchors and aliases, so the code linked below could potentially be modified to skip duplicates that come from anchor overrides. A fix using this approach would quite possibly make more sense as a pyyaml PR, though. https://stackoverflow.com/questions/64460249/how-to-remove-duplicate-keys-in-yaml-file-automatically
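For what it's worth, a rough sketch of that event-based idea (it only flags repeated top-level keys and assumes the document root is a mapping; the anchor-override special-casing described above is not implemented here, but yaml.parse at least keeps anchor/alias information visible on the events):

import yaml

def duplicate_root_keys(text):
    # Walk the event stream rather than constructing values; anchors and aliases
    # show up as distinct events here, so they could be special-cased later.
    seen, dupes = set(), []
    depth = 0           # nesting depth of mappings/sequences
    next_is_key = False
    for event in yaml.parse(text):
        if isinstance(event, yaml.MappingStartEvent):
            depth += 1
            if depth == 1:
                next_is_key = True  # entering the root mapping: a key comes first
        elif isinstance(event, yaml.SequenceStartEvent):
            depth += 1
        elif isinstance(event, (yaml.MappingEndEvent, yaml.SequenceEndEvent)):
            depth -= 1
            if depth == 1:
                next_is_key = True  # a nested value just closed; back to a root key
        elif depth == 1 and isinstance(event, (yaml.ScalarEvent, yaml.AliasEvent)):
            if next_is_key:
                key = getattr(event, "value", None)
                if key in seen:
                    dupes.append(key)
                seen.add(key)
                next_is_key = False  # the next node at this depth is the value
            else:
                next_is_key = True   # scalar/alias value consumed; next is a key
    return dupes

print(duplicate_root_keys("version: 2\nmodels: [a]\nmodels: [b]\n"))  # ['models']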

github-actions[bot] commented 1 year ago

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

github-actions[bot] commented 1 year ago

Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment to notify the maintainers.

Tonayya commented 3 days ago

Hi @gshank, just re-opened this issue as it's still unresolved; it would be really helpful if we could implement your suggestion. Just sharing an additional example of how this could impact users:

If you have a sources.yml file and within it you define a source multiple times (e.g. a developer doesn't realize the source is already defined and adds another one with the same name), the last defined source overrides the earlier ones. So if you have something like:

sources:
  - name: some_source
    database: some_database
    schema: some_schema
    tables:
      - name: ac
      - name: ab
      - name: abc
  - name: some_source
    database: some_database
    schema: some_schema
    tables:
      - name: ac
      - name: ab

With the above, you basically lose table abc.

Expected Behavior: We expect dbt to either:

- Flag duplicate sources: clearly indicate duplicate source definitions and prevent overwriting.
- Merge duplicate sources: combine table definitions from multiple sources with the same name, database, and schema.

Impact: This issue can lead to incorrect model definitions, unexpected errors, and wasted development time.