jeremyyeo opened this issue 2 years ago
Here's the relevant issue and a fix we could try, which extends `SafeLoader`. Add the extended class to our `yaml_helper` client, then add `ValueError` to the following tuple, like so:

```python
except (yaml.scanner.ScannerError, yaml.YAMLError, ValueError) as e:
```
Then our `dbt test` (or `dbt docs generate`) should result in:
```
$ dbt test
04:10:34  Running with dbt=1.0.4
04:10:35  Encountered an error:
Parsing Error
  Error reading postgres: schema.yml - Runtime Error
    Duplicate 'models' key found in YAML.
```
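Putting the pieces together, here is a minimal runnable sketch of that proposal. The helper name `load_yaml_text` and the `RuntimeError` wrapping are illustrative assumptions, not dbt's actual `yaml_helper` code:

```python
import yaml


class SafeLoaderNoDuplicates(yaml.SafeLoader):
    """SafeLoader subclass that raises ValueError on duplicate mapping keys."""

    def construct_mapping(self, node, deep=False):
        seen = set()
        for key_node, _value_node in node.value:
            key = self.construct_object(key_node, deep=deep)
            if key in seen:
                raise ValueError(f"Duplicate {key!r} key found in YAML.")
            seen.add(key)
        return super().construct_mapping(node, deep)


def load_yaml_text(contents):
    # ValueError added to the except tuple, as suggested above; the
    # RuntimeError re-raise stands in for dbt's own error reporting.
    try:
        return yaml.load(contents, Loader=SafeLoaderNoDuplicates)
    except (yaml.scanner.ScannerError, yaml.YAMLError, ValueError) as e:
        raise RuntimeError(f"Error reading YAML - {e}") from e
```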
(Maybe a warning instead of an error, for starters? Then we could switch it to error-level, in a future minor version)
@jeremyyeo thanks for this amazing write up and suggested solution!
I agree with @jtcohen6 that a warning would be a good place to start here. It would need to be caught separately from the `yaml.scanner.ScannerError` and `yaml.YAMLError` since we want to handle it differently. `warn_or_error` could be leveraged on a custom exception we raise in the fix you mention above, as opposed to just a `ValueError`.
```python
class UniqueKeyLoader(yaml.SafeLoader):
    def construct_mapping(self, node, deep=False):
        mapping = set()
        for key_node, value_node in node.value:
            key = self.construct_object(key_node, deep=deep)
            if key in mapping:
                raise DuplicateYamlKeyException(f"Duplicate {key!r} key found in YAML.")
            mapping.add(key)
        return super().construct_mapping(node, deep)
```
Then define that new exception in `exceptions.py`.
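A self-contained sketch of that warning-first direction follows. The `print`-based warning is a stand-in for dbt's `warn_or_error`, and the fall back to `yaml.safe_load` (keeping last-key-wins behavior after warning) is an assumption for illustration:

```python
import yaml


class DuplicateYamlKeyException(Exception):
    """Custom exception so duplicate keys can be handled separately."""


class UniqueKeyLoader(yaml.SafeLoader):
    def construct_mapping(self, node, deep=False):
        mapping = set()
        for key_node, _value_node in node.value:
            key = self.construct_object(key_node, deep=deep)
            if key in mapping:
                raise DuplicateYamlKeyException(f"Duplicate {key!r} key found in YAML.")
            mapping.add(key)
        return super().construct_mapping(node, deep)


def load_yaml_warn_on_duplicates(contents):
    try:
        return yaml.load(contents, Loader=UniqueKeyLoader)
    except DuplicateYamlKeyException as exc:
        print(f"WARNING: {exc}")  # stand-in for dbt's warn_or_error
        return yaml.safe_load(contents)  # fall back to last-key-wins
```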
Additionally, it would be useful to add a test that checks we trigger a warning as expected. If/when we convert to an error, the test would need to be updated. All of our duplication tests live in `025_duplicate_model_tests`, but we are currently in the process of converting all of our tests to a more intuitive and consistent framework. The new test would live in `/dbt/tests/functional` in a new folder, possibly named `duplication`. It would need to be written in our new testing framework.
If you or another community member want to work on this I would be happy to help give some direction on writing the new test and any other questions there may be.
Thanks for the guidance @emmyoop + @jtcohen6. Will take a crack at raising a new warning instead. I'll possibly need help adding the test later, but will circle back on that.
At this point we've reverted the original pull request because we didn't have time to figure out a solution for anchor overrides, which were causing error messages for duplicate keys, even though anchor overrides are legal syntax and should not be considered duplicate keys.
It's still possible that we could come up with a solution, but it requires more research. Some possibilities are:

1. fixing this in pyyaml and submitting a pull request;
2. supporting a different yaml library that does correctly handle duplicate keys (ruamel has the same issue with spurious errors for anchor overrides);
3. expanding the code that looks for duplicate keys to handle anchor overrides correctly.
One possibility is `yaml.parse`, which does provide information on anchors and such. A fix using this would quite possibly make more sense as a pyyaml PR, though. See https://stackoverflow.com/questions/64460249/how-to-remove-duplicate-keys-in-yaml-file-automatically. `yaml.parse` returns information on anchors, so the linked code could possibly be modified to skip duplicates for anchor overrides.
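To illustrate that third option, here is a hedged sketch (not dbt code) that walks the `yaml.parse` event stream. At the event level a merge key (`<<`) and the keys that override its values appear as distinct keys, so legal anchor overrides should not be flagged, avoiding the spurious errors mentioned above:

```python
import yaml


def find_duplicate_keys(text):
    """Return (key, line) pairs for repeated mapping keys in a YAML document."""
    duplicates = []
    # One stack entry per open collection: [keys_seen, expecting_key] for
    # mappings, None for sequences (which have no keys).
    stack = []
    for event in yaml.parse(text):
        top = stack[-1] if stack else None
        if isinstance(event, (yaml.MappingStartEvent, yaml.SequenceStartEvent)):
            if isinstance(top, list) and not top[1]:
                top[1] = True  # this collection fills the parent's value slot
            stack.append([set(), True] if isinstance(event, yaml.MappingStartEvent) else None)
        elif isinstance(event, (yaml.MappingEndEvent, yaml.SequenceEndEvent)):
            stack.pop()
        elif isinstance(event, (yaml.ScalarEvent, yaml.AliasEvent)):
            if isinstance(top, list):
                if top[1]:  # this node occupies a key slot
                    key = getattr(event, "value", None)
                    if key in top[0]:
                        duplicates.append((key, event.start_mark.line + 1))
                    top[0].add(key)
                top[1] = not top[1]
    return duplicates
```

A merge-key override such as `<<: *defaults` followed by `a: 2` produces no report here, while a genuinely repeated top-level key does.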
This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.
Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment to notify the maintainers.
Hi @gshank, just re-opened this issue as it's still unresolved, and it would be really helpful if we could implement your suggestion. Just sharing an additional example of how this could impact users:

If you have a `sources.yml` file and within it you define a source multiple times (e.g. a developer doesn't realize the source is already defined, so adds another one with the same name), the last defined source overrides the earlier ones. So if you have:
```yaml
sources:
  - name: some_source
    database: some_database
    schema: some_schema
    tables:
      - name: ac
      - name: ab
      - name: abc
  - name: some_source
    database: some_database
    schema: some_schema
    tables:
      - name: ac
      - name: ab
```
With the above, you basically lose table `abc`.
Expected Behavior: We expect dbt to either:

- Flag Duplicate Sources: clearly indicate duplicate source definitions and prevent overwriting.
- Merge Duplicate Sources: combine table definitions from multiple sources with the same name, database, and schema.

Impact: This issue can lead to incorrect model definitions, unexpected errors, and wasted development time.
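As a sketch of the "merge duplicate sources" option, a hypothetical helper (not dbt's actual behavior) could combine entries sharing the same name, database, and schema, unioning their tables so `abc` is not lost:

```python
def merge_duplicate_sources(sources):
    """Combine sources sharing (name, database, schema), unioning their tables."""
    merged = {}
    for source in sources:
        key = (source.get("name"), source.get("database"), source.get("schema"))
        if key not in merged:
            # First time we see this source: copy it, with its own table list.
            merged[key] = {**source, "tables": list(source.get("tables", []))}
        else:
            # Duplicate definition: append only tables we haven't seen yet.
            seen = {t["name"] for t in merged[key]["tables"]}
            merged[key]["tables"].extend(
                t for t in source.get("tables", []) if t["name"] not in seen
            )
    return list(merged.values())
```

Run against the `sources.yml` example above, this yields a single `some_source` entry containing `ac`, `ab`, and `abc`.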
### Is there an existing issue for this?

### Current Behavior

If you have a schema file with repeated top-level keys:

Only the last `models` key will apply to the project when you execute dbt tasks that depend on them (like `test`, `docs generate`). We should raise an exception for such cases.

### Expected Behavior

Perhaps we should raise an error in this case instead of silently just including the last `models` key of the schema file in our project's manifest.

### Steps To Reproduce
1. Add a `schema.yml` file to your project along with 2 simple models:
2. `dbt run` to build our models.
3. `dbt test` and observe that we only tested `my_model_b`.
4. `dbt docs generate && dbt docs serve` and observe that only `my_model_b` has a description (`my_model_a` has no description).

### Relevant log output
### Environment

### What database are you using dbt with?

postgres

### Additional Context
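For additional context, plain PyYAML silently keeps only the last occurrence of a repeated top-level key, which is exactly the behavior described above (illustrative snippet, not dbt code; the descriptions are made up for the demo):

```python
import yaml

schema_yml = """
models:
  - name: my_model_a
    description: first block, silently dropped
models:
  - name: my_model_b
    description: second block, the only one kept
"""

parsed = yaml.safe_load(schema_yml)
# Only the second 'models' key survives; my_model_a's entry is gone.
print(parsed["models"])
```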