databrickslabs / dbx

🧱 Databricks CLI eXtensions - aka dbx is a CLI tool for development and advanced Databricks workflows management.
https://dbx.readthedocs.io
Other
443 stars 122 forks source link

`dbx deploy`: `named_parameters` dictionary gets an extraneous `"named_parameters"` entry when renaming `file:` paths #855

Open angus-dawson-idexx opened 1 year ago

angus-dawson-idexx commented 1 year ago

Expected Behavior

A deployment with dbx deploy [...] --write-specs-to-file=spec.json with a named_parameters definition like so:

named_parameters:
  conf-file: "file:fuse://path/to/some-file.yaml"

Should result in a specs file with the following named_parameters object:

"named_parameters": {
  "conf-file": "/dbfs/[...]/artifacts/path/to/some-file.yaml",
}

And only conf-file should be created as a keyword argument parameter in the workflow task on the web GUI.

Current Behavior

Instead, I get this:

"named_parameters": {
  "conf-file": "/dbfs/[...]/artifacts/path/to/some-file.yaml",
  "named_parameters": "/dbfs/[...]/artifacts/path/to/some-file.yaml"
}

And the workflow task on the web GUI ends up with two keyword argument parameters: conf-file and named_parameters.

Steps to Reproduce (for bugs)

Run a dbx deploy with a workflow that has python_wheel_task tasks with named_parameters, at least one of which has a value that starts with file:// or file:fuse://. Then check the keyword argument parameters of the task in the web GUI.

Context

I've traced the problem to this function: https://github.com/databrickslabs/dbx/blob/34bd186956915d70792621faa1a8bc38f38a9d41/dbx/api/adjuster/adjuster.py#L165-L169

And this part in PropertyAdjuster.traverse(): https://github.com/databrickslabs/dbx/blob/34bd186956915d70792621faa1a8bc38f38a9d41/dbx/api/adjuster/adjuster.py#L43-L48

After yielding the correct tuple for conf-file:

('file:fuse://path/to/config.yaml', {'conf-file': 'file:fuse://path/to/config.yaml'}, 'conf-file')

It then attempts to traverse item with index_in_parent, which is named_parameters, and since item is a string, traverse jumps here and terminates: https://github.com/databrickslabs/dbx/blob/34bd186956915d70792621faa1a8bc38f38a9d41/dbx/api/adjuster/adjuster.py#L66

And yields essentially a duplicate tuple except with the wrong index_in_parent:

('file:fuse://path/to/config.yaml', {'conf-file': 'file:fuse://path/to/config.yaml'}, 'named_parameters')

Your Environment

angus-dawson-idexx commented 1 year ago

If the call in line 47 is changed to self.traverse(item, _object, key), you still get a duplicate tuple, but at least the semantics of index_in_parent in that generator are correct and you just end up setting the same property twice, which is fine. I'm not sure if that ends up breaking other traversal cases though, or if there's a better fix to avoid visiting that element twice in the first place.