databrickslabs / ucx

Automated migrations to Unity Catalog

[FEATURE]: Implement `pathlib.PurePath` interface for DBFS files, so that we can load workflow's `SparkPythonTask` files #1558

Closed: nfx closed this issue 1 week ago

nfx commented 3 months ago


Problem statement

From the SDK documentation for the `python_file` field of `SparkPythonTask`:

> The Python file to be executed. Cloud file URIs (such as dbfs:/, s3:/, adls:/, gcs:/) and workspace paths are supported. For Python files stored in the Databricks workspace, the path must be absolute and begin with /. For files stored in a remote repository, the path must be relative. This field is required.

https://databricks-sdk-py.readthedocs.io/en/latest/dbdataclasses/jobs.html#databricks.sdk.service.jobs.SparkPythonTask
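For illustration, the SDK dataclass carries this path as a plain string, so the scheme prefix is the only way to tell a DBFS file from a workspace file (the concrete file names below are made up):

```python
from databricks.sdk.service.jobs import SparkPythonTask

# DBFS-backed task: only the dbfs:/ prefix distinguishes it from a workspace path.
dbfs_task = SparkPythonTask(python_file="dbfs:/mnt/jobs/etl/main.py")

# Workspace-backed task: an absolute path with no scheme.
ws_task = SparkPythonTask(python_file="/Repos/someone@example.com/project/main.py")
```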

Proposed Solution

Create an abstraction similar to databricks.labs.ucx.mixins.wspath.WorkspacePath for loading files from DBFS, focusing on the open(...) method. We already have most of the DBFS APIs mapped out in databricks.sdk.mixins.files.DbfsExt, accessible through the workspace client.
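A minimal sketch of what this could look like, assuming `DbfsExt.open(..., read=..., write=..., overwrite=...)` and `DbfsExt.exists(...)` keep their current signatures; the `DbfsPath` class and its `from_uri(...)` factory are hypothetical names, not existing ucx code:

```python
import pathlib
from typing import BinaryIO

from databricks.sdk import WorkspaceClient


class DbfsPath(pathlib.PurePosixPath):
    """PurePath-like view of a dbfs:/ file with just enough I/O to load a
    SparkPythonTask file, mirroring the WorkspacePath pattern."""

    _ws: WorkspaceClient  # attached after construction, see from_uri()

    @classmethod
    def from_uri(cls, ws: WorkspaceClient, uri: str) -> "DbfsPath":
        # "dbfs:/a/b.py" and "/a/b.py" name the same DBFS file, so drop the scheme.
        path = cls(uri.removeprefix("dbfs:"))
        path._ws = ws
        return path

    def open(self, mode: str = "rb") -> BinaryIO:
        # Delegate to the DBFS mixin (DbfsExt), exposed as WorkspaceClient.dbfs.
        if "w" in mode:
            return self._ws.dbfs.open(str(self), write=True, overwrite=True)
        return self._ws.dbfs.open(str(self), read=True)

    def exists(self) -> bool:
        return self._ws.dbfs.exists(str(self))
```

With that in place, loading a task's file reduces to `DbfsPath.from_uri(ws, task.python_file).open()` plus a read, and pure-path operations such as `.name` and `.suffix` come for free from `PurePosixPath` (derived paths like `.parent` would not carry the client in this sketch).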

Additional Context

No response

asnare commented 1 month ago

This is the location where this is needed: https://github.com/databrickslabs/ucx/blob/69acd4a5eabb4cb2707b1de194502540725ba938/src/databricks/labs/ucx/source_code/jobs.py#L167
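If it helps, the call site could presumably branch on the scheme along these lines. This is illustrative only: the helper name is made up, `DbfsPath` is the hypothetical class sketched above, and the `WorkspacePath` constructor is assumed to take the workspace client plus a path:

```python
from databricks.sdk import WorkspaceClient
from databricks.labs.ucx.mixins.wspath import WorkspacePath  # the existing mixin named in the issue


def path_for_python_file(ws: WorkspaceClient, python_file: str):
    """Pick a path abstraction for a SparkPythonTask.python_file value (sketch)."""
    if python_file.startswith("dbfs:/"):
        return DbfsPath.from_uri(ws, python_file)  # hypothetical class from the sketch above
    # Workspace paths are absolute and carry no scheme.
    return WorkspacePath(ws, python_file)
```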