databrickslabs / ucx

Automated migrations to Unity Catalog

[FEATURE]: Build and display dataset lineage to partition/schedule code migrations more effectively #1415

Open rwforest opened 7 months ago

rwforest commented 7 months ago

Is there an existing issue for this?

Problem statement

Table mapping does not solve everything; chances are there will still be errors after migration. Since HMS lineage is available, UCX should target merging it with the dependencyGraph:

```mermaid
flowchart TD

storage_path -->|reads| view
storage_path -->|reads| table
storage_path -->|reads| notebook
storage_path -->|reads| py_file
storage_path -->|reads| redash_query
storage_path -->|reads| pipeline

table --> view
view --> table

table -->|reads| notebook
notebook -->|writes| table

table -->|reads| pipeline
pipeline -->|writes| table

table -->|reads| py_file
py_file -->|writes| table

table -->|reads| redash_query
redash_query -->|writes| table
redash_query --> dashboard
dashboard --> warehouse

table -->|reads| lakeview_dashboard
lakeview_dashboard --> warehouse

notebook --> pipeline
pipeline --> job
notebook --> job
wheel --> job

py_file --> job
py_file --> git_repo
py_file --> wheel

notebook --> git_repo
git_repo -->|?| job

cluster_policy --> cluster
cluster_policy --> job
job --> task
task --> cluster
cluster --> init_script

warehouse -.-> cluster
```

Proposed Solution

Merge HMS lineage with the dependencyGraph. While runtime lineage capture depends on the DBR version, the work should start with the highest runtime and then backfill anything that is not captured using other means, such as static lineage parsing or a Spark listener.
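A minimal sketch of the merge step, assuming edges from both sources are already extracted as `(source, target, relation)` tuples and using `networkx` for the graph; the function name and edge shape are hypothetical, not part of UCX today:

```python
import networkx as nx

def merge_lineage(hms_edges, static_edges):
    """Merge runtime (HMS) lineage with statically linted lineage.

    Each edge is a (source, target, relation) tuple, e.g.
    ("hive_metastore.sales.orders", "/Repos/etl/load.py", "reads").
    Runtime edges take precedence; static edges backfill the gaps.
    """
    graph = nx.DiGraph()
    for src, dst, rel in hms_edges:
        graph.add_edge(src, dst, relation=rel, origin="hms")
    for src, dst, rel in static_edges:
        # Backfill only what runtime lineage did not capture.
        if not graph.has_edge(src, dst):
            graph.add_edge(src, dst, relation=rel, origin="static")
    return graph
```

Tagging each edge with its `origin` keeps it cheap to audit later which parts of the graph came from the less reliable static pass.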

Scope:

| asset | has_owner_user | listing speed | AST analysis required |
|---|---|---|---|
| storage path | no | slow (via AST analysis) | yes |
| view | no | medium, via tables.scala | yes |
| table | no | medium, via tables.scala | no |
| pipeline | yes | fast ** | yes |
| notebook | yes | slow, via workflow linter | yes |
| wheel | no | slow, via linter | yes |
| job | yes | fast | no |
| cluster | yes | fast | no |
| cluster_policy | yes | fast | no |
| git_repo | no | - | no |
| py_file | no | slow, via workflow linter | yes |
| redash query | yes | medium | yes |
| redash dashboard | yes | medium | no |
| lakeview dashboard | yes | fast | yes |
| warehouse | yes | fast | no |

Optionally, we can create multiple copies of the same graph, each starting from a single table, to show that table's full migration scope (see the sketch below).
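A minimal sketch of that per-table view, assuming the merged `graph` from the previous snippet; everything transitively reachable from a table is that table's migration scope:

```python
import networkx as nx

def migration_scope(graph: nx.DiGraph, table: str) -> nx.DiGraph:
    """Return the subgraph of all assets transitively affected by `table`.

    Raises networkx.NetworkXError if `table` is not a node in the graph.
    """
    affected = nx.descendants(graph, table) | {table}
    return graph.subgraph(affected).copy()
```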

Outcome:

Additional Context

No response

nfx commented 7 months ago

@rwforest The request is not clear; could you elaborate?

rwforest commented 7 months ago

@nfx When doing table replacement, mapping source to target tables is only the first pass. How do we guarantee that anything will run successfully? That is the goal of the migration. So the goal here is, for example, given 1000 notebooks, to be able to understand which chains of commands are the most critical. With the dependencyGraph we can check the incoming and outgoing edges: if we know a notebook is an orphan, we don't attempt to fix it. Right now, even if you fix all the cells, that is only the beginning of the errors that will surface eventually.
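A minimal sketch of that orphan check, assuming the merged `graph` from the earlier snippets and a hypothetical `kind` node attribute marking notebook nodes:

```python
import networkx as nx

def orphan_notebooks(graph: nx.DiGraph) -> list[str]:
    """Notebooks with no incoming or outgoing edges can be deprioritized."""
    return [
        node
        for node, attrs in graph.nodes(data=True)
        if attrs.get("kind") == "notebook" and graph.degree(node) == 0
    ]
```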

JCZuurmond commented 4 months ago

@rwforest : Could you clarify the following:

  1. What is the HMS dependency graph?
  2. Why is this dependent on the Databricks runtime?

Generally speaking, I would like the capability to plan code migrations using UCX based on a dependency graph of the (to-be-migrated) tables, linted from the code.

rwforest commented 4 months ago

@JCZuurmond there's a feature in Databricks that never made it to public; I think it was for building a runtime dataset dependency graph, and I was told by @FastLee that it depends on the DBR version. I believe the linted code gives static lineage, but I have some notebooks that are heavily parameterized.

And I agree on the planning part. Is there a roadmap for some planning UI? I can't imagine how someone would plan a code migration using a CSV.

JCZuurmond commented 4 months ago

A planning UI is not on the roadmap. We are considering including this issue in our upcoming planning, but no decision has been made yet.