databrickslabs / ucx

Automated migrations to Unity Catalog
Other
240 stars 86 forks source link

[FEATURE] Create a `ucx.workflow_runs` table #2600

Closed JCZuurmond closed 2 months ago

JCZuurmond commented 2 months ago

Summary

Create a ucx.workflow_runs table to persist UCX workflow runs, e.g. assessment or migration-process.

Description

Table schema

Column name Data Type Comment
started_at dt.datetime The timestamp of the workflow run start.
finished_at dt.datetime The timestamp of the workflow run end.
workspace_id int The workspace id in which the workflow job ran.
workflow_name str The workflow name that ran, e.g. "migration-progress" or "assessment"
workflow_id int The workflow id of the workflow that ran.
workflow_run_id int The workflow run id.
run_as str The identity the workflow was run as`
status str The workflow status
JCZuurmond commented 2 months ago

@nfx and @asnare : In this table, we want to persist when a crawler did not return any objects to differentiate between a crawler not ran and no objects returned. How do you want to persist this information, i.e. what column schema?

JCZuurmond commented 2 months ago

Also, do we need to keep this in the ucx catalog? Or should we keep this in the hive_metastore?

ucx hive_metastore
Pro Contains data over multiple workspace scans to which this is relevant. Available during assessment
Con Growing the new ucx catalog; we might want to be conservative with this. Assessment should only be ran once, therefore irrelevant as it is only used when rerunning.
asnare commented 2 months ago

When we discussed this at the office the intent was: This will be in the ucx catalog and not the hive_metastore. (So yes, it's not available during assessment.)

I think a technical driver for this choice is that this table will be updated quite often with small updates, and that will behave better on the ucx catalog.

The purpose of the table is also to allow for interpretation of the history table when a crawl produces no records. Without this table we won't be able to handle that situation.

I was going to update the snapshot() functionality of the crawlers to use this table as an optimisation: if the loader returns 0 rows, consult this table if available to determine whether to actually return 0 rows or (as is currently the case) perform a crawl.

JCZuurmond commented 2 months ago

Updating the table above according to: https://github.com/databrickslabs/ucx/pull/2744#discussion_r1775151664

JCZuurmond commented 2 months ago

The workflow_run_id is to be used for joining with historical records, create a view to pre-join these. i think you called it "snapshot" or something:

WITH level0 AS (
  SELECT workspace_id, workflow_id, MAX(finished_at) AS finished_at FROM workflow_runs GROUP BY 1,2
), level1 AS (
  SELECT * FROM level0 LEFT JOIN workflow_runs USING (workspace_id, workflow_id, finished_at)
)
SELECT * FROM level1
asnare commented 2 months ago

The specification of this feature at the moment is insufficient, because we can have situations where a workflow is associated with more than one crawls of an inventory table; the current requirements assume it's a 1:{0,1} relationship but really it's 1:n.

This happens with the migration_status table:

Once the history table is available, all crawls need to be captured or it becomes inconsistent and that will lead to problems.

As such:

asnare commented 2 months ago

It's also worth noting that one of the purposes of this table is to detect situations where a crawl produced no records: in this case the current view of the snapshot needs to be empty.

JCZuurmond commented 2 months ago

@nfx : Could you triage?

nfx commented 2 months ago

@asnare answered in https://github.com/databrickslabs/ucx/pull/2743#pullrequestreview-2341125401

asnare commented 2 months ago

Okay, I think I'm now understanding where some of the confusion has arisen. Based on the discussions when originally designing the feature, I have been assuming that a goal of the history journal is to capture each update for a specific subset of the inventory tables. However based on these answers the code snippets from https://github.com/databrickslabs/ucx/pull/2743#discussion_r1783326822, it seems that the intention is:

Somewhat related:

@nfx: Is this correct?

nfx commented 2 months ago

The journal and workflow runs tables are intended to only capture snapshot data produced when crawling as part of the experimental-migration-progress workflow.

correct

The implementation will be as a set of wrappers around the crawlers; the crawlers themselves will be unaware of the history mechanism.

this seems to be the least invasive option, otherwise we have issues with DFSA & migration status

The journal content will not, as a design principle, be consistent with the content of the inventory tables: in particular we accept that inventory updates performed by other workflows (including table-migration and experimental-workflow-linter) are not captured by the history table.

we just need to modify tasks to use history log from those workflows and then journal is consistent. this way it's explicit

There is some technical debt relating to detection of intentionally-empty-vs-uncrawled inventory tables, which in particular affect the DFSA crawlers. We had discussed (partially) addressing this using the history/runs tables. At present however this will not be the case.

every workflow invocation we'll have the record in runs table and empty crawl result would fit well

asnare commented 2 months ago

I think we're getting closer here.

The journal content will not, as a design principle, be consistent with the content of the inventory tables: in particular we accept that inventory updates performed by other workflows (including table-migration and experimental-workflow-linter) are not captured by the history table.

we just need to modify tasks to use history log from those workflows and then journal is consistent. this way it's explicit

Inside the table-migration workflow we have a task (migrate-views) that internally forces repeated snapshots of the migration_status inventory. Irrespective of whether we explicitly capture into the history or not, with the proposed design we cannot both:

Options that spring to mind are:

@nfx: Have I missed anything here?

nfx commented 2 months ago

Relax the consistency requirement. This gets us closest to the proposed design by letting us explicitly capture only at the task level, and ignore the intermediate (internal) snapshots that this workflow performs.

this is the preferred option