Closed JCZuurmond closed 2 months ago
@nfx and @asnare : In this table, we want to persist when a crawler did not return any objects to differentiate between a crawler not ran and no objects returned. How do you want to persist this information, i.e. what column schema?
Also, do we need to keep this in the ucx
catalog? Or should we keep this in the hive_metastore
?
ucx |
hive_metastore |
|
---|---|---|
Pro | Contains data over multiple workspace scans to which this is relevant. | Available during assessment |
Con | Growing the new ucx catalog; we might want to be conservative with this. | Assessment should only be ran once, therefore irrelevant as it is only used when rerunning. |
When we discussed this at the office the intent was: This will be in the ucx
catalog and not the hive_metastore
. (So yes, it's not available during assessment.)
I think a technical driver for this choice is that this table will be updated quite often with small updates, and that will behave better on the ucx
catalog.
The purpose of the table is also to allow for interpretation of the history table when a crawl produces no records. Without this table we won't be able to handle that situation.
I was going to update the snapshot()
functionality of the crawlers to use this table as an optimisation: if the loader returns 0 rows, consult this table if available to determine whether to actually return 0 rows or (as is currently the case) perform a crawl.
Updating the table above according to: https://github.com/databrickslabs/ucx/pull/2744#discussion_r1775151664
The workflow_run_id
is to be used for joining with historical records, create a view to pre-join these. i think you called it "snapshot" or something:
WITH level0 AS (
SELECT workspace_id, workflow_id, MAX(finished_at) AS finished_at FROM workflow_runs GROUP BY 1,2
), level1 AS (
SELECT * FROM level0 LEFT JOIN workflow_runs USING (workspace_id, workflow_id, finished_at)
)
SELECT * FROM level1
The specification of this feature at the moment is insufficient, because we can have situations where a workflow is associated with more than one crawls of an inventory table; the current requirements assume it's a 1:{0,1} relationship but really it's 1:n.
This happens with the migration_status
table:
experimental-migration-progress
workflow.table-migration
workflow.Once the history table is available, all crawls need to be captured or it becomes inconsistent and that will lead to problems.
As such:
snapshot_id
column, for joining against the snapshot_id
column on the history
table.workflow_id
column. Without this the view will return the records for all the crawls within the latest workflow run, not just the latest crawl.It's also worth noting that one of the purposes of this table is to detect situations where a crawl produced no records: in this case the current view of the snapshot needs to be empty.
@nfx : Could you triage?
@asnare answered in https://github.com/databrickslabs/ucx/pull/2743#pullrequestreview-2341125401
Okay, I think I'm now understanding where some of the confusion has arisen. Based on the discussions when originally designing the feature, I have been assuming that a goal of the history journal is to capture each update for a specific subset of the inventory tables. However based on these answers the code snippets from https://github.com/databrickslabs/ucx/pull/2743#discussion_r1783326822, it seems that the intention is:
experimental-migration-progress
workflow.table-migration
and experimental-workflow-linter
) are not captured by the history table.Somewhat related:
@nfx: Is this correct?
The journal and workflow runs tables are intended to only capture snapshot data produced when crawling as part of the experimental-migration-progress workflow.
correct
The implementation will be as a set of wrappers around the crawlers; the crawlers themselves will be unaware of the history mechanism.
this seems to be the least invasive option, otherwise we have issues with DFSA & migration status
The journal content will not, as a design principle, be consistent with the content of the inventory tables: in particular we accept that inventory updates performed by other workflows (including table-migration and experimental-workflow-linter) are not captured by the history table.
we just need to modify tasks to use history log from those workflows and then journal is consistent. this way it's explicit
There is some technical debt relating to detection of intentionally-empty-vs-uncrawled inventory tables, which in particular affect the DFSA crawlers. We had discussed (partially) addressing this using the history/runs tables. At present however this will not be the case.
every workflow invocation we'll have the record in runs table and empty crawl result would fit well
I think we're getting closer here.
The journal content will not, as a design principle, be consistent with the content of the inventory tables: in particular we accept that inventory updates performed by other workflows (including table-migration and experimental-workflow-linter) are not captured by the history table.
we just need to modify tasks to use history log from those workflows and then journal is consistent. this way it's explicit
Inside the table-migration
workflow we have a task (migrate-views
) that internally forces repeated snapshots of the migration_status
inventory. Irrespective of whether we explicitly capture into the history or not, with the proposed design we cannot both:
Options that spring to mind are:
snapshot_id
: to distinguish captured snapshots for a given type within a workflow run.)@nfx: Have I missed anything here?
Relax the consistency requirement. This gets us closest to the proposed design by letting us explicitly capture only at the task level, and ignore the intermediate (internal) snapshots that this workflow performs.
this is the preferred option
Summary
Create a
ucx.workflow_runs
table to persist UCX workflow runs, e.g.assessment
ormigration-process
.Description
Table schema
started_at
dt.datetime
finished_at
dt.datetime
workspace_id
int
workflow_name
str
"migration-progress"
or"assessment"
workflow_id
int
workflow_run_id
int
run_as
str
status
str