[FEATURE] Create a `ucx.workspace_scans` table

databrickslabs / ucx

Automated migrations to Unity Catalog


[FEATURE] Create a `ucx.workspace_scans` table #2600

Open JCZuurmond opened 1 week ago

JCZuurmond commented 1 week ago

Summary

Create a `ucx.workspace_scans` table to persist the runs of the `migration-process` jobs that scan the workspace to update the migration process status.

Description

Table schema

Column name	Data Type	Comment
`run_id`	`INT`	The run id of the `migration-process` job run.
`run_start_time`	`TIMESTAMP`	The timestamp of the job run start.
`workspace_id`	`INT`	The workspace id in which the `migration-process` job ran.
tbd		See comment below
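
For discussion, the three columns defined so far could be sketched as a Python dataclass, with the DDL derived from it. This is only an illustration under assumptions: the class name `WorkspaceScan`, the helper `create_table_ddl` and the type mapping are hypothetical and not existing ucx code, and the `tbd` column is left out on purpose.

```python
import datetime as dt
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class WorkspaceScan:
    """Hypothetical row type for the proposed table (the tbd column is omitted)."""

    run_id: int  # run id of the `migration-process` job run
    run_start_time: dt.datetime  # timestamp of the job run start
    workspace_id: int  # workspace in which the job ran


# Assumed mapping from Python types to the SQL types listed in the table above.
_SQL_TYPES = {int: "INT", dt.datetime: "TIMESTAMP"}


def create_table_ddl(full_name: str = "ucx.workspace_scans") -> str:
    """Render a CREATE TABLE statement from the dataclass fields."""
    columns = ", ".join(f"{f.name} {_SQL_TYPES[f.type]}" for f in fields(WorkspaceScan))
    return f"CREATE TABLE IF NOT EXISTS {full_name} ({columns})"


print(create_table_ddl())
```

The table name is taken verbatim from the issue title; whichever column ends up recording empty crawls would be added as one more field.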

JCZuurmond commented 1 week ago

@nfx and @asnare : In this table, we want to persist when a crawler did not return any objects, so that we can differentiate between a crawler that has not run and a crawler that ran but returned no objects. How do you want to persist this information, i.e. what column schema?

JCZuurmond commented 1 week ago

Also, do we need to keep this in the ucx catalog? Or should we keep this in the hive_metastore?

	`ucx`	`hive_metastore`
Pro	Holds data across multiple workspace scans, which is where this table is relevant.	Available during assessment.
Con	Grows the new ucx catalog; we might want to be conservative with this.	Assessment should only be run once, so availability there is irrelevant: the table is only used when rerunning.

asnare commented 9 hours ago

When we discussed this at the office the intent was: This will be in the ucx catalog and not the hive_metastore. (So yes, it's not available during assessment.)

I think a technical driver for this choice is that this table will be updated quite often with small updates, and that will behave better on the ucx catalog.

The purpose of the table is also to allow for interpretation of the history table when a crawl produces no records. Without this table we won't be able to handle that situation.

I was going to update the snapshot() functionality of the crawlers to use this table as an optimisation: if the loader returns 0 rows, consult this table if available to determine whether to actually return 0 rows or (as is currently the case) perform a crawl.
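
A minimal sketch of that decision logic, assuming the loader, the crawler and the scan-table lookup are passed in as plain callables. The function signature and the parameter names `load`, `crawl` and `scan_recorded` are hypothetical and do not reflect the actual crawler API:

```python
from __future__ import annotations

from collections.abc import Callable, Iterable
from typing import Optional, TypeVar

T = TypeVar("T")


def snapshot(
    load: Callable[[], Iterable[T]],
    crawl: Callable[[], Iterable[T]],
    scan_recorded: Optional[Callable[[], bool]] = None,
) -> list[T]:
    """Return stored rows, crawling only when no earlier scan is known."""
    rows = list(load())
    if rows:
        # The loader found records, so there is nothing to decide.
        return rows
    if scan_recorded is not None and scan_recorded():
        # The scans table says a crawl already ran: 0 rows is the real answer.
        return []
    # No scans table available, or no scan recorded: crawl (current behaviour).
    return list(crawl())
```

Passing `scan_recorded=None` stands for the "if available" case: without the table the function falls back to crawling, as it does today.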