databrickslabs / ucx

Automated migrations to Unity Catalog
Other
219 stars 75 forks source link

[FEATURE] Create a `ucx.workspace_scans` table #2600

Open JCZuurmond opened 1 week ago

JCZuurmond commented 1 week ago

Summary

Create a ucx.workspace_scans table to persist to the migration-process jobs in that scan the workspace to update the migration process status

Description

Table schema

Column name Data Type Comment
run_id INT The run id of the migration-process job run.
run_start_time TIMESTAMP The timestamp of the job run start.
workspace_id INT The workspace id in which the migration-process job ran.
tbd See comment below
JCZuurmond commented 1 week ago

@nfx and @asnare : In this table, we want to persist when a crawler did not return any objects to differentiate between a crawler not ran and no objects returned. How do you want to persist this information, i.e. what column schema?

JCZuurmond commented 1 week ago

Also, do we need to keep this in the ucx catalog? Or should we keep this in the hive_metastore?

ucx hive_metastore
Pro Contains data over multiple workspace scans to which this is relevant. Available during assessment
Con Growing the new ucx catalog; we might want to be conservative with this. Assessment should only be ran once, therefore irrelevant as it is only used when rerunning.
asnare commented 9 hours ago

When we discussed this at the office the intent was: This will be in the ucx catalog and not the hive_metastore. (So yes, it's not available during assessment.)

I think a technical driver for this choice is that this table will be updated quite often with small updates, and that will behave better on the ucx catalog.

The purpose of the table is also to allow for interpretation of the history table when a crawl produces no records. Without this table we won't be able to handle that situation.

I was going to update the snapshot() functionality of the crawlers to use this table as an optimisation: if the loader returns 0 rows, consult this table if available to determine whether to actually return 0 rows or (as is currently the case) perform a crawl.