Open JCZuurmond opened 1 week ago
@nfx and @asnare : In this table, we want to persist when a crawler did not return any objects, to differentiate between a crawler that has not run and a crawler that ran but returned no objects. How do you want to persist this information, i.e. what column schema?
Also, do we need to keep this in the `ucx` catalog? Or should we keep this in the `hive_metastore`?
| | `ucx` | `hive_metastore` |
---|---|---|
Pro | Contains data over multiple workspace scans to which this is relevant. | Available during assessment. |
Con | Grows the new `ucx` catalog; we might want to be conservative with this. | Assessment should only be run once, so this is irrelevant there: the table is only used when rerunning. |
When we discussed this at the office the intent was: This will be in the `ucx` catalog and not the `hive_metastore`. (So yes, it's not available during assessment.)
I think a technical driver for this choice is that this table will be updated quite often with small updates, and that will behave better on the `ucx` catalog.
The purpose of the table is also to allow for interpretation of the history table when a crawl produces no records. Without this table we won't be able to handle that situation.
I was going to update the `snapshot()` functionality of the crawlers to use this table as an optimisation: if the loader returns 0 rows, consult this table if available to determine whether to actually return 0 rows or (as is currently the case) perform a crawl.
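To make the proposed `snapshot()` behaviour concrete, here is a minimal sketch of the decision logic. It is not the actual crawler code: the function signature and the `load`, `crawl` and `has_recorded_scan` callables are hypothetical stand-ins, and persisting the crawl result is left out.

```python
from collections.abc import Callable, Iterable
from typing import Optional, TypeVar

T = TypeVar("T")


def snapshot(
    load: Callable[[], Iterable[T]],
    crawl: Callable[[], Iterable[T]],
    has_recorded_scan: Optional[Callable[[], bool]] = None,
) -> list[T]:
    """Return persisted rows, crawling only when no earlier scan is known.

    All three callables are hypothetical:
    - load: reads previously persisted rows
    - crawl: performs a fresh crawl
    - has_recorded_scan: consults the scans table; None if it is unavailable
    """
    rows = list(load())
    if rows:
        return rows
    # Zero rows loaded: either the crawler never ran, or it ran and found nothing.
    if has_recorded_scan is not None and has_recorded_scan():
        # A scan is on record, so the empty result is genuine.
        return []
    # No scans table, or no scan on record: fall back to crawling (current behaviour).
    return list(crawl())
```

When the scans table is unavailable (`has_recorded_scan=None`), the sketch behaves like the current implementation and always crawls on an empty load.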
Summary
Create a `ucx.workspace_scans` table to persist the `migration-process` jobs that scan the workspace to update the migration process status.
Description
Table schema
Column | Type | Description |
---|---|---|
`run_id` | INT | `migration-process` job run. |
`run_start_time` | TIMESTAMP | |
`workspace_id` | INT | `migration-process` job ran. |
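
As an illustration only, the three columns listed above could be mirrored in a Python record like the one below. The class name `WorkspaceScan`, the helper `create_table_sql` and the generated DDL are assumptions made for this sketch, not something decided in this issue.

```python
from dataclasses import dataclass, fields
from datetime import datetime


@dataclass(frozen=True)
class WorkspaceScan:
    """One row of the proposed scans table (hypothetical class name)."""

    run_id: int
    run_start_time: datetime
    workspace_id: int


# Mapping from Python field types to the SQL types named in the schema.
_SQL_TYPES = {int: "INT", datetime: "TIMESTAMP"}


def create_table_sql(table: str = "ucx.workspace_scans") -> str:
    """Build a CREATE TABLE statement from the dataclass fields."""
    columns = ", ".join(
        f"{field.name} {_SQL_TYPES[field.type]}" for field in fields(WorkspaceScan)
    )
    return f"CREATE TABLE IF NOT EXISTS {table} ({columns})"
```

Deriving the DDL from the dataclass keeps the column list in one place, so a column added later only has to be declared once.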