databrickslabs / ucx

Automated migrations to Unity Catalog
Other
220 stars 76 forks source link

[FEATURE] Create a `ucx.history` table #2572

Open JCZuurmond opened 2 weeks ago

JCZuurmond commented 2 weeks ago

The history table should have the following schema:

Column name Data type Comment
workspace_id int The identifier of the workspace where this record was generated.
run_id int An identifier of the workflow run that generated this record.
run_as str The identity of the account that ran the workflow that generated this record.
run_start_time dt.datetime When this record was generated.
snapshot_id int An identifier that is unique to the records produced for a given snapshot.
ucx_version str The UCX semantic version
failures list[str] The list of problems associated with the object that this inventory record covers.
object_type str The inventory table for which this record was generated.
object_id list[str] The type-specific identifier for this inventory record.
object_data str Type-specific JSON-encoded data of the inventory record.
object_owner str The identity that has ownership of the object.
object_type_version int Versioning of inventory table, for forward compatibility. (Zero 0 if not specified.)
JCZuurmond commented 2 weeks ago

*Preferably the object data is stored as a struct, however, since each object type has different attributes this type needs updating when updating one of the ucx dataclasses representing the object OR when adding a new object

@nfx and @asnare : Please give your opinion here

JCZuurmond commented 2 weeks ago

To be determined: when do we create this table? The other tables are created during the databricks labs install ucx, i.e. the install.py in ucx. For a similar approach, I suggest creating this table also from a command line call and not in a workflow. It could become part of the #2571

asnare commented 1 week ago

Some notes on the schema:

asnare commented 6 days ago

Also, there needs to be a snapshot_id identifier (or something similar). Particularly for tables (and migration status) there are lots of crawls performed within the same workflow run.

JCZuurmond commented 6 days ago

We're missing an object_type_version field, to be used for forward compatibility. An integer that defaults to 0.

Our dataclasses do not have a version atm. And wouldn't the ucx version cover this?

JCZuurmond commented 6 days ago

| The owner is the account that is performing the crawl (and persisting the snapshot).?

Why not add the owner from the object?

asnare commented 6 days ago

We're missing an object_type_version field, to be used for forward compatibility. An integer that defaults to 0.

Our dataclasses do not have a version atm. And wouldn't the ucx version cover this?

Typically it's easier to version serialisation formats independently of the software version they're used in. (Corner cases with software upgrades and downgrades are also handled better.)

Technically we don't need a version for the first version (ie, what we're doing now) and can introduce the version field when it's needed. However schema changes are painful, so having the field from the start is worth the cost of having it from the start even though it's not needed yet.

asnare commented 6 days ago

| The owner is the account that is performing the crawl (and persisting the snapshot).?

Why not add the owner from the object?

That's also possible, assuming that all records in our inventory tables correspond to an object that has an owner.

Stepping back a bit, what's the purpose of having this field? I have it in my notes for the schema but what's the use-case here?

With that in mind:

This is different to the object_type and object_id fields:

JCZuurmond commented 6 days ago

| The owner is the account that is performing the crawl (and persisting the snapshot).? Why not add the owner from the object?

That's also possible, assuming that all records in our inventory tables correspond to an object that has an owner.

Stepping back a bit, what's the purpose of having this field? I have it in my notes for the schema but what's the use-case here?

With that in mind:

* owner as account-that-did-the-crawl is useful as (potentially diagnostic) metadata for understanding the privileges with which a crawl occurred, and if that's changed.

* owner as owner-of-the-underlying-object-that-the-record-refers-to is maybe redundant, because that information is also in the `object_data` field?

This is different to the object_type and object_id fields:

* `object_type` is used to discriminate the type of record, and essential (combined with the version field) for deserialising the `object_data` JSON.

* `object_id` is used to efficiently (combined with workspace and type) identify at the SQL layer the 'same' record across each run, without needing to first decode the (type-specific) `object_data` data.

I remember the purpose to be: with the owner field the people running the migration know who to notify about "their" objects going to be migrated.

owner as account-that-did-the-crawl is useful as (potentially diagnostic) metadata for understanding the privileges with which a crawl occurred, and if that's changed.

We could add another field for this, like: run_as

s maybe redundant, because that information is also in the object_data field? If it is redundant, then we do not need it

JCZuurmond commented 6 days ago

Decided to add both owners:

Will update the table above

asnare commented 5 days ago

Decided to add both owners:

👍

  • run_as str The identity of the account that ran the workflow that generated this record

Maybe to avoid confusion with the same property on workflows, snapshotted_as or snapshotted_by? There's probably a better name than this… past tense is nice.

asnare commented 5 days ago

Something that is ambiguous at the moment: is this history table intended to be compatible with all our crawlers, or just a subset?

At the very least it's intended to capture crawls from the migration-progress workflow. However there are other inventory tables that are updated/refreshed by other workflows:

(Even if we decide the journal isn't designed to capture all inventory classes, these should probably also be captured.)

Irrespective of this, should we ensure that the mechanism is robust enough for all the inventory classes? In particular, the history file will be shared across workspaces. Upgrades or changes to it will be very painful, particularly as it's not intended to be a file that is removed during uninstall (because it might be shared).

asnare commented 5 days ago

Something that is ambiguous at the moment: is this history table intended to be compatible with all our crawlers, or just a subset?

At the very least it's intended to capture crawls from the migration-progress workflow. However there are other inventory tables that are updated/refreshed by other workflows:

  • Permissions (migrate-groups)
  • ReconResult (migrate-data-reconciliation)

During some discussion we concluded that right now we only need to support the inventory types updated by the experimental migration-progress workflow. For the above classes:

We didn't cover this during the discussion but right now the DirectFsAccess tables are not part of the migration-progress workflow but it is the intention that these also become refreshable. So this class also needs to be covered by the history.

Irrespective of this, should we ensure that the mechanism is robust enough for all the inventory classes? In particular, the history file will be shared across workspaces. Upgrades or changes to it will be very painful, particularly as it's not intended to be a file that is removed during uninstall (because it might be shared).

This is still an open question.