**JCZuurmond** opened 2 weeks ago
Preferably the object data would be stored as a struct; however, since each object type has different attributes, the struct type would need updating whenever one of the ucx dataclasses representing an object changes, or when a new object type is added.
@nfx and @asnare: please give your opinion here.
To be determined: when do we create this table? The other tables are created during `databricks labs install ucx`, i.e. in ucx's `install.py`. For a similar approach, I suggest creating this table from a command line call as well, not in a workflow. It could become part of #2571.
Some notes on the schema:
The `object_type` will correspond to the type of the inventory record. I think we have 2 choices here:

* `tables`, `jobs`, `pipelines`, `migration_status`, etc.
* `Table`, `JobInfo`, `PipelineInfo`, `TableMigrationStatus`, etc.

My preference is for the former.
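The first option can be sketched as an explicit mapping from dataclass names to lowercase discriminator values. This is a hypothetical illustration, not the real ucx code; the mapping only covers the examples named above.

```python
# Hypothetical sketch of the first naming option: an explicit mapping from
# ucx dataclass names to lowercase object_type values.
_OBJECT_TYPES = {
    "Table": "tables",
    "JobInfo": "jobs",
    "PipelineInfo": "pipelines",
    "TableMigrationStatus": "migration_status",
}


def object_type_for(klass: type) -> str:
    """Return the object_type discriminator for an inventory dataclass."""
    return _OBJECT_TYPES[klass.__name__]


class JobInfo:  # stand-in for the real ucx dataclass
    pass


print(object_type_for(JobInfo))  # jobs
```

An explicit table (rather than automatic CamelCase-to-snake_case conversion) keeps irregular names like `TableMigrationStatus` → `migration_status` under control.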
We're missing an `object_type_version` field, to be used for forward compatibility. An integer that defaults to `0`.
The `object_id` field will contain a type-specific business key that identifies the inventory record. I'm okay with this being a string, but it could also be an `array<string>`. (Our crawlers don't currently identify their business keys: this is something they need to be updated for.)

Something to consider is whether this should be optional or not. Traditionally, relational-modelling folk said that tables must have a primary key, but that turned out to be unpragmatic when you just want to dump a bunch of records into a table. Thoughts?
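The string-versus-`array<string>` choice matters most for objects whose business key is naturally composite, such as a table. A hypothetical sketch (`Table` below is a stand-in, not the real ucx dataclass):

```python
# Hypothetical sketch contrasting the two object_id encodings for a table,
# whose business key is naturally composite.
from dataclasses import dataclass


@dataclass(frozen=True)
class Table:  # stand-in for the real ucx dataclass
    catalog: str
    database: str
    name: str


def object_id_as_array(table: Table) -> list[str]:
    # array<string>: components stay addressable in SQL without parsing.
    return [table.catalog, table.database, table.name]


def object_id_as_string(table: Table) -> str:
    # plain string: compact, but needs a joining/escaping convention.
    return ".".join(object_id_as_array(table))


t = Table("main", "inventory", "orders")
print(object_id_as_array(t))   # ['main', 'inventory', 'orders']
print(object_id_as_string(t))  # main.inventory.orders
```

The array form avoids having to pick an escaping rule for names that contain the separator character.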
The `object_data` field will be a string containing a JSON-encoded dictionary for the inventory record. Spark `struct` support is static in nature, which means we can't use it here.
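Encoding a record to that string column could look like the following hypothetical sketch (`JobInfo` is again a stand-in for the real ucx dataclass):

```python
# Hypothetical sketch: since a Spark struct column is statically typed, the
# object_data column is a plain string holding the JSON-encoded record.
import json
from dataclasses import asdict, dataclass


@dataclass
class JobInfo:  # stand-in for the real ucx dataclass
    job_id: int
    name: str


def encode_object_data(record) -> str:
    """Serialise a dataclass instance to a JSON string for object_data."""
    return json.dumps(asdict(record), sort_keys=True)


print(encode_object_data(JobInfo(job_id=123, name="nightly-etl")))
# {"job_id": 123, "name": "nightly-etl"}
```

Sorting the keys keeps the serialised form deterministic, which makes rows comparable across runs.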
The `owner` is the account that is performing the crawl (and persisting the snapshot).
Also, there needs to be a `snapshot_id` identifier (or something similar). Particularly for tables (and migration status), there are lots of crawls performed within the same workflow run.
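One way to keep those crawls distinguishable is to mint a fresh identifier per crawl; a minimal sketch, assuming a UUID is an acceptable format:

```python
# Hypothetical sketch: mint a fresh snapshot_id per crawl, so that several
# crawls within the same workflow run stay distinguishable.
import uuid


def new_snapshot_id() -> str:
    return str(uuid.uuid4())


first, second = new_snapshot_id(), new_snapshot_id()
print(first != second)  # True: two crawls in one run get distinct ids
```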
> We're missing an `object_type_version` field, to be used for forward compatibility. An integer that defaults to 0.

Our dataclasses do not have a version atm. And wouldn't the ucx version cover this?
> The `owner` is the account that is performing the crawl (and persisting the snapshot).

Why not add the owner from the object?
> We're missing an `object_type_version` field, to be used for forward compatibility. An integer that defaults to 0.
>
> Our dataclasses do not have a version atm. And wouldn't the ucx version cover this?

Typically it's easier to version serialisation formats independently of the software version they're used in. (Corner cases with software upgrades and downgrades are also handled better.)

Technically we don't need a version for the first version (i.e., what we're doing now) and could introduce the version field when it's needed. However, schema changes are painful, so having the field from the start is worth the cost, even though it's not needed yet.
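Independent format versioning means deserialisation can dispatch on `(object_type, object_type_version)`. A hypothetical sketch of that dispatch (the decoder registry and its entries are illustrative, not ucx code):

```python
# Hypothetical sketch: dispatch deserialisation on (object_type,
# object_type_version) so the serialisation format can evolve independently
# of the ucx software version.
import json

_DECODERS = {
    # Version 0 is the current layout and needs no translation.
    ("jobs", 0): lambda data: data,
    # A future ("jobs", 1) decoder would upgrade old records here.
}


def decode(object_type: str, object_type_version: int, object_data: str) -> dict:
    key = (object_type, object_type_version)
    if key not in _DECODERS:
        raise ValueError(f"unsupported record format: {key}")
    return _DECODERS[key](json.loads(object_data))


print(decode("jobs", 0, '{"job_id": 123}'))  # {'job_id': 123}
```

A reader running an older ucx version hits the explicit `ValueError` instead of silently misreading a newer record, which is the downgrade corner case mentioned above.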
> > The `owner` is the account that is performing the crawl (and persisting the snapshot).
>
> Why not add the owner from the object?

That's also possible, assuming that all records in our inventory tables correspond to an object that has an owner.

Stepping back a bit, what's the purpose of having this field? I have it in my notes for the schema but what's the use-case here?

With that in mind:

* owner as account-that-did-the-crawl is useful as (potentially diagnostic) metadata for understanding the privileges with which a crawl occurred, and if that's changed.
* owner as owner-of-the-underlying-object-that-the-record-refers-to is maybe redundant, because that information is also in the `object_data` field?

This is different to the `object_type` and `object_id` fields:

* `object_type` is used to discriminate the type of record, and is essential (combined with the version field) for deserialising the `object_data` JSON.
* `object_id` is used to efficiently (combined with workspace and type) identify at the SQL layer the 'same' record across each run, without needing to first decode the (type-specific) `object_data` data.
I remember the purpose to be: with the owner field, the people running the migration know whom to notify about "their" objects being migrated.
> owner as account-that-did-the-crawl is useful as (potentially diagnostic) metadata for understanding the privileges with which a crawl occurred, and if that's changed.

We could add another field for this, like: `run_as`.

> owner as owner-of-the-underlying-object-that-the-record-refers-to is maybe redundant, because that information is also in the `object_data` field?

If it is redundant, then we do not need it.
Decided to add both owners:

| Field | Type | Description |
| --- | --- | --- |
| `object_owner` | `str` | The identity that has ownership of the object. |
| `run_as` | `str` | The identity of the account that ran the workflow that generated this record. |

Will update the table above.
> Decided to add both owners:

👍
> `run_as` `str` The identity of the account that ran the workflow that generated this record

Maybe, to avoid confusion with the same property on workflows: `snapshotted_as` or `snapshotted_by`? There's probably a better name than this… past tense is nice.
Something that is ambiguous at the moment: is this history table intended to be compatible with all our crawlers, or just a subset?

At the very least it's intended to capture crawls from the `migration-progress` workflow. However, there are other inventory tables that are updated/refreshed by other workflows:

* `Permissions` (`migrate-groups`)
* `ReconResult` (`migrate-data-reconciliation`)

(Even if we decide the journal isn't designed to capture all inventory classes, these should probably also be captured.)

Irrespective of this, should we ensure that the mechanism is robust enough for all the inventory classes? In particular, the `history` file will be shared across workspaces. Upgrades or changes to it will be very painful, particularly as it's not intended to be a file that is removed during uninstall (because it might be shared).
> Something that is ambiguous at the moment: is this history table intended to be compatible with all our crawlers, or just a subset?
>
> At the very least it's intended to capture crawls from the `migration-progress` workflow. However there are other inventory tables that are updated/refreshed by other workflows:
>
> * `Permissions` (`migrate-groups`)
> * `ReconResult` (`migrate-data-reconciliation`)

During some discussion we concluded that, right now, we only need to support the inventory types updated by the experimental `migration-progress` workflow. For the above classes:

* `Permissions` will soon become unnecessary once API-based migration becomes generally available.
* `ReconResult` should be covered by the history.

We didn't cover this during the discussion, but right now the `DirectFsAccess` tables are not part of the `migration-progress` workflow; it is the intention that these also become refreshable. So this class also needs to be covered by the history.

> Irrespective of this, should we ensure that the mechanism is robust enough for all the inventory classes? In particular, the `history` file will be shared across workspaces. Upgrades or changes to it will be very painful, particularly as it's not intended to be a file that is removed during uninstall (because it might be shared).

This is still an open question.
The history table should have the following schema. (The `object_type_version` defaults to `0` if not specified.)
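Pulling the fields discussed in this thread together, one row of the history table might look like the following hypothetical sketch; the names and types are assumptions drawn from the discussion, not a final schema.

```python
# Hypothetical sketch of one row of the history table, assembling the fields
# discussed in this thread.
from dataclasses import dataclass


@dataclass
class HistoryRecord:
    snapshot_id: str          # identifies one crawl within a workflow run
    object_type: str          # record discriminator, e.g. "tables" or "jobs"
    object_id: str            # type-specific business key
    object_data: str          # JSON-encoded dictionary for the record
    object_owner: str         # identity that has ownership of the object
    run_as: str               # identity that ran the generating workflow
    object_type_version: int = 0  # serialisation format version


row = HistoryRecord(
    snapshot_id="f3b0…",  # e.g. a UUID minted per crawl
    object_type="jobs",
    object_id="123",
    object_data='{"job_id": 123, "name": "nightly-etl"}',
    object_owner="alice@example.com",
    run_as="svc-ucx@example.com",
)
print(row.object_type_version)  # 0
```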