**JCZuurmond** opened 2 weeks ago
Preferably the object data would be stored as a struct; however, since each object type has different attributes, the struct type would need updating whenever one of the ucx dataclasses representing an object changes, or when a new object type is added.
@nfx and @asnare: please give your opinion here.
To be determined: when do we create this table? The other tables are created during `databricks labs install ucx`, i.e. in ucx's `install.py`. For a similar approach, I suggest creating this table from a command line call as well, not in a workflow. It could become part of #2571.
Some notes on the schema:
The `object_type` will correspond to the type of the inventory record. I think we have 2 choices here:

* `tables`, `jobs`, `pipelines`, `migration_status`, etc.
* `Table`, `JobInfo`, `PipelineInfo`, `TableMigrationStatus`, etc.

My preference is for the former.
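The first option can be sketched as an explicit mapping from dataclass names to lowercase discriminator values. This is a hypothetical illustration, not the real ucx code; the mapping only covers the examples named above.

```python
# Hypothetical sketch of the first naming option: an explicit mapping from
# ucx dataclass names to lowercase object_type values.
_OBJECT_TYPES = {
    "Table": "tables",
    "JobInfo": "jobs",
    "PipelineInfo": "pipelines",
    "TableMigrationStatus": "migration_status",
}


def object_type_for(klass: type) -> str:
    """Return the object_type discriminator for an inventory dataclass."""
    return _OBJECT_TYPES[klass.__name__]


class JobInfo:  # stand-in for the real ucx dataclass
    pass


print(object_type_for(JobInfo))  # jobs
```

An explicit table (rather than automatic CamelCase-to-snake_case conversion) keeps irregular names like `TableMigrationStatus` → `migration_status` under control.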
We're missing an `object_type_version` field, to be used for forward compatibility. An integer that defaults to `0`.
The `object_id` field will contain a type-specific business key that identifies the inventory record. I'm okay with this being a string, but it could also be an `array<string>`. (Our crawlers don't currently identify their business keys: this is something they need to be updated for.)

Something to consider is whether this should be optional or not. Traditionally, relational-modelling folk said that tables must have a primary key, but that turned out to be unpragmatic when you just want to dump a bunch of records into a table. Thoughts?
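The string-versus-`array<string>` choice matters most for objects whose business key is naturally composite, such as a table. A hypothetical sketch (`Table` below is a stand-in, not the real ucx dataclass):

```python
# Hypothetical sketch contrasting the two object_id encodings for a table,
# whose business key is naturally composite.
from dataclasses import dataclass


@dataclass(frozen=True)
class Table:  # stand-in for the real ucx dataclass
    catalog: str
    database: str
    name: str


def object_id_as_array(table: Table) -> list[str]:
    # array<string>: components stay addressable in SQL without parsing.
    return [table.catalog, table.database, table.name]


def object_id_as_string(table: Table) -> str:
    # plain string: compact, but needs a joining/escaping convention.
    return ".".join(object_id_as_array(table))


t = Table("main", "inventory", "orders")
print(object_id_as_array(t))   # ['main', 'inventory', 'orders']
print(object_id_as_string(t))  # main.inventory.orders
```

The array form avoids having to pick an escaping rule for names that contain the separator character.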
The `object_data` field will be a string containing a JSON-encoded dictionary for the inventory record. Spark `struct` support is static in nature, which means we can't use it here.
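Encoding a record to that string column could look like the following hypothetical sketch (`JobInfo` is again a stand-in for the real ucx dataclass):

```python
# Hypothetical sketch: since a Spark struct column is statically typed, the
# object_data column is a plain string holding the JSON-encoded record.
import json
from dataclasses import asdict, dataclass


@dataclass
class JobInfo:  # stand-in for the real ucx dataclass
    job_id: int
    name: str


def encode_object_data(record) -> str:
    """Serialise a dataclass instance to a JSON string for object_data."""
    return json.dumps(asdict(record), sort_keys=True)


print(encode_object_data(JobInfo(job_id=123, name="nightly-etl")))
# {"job_id": 123, "name": "nightly-etl"}
```

Sorting the keys keeps the serialised form deterministic, which makes rows comparable across runs.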
The `owner` is the account that is performing the crawl (and persisting the snapshot).
Also, there needs to be a `snapshot_id` identifier (or something similar). Particularly for tables (and migration status), there are lots of crawls performed within the same workflow run.
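One way to keep those crawls distinguishable is to mint a fresh identifier per crawl; a minimal sketch, assuming a UUID is an acceptable format:

```python
# Hypothetical sketch: mint a fresh snapshot_id per crawl, so that several
# crawls within the same workflow run stay distinguishable.
import uuid


def new_snapshot_id() -> str:
    return str(uuid.uuid4())


first, second = new_snapshot_id(), new_snapshot_id()
print(first != second)  # True: two crawls in one run get distinct ids
```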
> We're missing an `object_type_version` field, to be used for forward compatibility. An integer that defaults to 0.

Our dataclasses do not have a version atm. And wouldn't the ucx version cover this?
> The `owner` is the account that is performing the crawl (and persisting the snapshot).

Why not add the owner from the object?
> We're missing an `object_type_version` field, to be used for forward compatibility. An integer that defaults to 0.
>
> Our dataclasses do not have a version atm. And wouldn't the ucx version cover this?

Typically it's easier to version serialisation formats independently of the software version they're used in. (Corner cases with software upgrades and downgrades are also handled better.)

Technically we don't need a version for the first version (i.e., what we're doing now) and could introduce the version field when it's needed. However, schema changes are painful, so having the field from the start is worth the cost, even though it's not needed yet.
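Independent format versioning means deserialisation can dispatch on `(object_type, object_type_version)`. A hypothetical sketch of that dispatch (the decoder registry and its entries are illustrative, not ucx code):

```python
# Hypothetical sketch: dispatch deserialisation on (object_type,
# object_type_version) so the serialisation format can evolve independently
# of the ucx software version.
import json

_DECODERS = {
    # Version 0 is the current layout and needs no translation.
    ("jobs", 0): lambda data: data,
    # A future ("jobs", 1) decoder would upgrade old records here.
}


def decode(object_type: str, object_type_version: int, object_data: str) -> dict:
    key = (object_type, object_type_version)
    if key not in _DECODERS:
        raise ValueError(f"unsupported record format: {key}")
    return _DECODERS[key](json.loads(object_data))


print(decode("jobs", 0, '{"job_id": 123}'))  # {'job_id': 123}
```

A reader running an older ucx version hits the explicit `ValueError` instead of silently misreading a newer record, which is the downgrade corner case mentioned above.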
> > The `owner` is the account that is performing the crawl (and persisting the snapshot).
>
> Why not add the owner from the object?

That's also possible, assuming that all records in our inventory tables correspond to an object that has an owner.

Stepping back a bit, what's the purpose of having this field? I have it in my notes for the schema but what's the use-case here?

With that in mind:

* owner as account-that-did-the-crawl is useful as (potentially diagnostic) metadata for understanding the privileges with which a crawl occurred, and if that's changed.
* owner as owner-of-the-underlying-object-that-the-record-refers-to is maybe redundant, because that information is also in the `object_data` field?

This is different to the `object_type` and `object_id` fields:

* `object_type` is used to discriminate the type of record, and is essential (combined with the version field) for deserialising the `object_data` JSON.
* `object_id` is used to efficiently (combined with workspace and type) identify at the SQL layer the 'same' record across each run, without needing to first decode the (type-specific) `object_data` data.
I remember the purpose to be: with the owner field, the people running the migration know whom to notify about "their" objects being migrated.
> owner as account-that-did-the-crawl is useful as (potentially diagnostic) metadata for understanding the privileges with which a crawl occurred, and if that's changed.

We could add another field for this, like: `run_as`.

> owner as owner-of-the-underlying-object-that-the-record-refers-to is maybe redundant, because that information is also in the `object_data` field?

If it is redundant, then we do not need it.
Decided to add both owners:

| Field | Type | Description |
| --- | --- | --- |
| `object_owner` | `str` | The identity that has ownership of the object. |
| `run_as` | `str` | The identity of the account that ran the workflow that generated this record. |

Will update the table above.
> Decided to add both owners:

👍
> `run_as` `str` The identity of the account that ran the workflow that generated this record

Maybe, to avoid confusion with the same property on workflows: `snapshotted_as` or `snapshotted_by`? There's probably a better name than this… past tense is nice.
Something that is ambiguous at the moment: is this history table intended to be compatible with all our crawlers, or just a subset?

At the very least it's intended to capture crawls from the `migration-progress` workflow. However, there are other inventory tables that are updated/refreshed by other workflows:

* `Permissions` (`migrate-groups`)
* `ReconResult` (`migrate-data-reconciliation`)

(Even if we decide the journal isn't designed to capture all inventory classes, these should probably also be captured.)

Irrespective of this, should we ensure that the mechanism is robust enough for all the inventory classes? In particular, the `history` file will be shared across workspaces. Upgrades or changes to it will be very painful, particularly as it's not intended to be a file that is removed during uninstall (because it might be shared).
> Something that is ambiguous at the moment: is this history table intended to be compatible with all our crawlers, or just a subset?
>
> At the very least it's intended to capture crawls from the `migration-progress` workflow. However there are other inventory tables that are updated/refreshed by other workflows:
>
> * `Permissions` (`migrate-groups`)
> * `ReconResult` (`migrate-data-reconciliation`)

During some discussion we concluded that, right now, we only need to support the inventory types updated by the experimental `migration-progress` workflow. For the above classes:

* `Permissions` will soon become unnecessary once API-based migration becomes generally available.
* `ReconResult` should be covered by the history.

We didn't cover this during the discussion, but right now the `DirectFsAccess` tables are not part of the `migration-progress` workflow; it is the intention that these also become refreshable. So this class also needs to be covered by the history.

> Irrespective of this, should we ensure that the mechanism is robust enough for all the inventory classes? In particular, the `history` file will be shared across workspaces. Upgrades or changes to it will be very painful, particularly as it's not intended to be a file that is removed during uninstall (because it might be shared).

This is still an open question.
The history table should have the following schema. (The `object_type_version` defaults to `0` if not specified.)
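Pulling the fields discussed in this thread together, one row of the history table might look like the following hypothetical sketch; the names and types are assumptions drawn from the discussion, not a final schema.

```python
# Hypothetical sketch of one row of the history table, assembling the fields
# discussed in this thread.
from dataclasses import dataclass


@dataclass
class HistoryRecord:
    snapshot_id: str          # identifies one crawl within a workflow run
    object_type: str          # record discriminator, e.g. "tables" or "jobs"
    object_id: str            # type-specific business key
    object_data: str          # JSON-encoded dictionary for the record
    object_owner: str         # identity that has ownership of the object
    run_as: str               # identity that ran the generating workflow
    object_type_version: int = 0  # serialisation format version


row = HistoryRecord(
    snapshot_id="f3b0…",  # e.g. a UUID minted per crawl
    object_type="jobs",
    object_id="123",
    object_data='{"job_id": 123, "name": "nightly-etl"}',
    object_owner="alice@example.com",
    run_as="svc-ucx@example.com",
)
print(row.object_type_version)  # 0
```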