Open jordandakota opened 1 year ago
Investigating now. Thanks for the report.
So I dug into this and we may be stuck for now. The syntax for declaring the comments at table creation requires knowing the list of all columns and their types. We are currently doing this with views by asking for a limit 0 of the view, but this works because the view is referencing the columns of existing tables. Once the table is created, we don't need to specify the type of a column to alter it, but there is currently no syntax for bulk updating columns after a table is created. We're looking into if we can have that added to our SQL, but it's going to be awhile, if we even get signoff for that change.
What we could do is give a config flag that basically says 'I promise to provide the correct column name and type for every column in my schema'. With this information, we can at least avoid per-column update on table creation, though we'd still have no option for incremental tables.
Don't use persist_docs and instead parse the comments from dbt model and use unity catalog rest api
@jordandakota which api do you have in mind? I went looking for this and I don't see any way to bulk update column comments.
There was a preview api for the Unity Catalog AI generated comments where you could get the comment and put the comment, but you could put any comment, it didn’t have to be the AI generated comment. It was asynchronous and didn’t run a SQL execute statement. Can’t seem to find the doc anymore.
Jordan From: Ben Cassell @.> Sent: Friday, November 17, 2023 12:01 PM To: databricks/dbt-databricks @.> Cc: Jordan Fox @.>; Mention @.> Subject: Re: [databricks/dbt-databricks] Performance improvement on persist_docs (Issue #507)
CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you recognize the sender and know the content is safe.
Don't use persist_docs and instead parse the comments from dbt model and use unity catalog rest api
@jordandakotahttps://github.com/jordandakota which api do you have in mind? I went looking for this and I don't see any way to bulk update column comments.
— Reply to this email directly, view it on GitHubhttps://github.com/databricks/dbt-databricks/issues/507#issuecomment-1816938224, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AHXVCFNYESCVVXZKG7PSFU3YE6YAJAVCNFSM6AAAAAA7NFRLF6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQMJWHEZTQMRSGQ. You are receiving this because you were mentioned.Message ID: @.**@.>>
I have a change I'm working on now that will filter to only changed comments on incremental table updates. While that doesn't address the core issue (no bulk update), it should dramatically reduce the time on comment updates for incrementals, as I imagine most comments stay the same between incremental runs.
Hey @benc-db, any chance this is still being worked on?
Still pushing internally to get a bulk alter.
Switching to blocked. There are 2 possible changes internally (and I'm pushing for both), either of which could allow a significant improvement. I've also raised with the DBSQL team that adding column comments in general is unreasonably slow (architecturally, something causes it to scale with table size).
@benc-db any progress in this area? A SQL bulk alter for column descriptions would be nice, right now it is super inefficient. I would also love to be able to incorporate the Unity Catalog AI somehow as a model setting or have an option to persist existing comments during "create or replace". Then comments can be set manually in Unity Catalog and not be cleared during table runs.
Apologies for delay, was out on vacation. We are still waiting on a bulk API unfortunately. The other thing that would help us out here is if we could persist the comments on table create, rather than alter, since I suspect most changes after create are marginal. The thing that is blocking that is the ability to add comments to columns without knowing column type, which is the other change I'm pushing for.
Describe the feature
persist_docs currently sends an individual alter table statement per column comment. This takes forever for a large project.
Describe alternatives you've considered
Don't use persist_docs and instead parse the comments from dbt model and use unity catalog rest api Don't use persist_docs and instead parse the comments from dbt model and generate my own custom statements Don't use dbt comments at all and instead use unity catalog rest api to AI generate comments and persist those Most alternatives involve not using persist_docs or not using column level comments
Additional context
We have the ability to add column level comments at the create table step and through the rest api, individual alter statements are the worst alternative.
Who will this benefit?
Everyone using persist_docs that has more than a couple models with column level comments.
Are you interested in contributing this feature?
Not at this time.