cucaroach opened this issue 1 year ago
If we add some caching here, it seems like there will still be a fair amount of expensive work happening, since the optbuilder will still be calling `parser.ParseExpr` many times.
@cucaroach My understanding is that inserting into a table with an outbound FK requires us to validate that the to-be-inserted value satisfies the FK constraint. This is done by scanning the other, referenced table (`func (b *Builder) buildScan`).
This `buildScan` seems to be very general purpose, and among other things, it builds all check constraints into the table's metadata (here). Then, if there is any column of a user-defined type, it is treated as a synthesized check (i.e. a user-defined column is equivalent to a check constraint `col IN ('a', 'b', 'c')`) and thus we will need to "resolve" it (i.e. ensure the expression's type is BOOL). Among other things, "resolving" this `col IN ('a', 'b', 'c')` expression requires that all values on the right-hand side have the same type, and therefore we are going to type-check each value (i.e. 'a', 'b', and 'c') and ensure they are all of the same user-defined type (here).
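As a rough toy model of that O(n) behavior (illustrative Go only; these names are not the real optbuilder or type-checking code): the synthesized check enumerates every enum member, and each member is type-checked individually, so the total work scales with the enum size.

```go
package main

import "fmt"

// Toy model of the synthesized check for an enum column. A column of a
// user-defined ENUM behaves like CHECK (col IN ('a', 'b', 'c', ...)) where
// the list contains every enum member, so resolving the check does per-value
// work for each member. Names here are illustrative only.
type enumMember struct {
	logical  string
	physical []byte
}

// typeCheckValue stands in for the per-value work; in CockroachDB this is
// where the type descriptor lookup and hydration happen, here it is just a
// membership scan.
func typeCheckValue(members []enumMember, lit string) error {
	for _, m := range members {
		if m.logical == lit {
			return nil
		}
	}
	return fmt.Errorf("invalid enum value %q", lit)
}

// resolveSynthesizedCheck mimics resolving CHECK (col IN (<all members>)):
// one type check per member, so the total cost grows with the enum size.
func resolveSynthesizedCheck(members []enumMember) error {
	for _, m := range members {
		if err := typeCheckValue(members, m.logical); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	members := []enumMember{{"open", []byte{0x40}}, {"closed", []byte{0x80}}}
	fmt.Println(resolveSynthesizedCheck(members)) // <nil>
}
```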
I don't understand why caching in the resolver layer is the answer. As far as I know, each call into type-checking an enum value is very fast (0.001ms-0.010ms) and we already have some caching in place to achieve this. For example, part of the work in this type-checking process is to retrieve the type descriptor, its parent schema descriptor, and its parent database descriptor, all of which are cached in the leased descriptor layer. Another example is that this type checking also involves "type hydration" (here) -- filling in `*types.T.TypeMeta` for user-defined ENUM types -- and this is short-circuited as well.
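To illustrate that short-circuit with made-up names (this is not the real hydration code, just a sketch of the idea): once a type value already carries its metadata, repeated hydration calls return immediately.

```go
package main

import "fmt"

// Illustrative stand-ins for *types.T and its TypeMeta; not the real API.
type typeMeta struct {
	enumLogicalReps []string
}

type userDefinedType struct {
	oid  uint32
	meta *typeMeta // nil until hydrated
}

// hydrateTypeMeta fills in the metadata from the (already leased and cached)
// descriptor, but only on the first call for a given type value; later calls
// return immediately, which is the short-circuit described above.
func hydrateTypeMeta(t *userDefinedType, lookupDesc func(oid uint32) []string) {
	if t.meta != nil {
		return // already hydrated
	}
	t.meta = &typeMeta{enumLogicalReps: lookupDesc(t.oid)}
}

func main() {
	calls := 0
	lookup := func(oid uint32) []string {
		calls++
		return []string{"open", "closed", "stale"}
	}
	typ := &userDefinedType{oid: 100052}
	hydrateTypeMeta(typ, lookup)
	hydrateTypeMeta(typ, lookup) // no-op: metadata already present
	fmt.Println(calls)           // 1
}
```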
Overall, just like @rafiss said, the optbuilder is going to type-check each and every enum value, and that's the O(n) part. I think the answer more likely lies in the optbuilder, to avoid all these calls, because after all, our job here is just to scan the referenced table to determine whether an insert (on the original, referencing table) should succeed, and we can probably achieve that without resolving any check constraints on the referenced table. But I am no expert here, and I acknowledge that it may be hard to special-case `buildScan` for this particular situation. In that case, we can have more conversation about it; I just don't think adding caching in the resolver layer is the right approach.
@cucaroach could you either share the raw profile, or else add a new screenshot that does not have some of the call stack covered by that tooltip?
Ignore that `ParseExpr` time, that was from a bunk experiment where I was re-parsing the schema every time; below is just running the query.
Here's another with tracing enabled. I present this as evidence that just dealing with the enum metadata is probably not enough. The tracing of the `getDescriptorsByID` call is almost as expensive!
Just to note - in that last one, it looks like the main reason `getDescriptorsByID` is expensive is because of its call to `log.Eventf`, and that only occurs if tracing / expensive logging is enabled: https://github.com/cockroachdb/cockroach/blob/58ace12718a43b05858e0be9693856977d8f5370/pkg/sql/catalog/descs/descriptor.go#L133-L138
> Just to note - in that last one, it looks like the main reason `getDescriptorsByID` is expensive is because of its call to `log.Eventf`, and that only occurs if tracing / expensive logging is enabled:
Right, but when a query is slow we always ask for a statement bundle, which enables this stuff, so it's important for the overhead of tracing/expensive logging to be as minimal as possible. We will frequently optimize things to make sure traced requests don't drift too far from untraced execution, so that we're optimizing the right things. If we aren't going to optimize away this log call with a cache or something, I would argue we should remove it. It's basically skewing things to make it look like we're spending more time in planning than we really are.
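One hypothetical shape for the "cache or something" option, sketched with placeholder types rather than the real `descs`/`log` code: memoize the rendered description of the descriptors so that, when tracing is on, repeated trace events don't re-format the same payload.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// Hypothetical sketch: if a trace event describes the same set of descriptors
// over and over within a query, the rendered string can be memoized so that
// enabling tracing adds as little skew as possible to the measured planning
// time. None of these names come from the actual CockroachDB packages.
type descriptorSet struct {
	names []string

	once     sync.Once
	rendered string
}

// String renders the set once and reuses the result for every later trace
// event, instead of re-formatting on each lookup.
func (d *descriptorSet) String() string {
	d.once.Do(func() {
		d.rendered = strings.Join(d.names, ", ")
	})
	return d.rendered
}

// traceEventf stands in for an event call that is skipped entirely when
// tracing / expensive logging is off.
func traceEventf(enabled bool, format string, args ...interface{}) {
	if !enabled {
		return // cheap path when tracing is disabled
	}
	fmt.Printf(format+"\n", args...)
}

func main() {
	ds := &descriptorSet{names: []string{"defaultdb", "public", "status"}}
	for i := 0; i < 3; i++ {
		traceEventf(true, "looked up descriptors: %s", ds) // join happens once
	}
}
```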
https://github.com/cockroachlabs/support/issues/2557 is an example where this is an issue even though it doesn't involve enums.
In that case, https://github.com/cockroachlabs/support/issues/2557 likely has an entirely different cause of slowness. It should be investigated as a separate issue.
So should we close this issue, or is it relevant?
This is relevant. https://github.com/cockroachdb/cockroach/pull/109394 fixes most of the slowness, but this issue is still open because of the point raised above: https://github.com/cockroachdb/cockroach/issues/109228#issuecomment-1690954943
This is a SQL Foundations issue.
When the optbuilder needs a `*types.T` for an enum, it asks the schema resolver for the same information over and over. For user-defined types (enums in particular) this is more expensive than it should be, because we reconstruct the type, including its type metadata, which is O(n) where n is the enum size.
https://github.com/cockroachlabs/support/issues/2537 describes this issue well.
pprof view
![Screenshot 2023-08-21 at 6 19 35 PM](https://github.com/cockroachdb/cockroach/assets/80277990/a5076335-d780-4823-a7db-323a9dd160f4)

We should do some caching of this conversion, or at least not recreate the metadata every time. A simple solution could be to just copy the pointers to the logical and physical rep slices out of the type descriptor instead of doing a deep copy.
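A minimal sketch of that idea, using placeholder types rather than the real `typedesc`/`types` APIs: rather than deep-copying every logical and physical representation on each hydration, alias the slices the (immutable, leased) descriptor already holds.

```go
package main

import "fmt"

// Placeholder for the enum data held by the (immutable, leased) type
// descriptor. These names are illustrative, not CockroachDB's real types.
type enumDescriptor struct {
	logicalReps  []string
	physicalReps [][]byte
}

type enumTypeMeta struct {
	logicalReps  []string
	physicalReps [][]byte
}

// hydrateDeepCopy mirrors the current behavior described above: every
// hydration copies each representation, so the work is O(n) in the number
// of enum members.
func hydrateDeepCopy(d *enumDescriptor) enumTypeMeta {
	logical := make([]string, len(d.logicalReps))
	copy(logical, d.logicalReps)
	physical := make([][]byte, len(d.physicalReps))
	for i, p := range d.physicalReps {
		physical[i] = append([]byte(nil), p...)
	}
	return enumTypeMeta{logicalReps: logical, physicalReps: physical}
}

// hydrateShared is the proposed alternative: since the descriptor is not
// mutated while leased, copy only the slice headers and share the backing
// arrays, which is O(1) regardless of enum size.
func hydrateShared(d *enumDescriptor) enumTypeMeta {
	return enumTypeMeta{logicalReps: d.logicalReps, physicalReps: d.physicalReps}
}

func main() {
	d := &enumDescriptor{
		logicalReps:  []string{"open", "closed", "stale"},
		physicalReps: [][]byte{{0x40}, {0x80}, {0xc0}},
	}
	fmt.Println(hydrateDeepCopy(d).logicalReps)
	fmt.Println(hydrateShared(d).logicalReps)
}
```

The obvious caveat is that the shared slices must never be mutated by consumers of the resulting `*types.T`, which should hold as long as the descriptor is treated as immutable while leased.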
Jira issue: CRDB-30835