sql: Extra string allocations in EXTRACT handling

solongordon commented 7 years ago

See discussion at https://reviewable.io/reviews/cockroachdb/cockroach/19923#-KyS_gHiEzAfKPa0fg4X:-Ky_Yc13IQxe498kphDh:b708j2.

In our handling for the extract built-in, we do string normalization on the precision argument. @justinj and @knz pointed out that there are some unnecessary string allocations here which could be reduced.

Also applies to extract_duration and date_trunc.

Jira issue: CRDB-5946

jordanlewis commented 3 years ago

In general, I think we could fix this and other similar situations by introducing a way to normalize scalar builtin functions and their arguments.

We could have a way of defining a rule that looks at a scalar builtin and lets you normalize its arguments. But I'm not exactly sure whether this is going to be worth doing - how many circumstances of this nature really are there?

It turns out that TPCH query 7 actually uses extract, so I'm going to run a test with that query and see if the extra allocations in this issue actually make a difference or not.

jordanlewis commented 3 years ago

cc @otan @knz - why were we planning to remove extract_duration in 20.2? I see a TODO to that effect that I don't understand.

jordanlewis commented 3 years ago

Using the simpler query

select extract('YeAr' from l_commitdate) as f from lineitem group by f;

we see about 30% of the program's allocations in strings.ToLower :) Though we also see 30% just returning the new float that represents the year, so it's not totally clear that this is such a big deal still.

Still, it really would be nice to have a way to normalize the input arguments to builtins before running them once per every row... it feels like we should be able to somehow define that this function always lower()'s its first argument via the optimizer. Then, since lower() is an immutable function, it'll always get constant folded via the pre-existing constant folding rules 🤔

@cockroachdb/sql-optimizer any thoughts on this? It's not urgent but just curious.

jordanlewis commented 3 years ago

Since extract is super weird, and completely handled by the parser, we could theoretically just tell the parser to add a lower() to the first argument unconditionally, but that seems inflexible, because there are other functions like date_trunc that have this same property that don't get special handling via the parser.

RaduBerinde commented 3 years ago

We could have a crdb_internal_extract_from_lower variant and a normalization rule that converts extract(x) to crdb_internal_extract_from_lower(lower(x)).

github-actions[bot] commented 1 year ago

We have marked this issue as stale because it has been inactive for 18 months. If this issue is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 10 days to keep the issue queue tidy. Thank you for your contribution to CockroachDB!

XiaochenCui commented 3 months ago

This is useful for extract since there are some test cases with uppercase arguments. I'll implement this later and provide a performance report.

More general idea:

I'm also exploring the possibilities of making this a general improvement for all built-in functions. The idea is to fold constants during optimization. However, it seems complicated the issue:

If we translate extract('year', ...) to extract(lower('year'), ...), it will result in significantly more CPU instructions than the original function, which is not good.
If we mark the first argument of extract as lower, it increases the code complexity a lot with little performance improvement.

Issues with the function extract:

By the way, I found that there are many test cases with unquoted arguments:

SELECT extract(hour from timetz '12:01:02.345678-07'), where "hour" is a constant.
SELECT extract(element, input::timestamp), where "element" is a column.

So what are the rules for argument resolution? What if there is a column named "hour"?

cockroachdb / cockroach

sql: Extra string allocations in EXTRACT handling #19965