FirebirdSQL / firebird

Firebird server, client and tools
https://firebirdsql.org
1.26k stars 217 forks source link

A data source for JOIN that is result of aggregating must be used only one time [CORE3013] #3394

Open firebird-automations opened 14 years ago

firebird-automations commented 14 years ago

Submitted by: @pavel-zotov

Consider two variants of query against table EMPLOYEE of "standard" database that is supplied with any FB.

Both variant results the rows with EMP_NO and such SALARY that are greater than average salary of all employees. Also we call generator emp_no_gen to determine the quontity of operations that occures during this action: 1) via NON-correlated subquery: select e.emp_no, e.salary, gen_id(emp_no_gen,0) g from employee e where (e.salary > (select avg(ex.salary)+0*gen_id(emp_no_gen,1) avgval from employee ex))

Sequential running of this variant shows that EMP_NO_GEN increases by 1 at each time. So we can be sure that aggregate part (inside WHERE) is evaluated only once.

2) via JOIN: select e.emp_no, e.salary, gen_id(emp_no_gen,0) g from employee e join (select avg(ex.salary)+0*gen_id(emp_no_gen,1) avgval from employee ex) ea on e.salary > ea.avgval

In this case the EMP_NO_GEN increases by 42 each time (i.e. by step that is equal select count(*) from emp).

I think it will be useful to make Firebird "understand" that such part as 2nd in this join must be evaluated only once, like in variant 1.

firebird-automations commented 14 years ago

Commented by: @asfernandes

You shouldn't use GEN_ID as such and expect consistent results.

For the case you want, FB 3 has window functions, see it in my blog http://asfernandes.blogspot.com.

For now, one way to do that type of thing is with selectable procedures.

firebird-automations commented 14 years ago

Commented by: @pavel-zotov

Adriano,

I'm talking only about FB optimizer and how to make it much smarter. GEN_ID has been used in these samples only for illustration that unnecessary work is done in case of JOIN.

I know that some heuristic rules are placed inside opt.cpp and propose that these rules can be improved.

firebird-automations commented 14 years ago

Commented by: @pavel-zotov

PS. OFFTOP. Just finished read your blog about windowed function. This is THE GREAT WORK! Especially running totals (sum(salary) over (order by salary)) and navigation via LAG / LEAD. Thank you in advance.

firebird-automations commented 14 years ago
Modified by: @dyemanov assignee: Dmitry Yemanov \[ dimitr \]
firebird-automations commented 14 years ago
Modified by: @dyemanov Version: 3\.0 Initial \[ 10301 \] =\>
firebird-automations commented 10 years ago
Modified by: @dyemanov status: Open \[ 1 \] =\> In Progress \[ 3 \]
firebird-automations commented 10 years ago
Modified by: @dyemanov Fix Version: 3\.0 Beta 2 \[ 10586 \]
firebird-automations commented 9 years ago
Modified by: @dyemanov Fix Version: 3\.0 RC 1 \[ 10584 \] Fix Version: 3\.0 Beta 2 \[ 10586 \] =\>
firebird-automations commented 9 years ago
Modified by: @dyemanov status: In Progress \[ 3 \] =\> Open \[ 1 \]
firebird-automations commented 9 years ago
Modified by: @dyemanov Fix Version: 3\.0\.0 \[ 10048 \] Fix Version: 3\.0 RC 1 \[ 10584 \] =\>
firebird-automations commented 8 years ago

Commented by: @dyemanov

I suggest this ticket to be either rewritten or closed in favor of the new one, the current description does not match the test case.

Look at the plan:

Select Expression -> Filter -> Nested Loop Join (inner) -> Aggregate -> Table "EMPLOYEE" as "EA EX" Full Scan -> Table "EMPLOYEE" as "E" Full Scan

Obviously, the aggregated derived table is composed and accessed just once. However, the join condition is evaluated for the every row inside the inner stream E. And the "+0*GEN_ID" part (as well as any other expression) is evaluated as many times too, hence the visible effect.

For the first test case, result of the singleton invariant subquery is cached after its first evaluation. Caching/materializing a scalar value or a result set are completely different things, they must not be compared with each other.

firebird-automations commented 8 years ago

Commented by: @pavel-zotov

From what time FB start to do somewhat like 'cache' (or 'remember') results of aggregating ? I just tried 2.5.0 (yes, release of october-2010) and got:

alter sequence emp_no_gen restart with 0;

select e.emp_no, e.salary, gen_id(emp_no_gen,0) g from employee e join ( select avg(ex.salary + 0 * g ) avgval from ( select salary, gen_id(emp_no_gen,1) g from employee ) ex ----------- CHANGED init. query here: moved gen_id() inside DT! ) ea on e.salary > ea.avgval ;

Output:

EMP_NO SALARY G ======= ===================== ===================== 110 6000000.00 42 118 7480000.00 42 121 99000000.00 42

Trace:

PLAN JOIN (EA EX EMPLOYEE NATURAL, E NATURAL) 3 records fetched 0 ms, 2 read(s), 223 fetch(es), 42 mark(s)

Table Natural Index **************************************************** EMPLOYEE 84

So, perhaps... no problem at all ?.. But I'm sure that at least on 2.5.x I did see many times inefficient statistics per tables...

firebird-automations commented 8 years ago

Commented by: @dyemanov

> From what time FB start to do somewhat like 'cache' (or 'remember') results of aggregating ?

It never did. But the first stream of the nested loop join is always accessed once.

> So, perhaps... no problem at all ?

The problem (stated in the ticket title) exists and I know what exactly you had in mind. But this ticket is plain wrong, maybe just a badly though sample. I will create a proper one (or maybe change this one) a bit later.

firebird-automations commented 8 years ago
Modified by: @dyemanov Fix Version: 3\.0 RC2 \[ 10048 \] =\>