Here is another case. The original query actually times out after 12 hours (there is a lot of data here). There is an obvious missing customer_id index in the left join lateral, so it has to do a seq scan on every loop iteration:
select
  ltrim(ba.foo_number::text, '0'::text) as foo_number,
  ba.field_1::text,
  string_agg(distinct l.customer_id::text, '|') as value
from bars.table_2 ba
inner join customers c on c.person_id = ba.person_id
left join lateral (
  select l.customer_id as customer_id
  from oreos l
  where l.customer_id = c.id
    and l.status_cd in ('happy','happy_super','in_sad','in_sad_super','woohoo')
  limit 1
) l on true
where nullif(ltrim(ba.foo_number::text, '0'::text), '') is not null
  and ba.field_1 is not null
group by 1, 2;
Plan (dexter says no indexes found):
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
GroupAggregate (cost=632357813747.76..632358072383.43 rows=3497595 width=96)
Group Key: (ltrim((ba.foo_number)::text, '0'::text)), ((ba.field_1)::text)
-> Sort (cost=632357813747.76..632357848109.72 rows=13744783 width=68)
Sort Key: (ltrim((ba.foo_number)::text, '0'::text)), ((ba.field_1)::text)
-> Nested Loop Left Join (cost=417941.32..632355620388.12 rows=13744783 width=68)
-> Gather (cost=417941.32..3380621.18 rows=13744783 width=20)
Workers Planned: 3
-> Hash Join (cost=416941.32..2005142.88 rows=4433801 width=20)
Hash Cond: (c.person_id = ba.person_id)
-> Parallel Seq Scan on customers c (cost=0.00..513188.08 rows=3443908 width=8)
-> Hash (cost=266275.53..266275.53 rows=8206463 width=20)
-> Seq Scan on table_2 ba (cost=0.00..266275.53 rows=8206463 width=20)
Filter: ((field_1 IS NOT NULL) AND (NULLIF(ltrim((foo_number)::text, '0'::text), ''::text) IS NOT NULL))
-> Limit (cost=0.00..46006.69 rows=1 width=4)
-> Seq Scan on oreos l (cost=0.00..690100.31 rows=15 width=4)
Filter: ((customer_id = c.id) AND ((status_cd)::text = ANY ('{happy,happy_super,in_sad,in_sad_super,woohoo}'::text[])))
(16 rows)
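The missing index is just a plain btree on oreos.customer_id, something like this (exact definition assumed; the name is taken from the plan below):
create index oreos_customer_id_idx on oreos (customer_id);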
After I manually create an index on customer_id, the plan becomes:
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------
GroupAggregate (cost=71826098.56..72084734.24 rows=3497595 width=96)
Group Key: (ltrim((ba.foo_number)::text, '0'::text)), ((ba.field_1)::text)
-> Sort (cost=71826098.56..71860460.52 rows=13744784 width=68)
Sort Key: (ltrim((ba.foo_number)::text, '0'::text)), ((ba.field_1)::text)
-> Nested Loop Left Join (cost=417941.75..69632738.79 rows=13744784 width=68)
-> Gather (cost=417941.32..3380624.29 rows=13744784 width=20)
Workers Planned: 3
-> Hash Join (cost=416941.32..2005145.89 rows=4433801 width=20)
Hash Cond: (c.person_id = ba.person_id)
-> Parallel Seq Scan on customers c (cost=0.00..513189.15 rows=3443915 width=8)
-> Hash (cost=266275.53..266275.53 rows=8206463 width=20)
-> Seq Scan on table_2 ba (cost=0.00..266275.53 rows=8206463 width=20)
Filter: ((field_1 IS NOT NULL) AND (NULLIF(ltrim((foo_number)::text, '0'::text), ''::text) IS NOT NULL))
-> Limit (cost=0.43..4.80 rows=1 width=4)
-> Index Scan using oreos_customer_id_idx on oreos l (cost=0.43..65.87 rows=15 width=4)
Index Cond: (customer_id = c.id)
Filter: ((status_cd)::text = ANY ('{happy,happy_super,in_sad,in_sad_super,woohoo}'::text[]))
(17 rows)
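Note the per-loop cost of the lateral probe: it drops from 0.00..46006.69 with the seq scan to 0.43..4.80 with the index scan, which is where essentially all of the difference in the nested loop cost (632355620388 vs 69632739) comes from.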
Thanks @jfinzel, can you share the --log-level debug2 output for each of them?
It looks like you are close on both counts. I would love it if the cost-savings threshold were configurable! I am also very interested in why the second query here shows such a massive cost savings but still doesn't get a suggestion.
Last analyze: cnu.foo_info : 2018-02-12T12:45:09-06:00
Last analyze: cnu.customers : 2018-02-12T12:44:40-06:00
Last analyze: bars.table_2 : 2018-02-12T12:44:59-06:00
Last analyze: bars.table_1 : 2018-02-12T12:44:56-06:00
Last analyze: john.people : 2018-02-12T12:45:52-06:00
No new indexes found
--------------------------------------------------------------------------------
Query 0200d1693157e91f1b595826c1df124ea9e56c5832
Start: 1464495.4
Pass1: 961150.04 : bars.table_1 (created_on)
Pass2: 961150.04 : bars.table_1 (created_on)
Final: 1464495.4 : None
Need 50% cost savings to suggest index
select
p.id as person_id,
count(distinct bi.customer_name) as value
from people p
join customers c on c.person_id=p.id
join bars.table_1 dih on dih.customer_id=c.id
join bars.table_2 ba on ba.bar_id=dih.bar_id
join foo_info bi on bi.field_1 = ba.field_1 and bi.customer_name is not null
where dih.bar_type_id=2
and dih.created_on>current_timestamp-interval'30 days'
group by 1
;
Last analyze: cnu.customers : 2018-02-12T12:44:40-06:00
Last analyze: cnu.oreos : 2018-02-12T12:46:57-06:00
Last analyze: bars.table_2 : 2018-02-12T12:44:59-06:00
No new indexes found
--------------------------------------------------------------------------------
Query 022ca61698c1c6a1d725fd73d93db465275781dfdc
Start: 576982091948.52
Pass1: 61343598.12 : cnu.oreos (customer_id), bars.table_2 (person_id)
Pass2: 61343598.12 : cnu.oreos (customer_id), bars.table_2 (person_id)
Final: 576982091948.52 : None
select
ltrim(ba.foo_number :: text,'0'::text) as foo_number, ba.field_1::text
,string_agg(distinct l.customer_id::text, '|') as value
from bars.table_2 ba
inner join customers c on c.person_id= ba.person_id
left join lateral (select l.customer_id as customer_id
from oreos l
where l.customer_id = c.id
and l.status_cd in ('happy','happy_super','in_sad','in_sad_super','woohoo')
limit 1
) l on true
where nullif(ltrim(ba.foo_number :: text,'0'::text),'') is not null and ba.field_1 is not null
group by 1,2;
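To spell out the math in the debug output above: for query 1, the best pass only gets the cost from 1464495.4 down to 961150.04, i.e. (1464495.4 - 961150.04) / 1464495.4 ≈ 34% savings, which is under the 50% threshold. For query 2, the pair of indexes takes the cost from 576982091948.52 down to 61343598.12, a savings of over 99.9%, yet the final suggestion is still None.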
For query 1, it may make sense to have an absolute cost-savings threshold in addition to a percentage, since 30% can be significant if a query has a very high cost.
For query 2, the cost savings is driven by two indexes, and Dexter doesn't currently have a good way to figure out whether both are really needed. That has led to poor index suggestions in my experience, so it won't suggest anything for now. Definitely an area for improvement.
@ankane regarding case 2, I would suggest that dexter try adding the indexes one at a time and checking the cost. I can tell you the customer_id index is 1000x more important than the person_id index here, and you could tell that very easily by adding them in order.
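Something along these lines is what I have in mind, using HypoPG hypothetical indexes (which, as I understand it, is what dexter already uses for costing); the two candidates are the ones from the debug output, and the exact statements are just a sketch:
-- cost with only the customer_id candidate
select hypopg_reset();
select * from hypopg_create_index('create index on oreos (customer_id)');
-- re-run explain on the lateral query above and note the top-level cost
-- cost with only the person_id candidate
select hypopg_reset();
select * from hypopg_create_index('create index on bars.table_2 (person_id)');
-- explain the same query again and compare the two costs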
Also, there are certainly going to be cases where a query could benefit from more than 1 index being added. I haven't yet provided you with some of our queries that have 20+ joins :).
Interesting to note, and perhaps worth documenting: if I lower random_page_cost to 2, which is actually more appropriate for this system, dexter does create the index. Because the suggestions are cost-based, it is important that your planner cost settings are tuned appropriately.
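For anyone who runs into the same thing, checking and changing that setting is straightforward; a minimal sketch (the value 2 is just what suited this system, tune it for your own storage):
show random_page_cost;                  -- default is 4
alter system set random_page_cost = 2;  -- written to postgresql.auto.conf
select pg_reload_conf();                -- apply without a restart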
EDIT: Regarding cost savings, agreed, but it should be configurable in any case: depending on your environment, you may want to tune this value up or down based on how critical read performance is relative to write performance.
@ankane any update or at least ETA on this?
Hey @jfinzel, both cases should be addressed now. There's now a --min-cost-savings-pct option.
dexter --min-cost-savings-pct 30 ...
Let me know how it goes.
Hey @jfinzel, have you had a chance to try the latest version?
@ankane I have been busy but am now getting around to this. I have updated to latest version and will let you know how it goes. Many thanks again!
Hey @jfinzel, going to close this out. Let me know if you find anything.
Here is one case of a query whose index is not being picked up. I was expecting an index on a timestamp field that filters out over 90% of the data.
SQL:
Plan (it said no indexes found):
Plan after I created an index on created_on: