Here is another case. The original query actually times out after 12 hours (there is a lot of data here). There is an obvious missing customer_id index in the left join lateral, so it has to do a seq scan on every loop iteration:
select
  ltrim(ba.foo_number::text, '0'::text) as foo_number,
  ba.field_1::text,
  string_agg(distinct l.customer_id::text, '|') as value
from bars.table_2 ba
inner join customers c on c.person_id = ba.person_id
left join lateral (
  select l.customer_id as customer_id
  from oreos l
  where l.customer_id = c.id
    and l.status_cd in ('happy','happy_super','in_sad','in_sad_super','woohoo')
  limit 1
) l on true
where nullif(ltrim(ba.foo_number::text, '0'::text), '') is not null
  and ba.field_1 is not null
group by 1, 2;
Plan (dexter says no indexes found):
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
GroupAggregate (cost=632357813747.76..632358072383.43 rows=3497595 width=96)
Group Key: (ltrim((ba.foo_number)::text, '0'::text)), ((ba.field_1)::text)
-> Sort (cost=632357813747.76..632357848109.72 rows=13744783 width=68)
Sort Key: (ltrim((ba.foo_number)::text, '0'::text)), ((ba.field_1)::text)
-> Nested Loop Left Join (cost=417941.32..632355620388.12 rows=13744783 width=68)
-> Gather (cost=417941.32..3380621.18 rows=13744783 width=20)
Workers Planned: 3
-> Hash Join (cost=416941.32..2005142.88 rows=4433801 width=20)
Hash Cond: (c.person_id = ba.person_id)
-> Parallel Seq Scan on customers c (cost=0.00..513188.08 rows=3443908 width=8)
-> Hash (cost=266275.53..266275.53 rows=8206463 width=20)
-> Seq Scan on table_2 ba (cost=0.00..266275.53 rows=8206463 width=20)
Filter: ((field_1 IS NOT NULL) AND (NULLIF(ltrim((foo_number)::text, '0'::text), ''::text) IS NOT NULL))
-> Limit (cost=0.00..46006.69 rows=1 width=4)
-> Seq Scan on oreos l (cost=0.00..690100.31 rows=15 width=4)
Filter: ((customer_id = c.id) AND ((status_cd)::text = ANY ('{happy,happy_super,in_sad,in_sad_super,woohoo}'::text[])))
(16 rows)
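The missing index is just a plain btree on oreos.customer_id, something like this (exact definition assumed; the name is taken from the plan below):
create index oreos_customer_id_idx on oreos (customer_id);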
After I manually create an index on customer_id, the plan becomes:
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------
GroupAggregate (cost=71826098.56..72084734.24 rows=3497595 width=96)
Group Key: (ltrim((ba.foo_number)::text, '0'::text)), ((ba.field_1)::text)
-> Sort (cost=71826098.56..71860460.52 rows=13744784 width=68)
Sort Key: (ltrim((ba.foo_number)::text, '0'::text)), ((ba.field_1)::text)
-> Nested Loop Left Join (cost=417941.75..69632738.79 rows=13744784 width=68)
-> Gather (cost=417941.32..3380624.29 rows=13744784 width=20)
Workers Planned: 3
-> Hash Join (cost=416941.32..2005145.89 rows=4433801 width=20)
Hash Cond: (c.person_id = ba.person_id)
-> Parallel Seq Scan on customers c (cost=0.00..513189.15 rows=3443915 width=8)
-> Hash (cost=266275.53..266275.53 rows=8206463 width=20)
-> Seq Scan on table_2 ba (cost=0.00..266275.53 rows=8206463 width=20)
Filter: ((field_1 IS NOT NULL) AND (NULLIF(ltrim((foo_number)::text, '0'::text), ''::text) IS NOT NULL))
-> Limit (cost=0.43..4.80 rows=1 width=4)
-> Index Scan using oreos_customer_id_idx on oreos l (cost=0.43..65.87 rows=15 width=4)
Index Cond: (customer_id = c.id)
Filter: ((status_cd)::text = ANY ('{happy,happy_super,in_sad,in_sad_super,woohoo}'::text[]))
(17 rows)
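Note the per-loop cost of the lateral probe: it drops from 0.00..46006.69 with the seq scan to 0.43..4.80 with the index scan, which is where essentially all of the difference in the nested loop cost (632355620388 vs 69632739) comes from.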
Thanks @jfinzel, can you share the --log-level debug2 output for each of them?
It looks like you are close on both counts. I would love it if the cost-savings threshold were configurable! I am also very interested in why the second query here shows such a massive cost savings but still doesn't get a suggestion.
Last analyze: cnu.foo_info : 2018-02-12T12:45:09-06:00
Last analyze: cnu.customers : 2018-02-12T12:44:40-06:00
Last analyze: bars.table_2 : 2018-02-12T12:44:59-06:00
Last analyze: bars.table_1 : 2018-02-12T12:44:56-06:00
Last analyze: john.people : 2018-02-12T12:45:52-06:00
No new indexes found
--------------------------------------------------------------------------------
Query 0200d1693157e91f1b595826c1df124ea9e56c5832
Start: 1464495.4
Pass1: 961150.04 : bars.table_1 (created_on)
Pass2: 961150.04 : bars.table_1 (created_on)
Final: 1464495.4 : None
Need 50% cost savings to suggest index
select
p.id as person_id,
count(distinct bi.customer_name) as value
from people p
join customers c on c.person_id=p.id
join bars.table_1 dih on dih.customer_id=c.id
join bars.table_2 ba on ba.bar_id=dih.bar_id
join foo_info bi on bi.field_1 = ba.field_1 and bi.customer_name is not null
where dih.bar_type_id=2
and dih.created_on>current_timestamp-interval'30 days'
group by 1
;
Last analyze: cnu.customers : 2018-02-12T12:44:40-06:00
Last analyze: cnu.oreos : 2018-02-12T12:46:57-06:00
Last analyze: bars.table_2 : 2018-02-12T12:44:59-06:00
No new indexes found
--------------------------------------------------------------------------------
Query 022ca61698c1c6a1d725fd73d93db465275781dfdc
Start: 576982091948.52
Pass1: 61343598.12 : cnu.oreos (customer_id), bars.table_2 (person_id)
Pass2: 61343598.12 : cnu.oreos (customer_id), bars.table_2 (person_id)
Final: 576982091948.52 : None
select
ltrim(ba.foo_number :: text,'0'::text) as foo_number, ba.field_1::text
,string_agg(distinct l.customer_id::text, '|') as value
from bars.table_2 ba
inner join customers c on c.person_id= ba.person_id
left join lateral (select l.customer_id as customer_id
from oreos l
where l.customer_id = c.id
and l.status_cd in ('happy','happy_super','in_sad','in_sad_super','woohoo')
limit 1
) l on true
where nullif(ltrim(ba.foo_number :: text,'0'::text),'') is not null and ba.field_1 is not null
group by 1,2;
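To spell out the math in the debug output above: for query 1, the best pass only gets the cost from 1464495.4 down to 961150.04, i.e. (1464495.4 - 961150.04) / 1464495.4 ≈ 34% savings, which is under the 50% threshold. For query 2, the pair of indexes takes the cost from 576982091948.52 down to 61343598.12, a savings of over 99.9%, yet the final suggestion is still None.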
For query 1, it may make sense to have an absolute cost-savings threshold in addition to a percentage, since 30% can be significant if a query has a very high cost.
For query 2, the cost savings is driven by two indexes, and Dexter doesn't currently have a good way to figure out whether both are really needed. That has led to poor index suggestions in my experience, so it won't suggest anything for now. Definitely an area for improvement.
@ankane regarding case 2, I would suggest that dexter try adding the indexes one at a time and checking the cost. I can tell you the customer_id index is 1000x more important than the person_id index here, and you could tell that very easily by adding them in order.
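Something along these lines is what I have in mind, using HypoPG hypothetical indexes (which, as I understand it, is what dexter already uses for costing); the two candidates are the ones from the debug output, and the exact statements are just a sketch:
-- cost with only the customer_id candidate
select hypopg_reset();
select * from hypopg_create_index('create index on oreos (customer_id)');
-- re-run explain on the lateral query above and note the top-level cost
-- cost with only the person_id candidate
select hypopg_reset();
select * from hypopg_create_index('create index on bars.table_2 (person_id)');
-- explain the same query again and compare the two costs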
Also, there are certainly going to be cases where a query could benefit from more than 1 index being added. I haven't yet provided you with some of our queries that have 20+ joins :).
Interesting to note, and perhaps worth documenting: if I lower random_page_cost to 2, which is actually more appropriate for this system, dexter does create the index. Because the suggestions are cost-based, it is important that your planner cost settings are tuned appropriately.
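For anyone who runs into the same thing, checking and changing that setting is straightforward; a minimal sketch (the value 2 is just what suited this system, tune it for your own storage):
show random_page_cost;                  -- default is 4
alter system set random_page_cost = 2;  -- written to postgresql.auto.conf
select pg_reload_conf();                -- apply without a restart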
EDIT: Regarding cost savings, agreed, but it should be configurable in any case: depending on your environment, you may want to tune this value up or down based on how critical read performance is relative to write performance.
@ankane any update or at least ETA on this?
Hey @jfinzel, both cases should be addressed now. There's now a --min-cost-savings-pct option.
dexter --min-cost-savings-pct 30 ...
Let me know how it goes.
Hey @jfinzel, have you had a chance to try the latest version?
@ankane I have been busy but am now getting around to this. I have updated to latest version and will let you know how it goes. Many thanks again!
Hey @jfinzel, going to close this out. Let me know if you find anything.
Here is one case of a query whose index is not being picked up. I was expecting an index on a timestamp field that filters out over 90% of the data.
SQL:
Plan (it said no indexes found):
Plan after I created an index on created_on: