Index size estimation is sometimes very inaccurate

HypoPG / hypopg

Hypothetical Indexes for PostgreSQL

https://hypopg.readthedocs.io

Other

1.39k stars 59 forks source link

Index size estimation is sometimes very inaccurate #46

Closed Bensk1 closed 4 years ago

Bensk1 commented 4 years ago

I have a TPC-H data set with a scale factor of 10 in Postgres 12 (12.1 (Ubuntu 12.1-1.pgdg19.10+1). If I create a multi-column index, it is in certain cases larger than a single-column index:

select * from hypopg_create_index('create index on lineitem (l_partkey,l_extendedprice,l_quantity)');
select * from hypopg_create_index('create index on lineitem (l_partkey)');

tpch___10=# SELECT indexname, pg_size_pretty(hypopg_relation_size(indexrelid)) FROM hypopg_list_indexes() ;
                           indexname                           | pg_size_pretty
---------------------------------------------------------------+----------------
 <15540732>btree_lineitem_l_partkey_l_extendedprice_l_quantity | 164 MB
 <15540733>btree_lineitem_l_partkey                            | 1495 MB
(2 rows)

The results are the same if I exchange l_partkey, with l_orderkey. If I physically create these indexes via CREATE INDEX, the multi-column index has 2.2 and the single-column 1.1 GB.

Bensk1 commented 4 years ago

This seems to work better for a scale factor of 1. The multi-column index is estimated to have roughly 430 MB vs 150 MB for the single-column index.

rjuju commented 4 years ago

Hello,

Oh, that's interesting. Did you run an analyze on the table?

Could you run this query and show the output?

select attname, avg_width from pg_stats where tablename = 'lineitem';

Bensk1 commented 4 years ago

Yes, I did run analyye and also tried with analyze lineitem.

The result of your query:

tpch___10=# select attname, avg_width from pg_stats where tablename = 'lineitem';
     attname     | avg_width
-----------------+-----------
 l_receiptdate   |         4
 l_tax           |         4
 l_extendedprice |         8
 l_partkey       |         4
 l_orderkey      |         4
 l_suppkey       |         4
 l_shipdate      |         4
 l_returnflag    |         2
 l_quantity      |         5
 l_linenumber    |         4
 l_shipmode      |        11
 l_discount      |         4
 l_shipinstruct  |        26
 l_comment       |        27
 l_linestatus    |         2
 l_commitdate    |         4
(16 rows)

rjuju commented 4 years ago

Thanks! I think this is simply an overflow, and I unfortunately never tested an estimated size bigger than 4GB.

I'll look into it. BTW it's already clear that the estimated size will be twice as big as your index, but this is intended, as it tries to account for bloat that will happen once you start updating the index.

rjuju commented 4 years ago

This should be fixed as of https://github.com/HypoPG/hypopg/commit/2e8cca5fa2fc8ce30e672e7ebbdc6f35a0e72946 or https://github.com/HypoPG/hypopg/commit/b41bd64e6956f77a1d37474f0cac1aab612a72d1, depending on whether you're using v1 or v2.

Thanks a lot for the report!

Bensk1 commented 4 years ago

Thank you so much for your quick response and fix.