HypoPG / hypopg

Hypothetical Indexes for PostgreSQL
https://hypopg.readthedocs.io

Add functional dependency between columns from extended statistics #49

Closed: sfkeller closed this issue 4 years ago

sfkeller commented 4 years ago

This is a feature suggestion: a function hypopg_create_statistics, analogous to HypoPG's existing hypopg_create_index.

The optimizer assumes that columns are independent, so a query plan can be improved not only by an additional index but also by letting the optimizer know about functional dependencies (FDs) between columns. Real-world datasets often contain FDs between columns, because of denormalization or simply because the data is like that.

PostgreSQL offers CREATE STATISTICS for declaring FDs.

Expected behaviour: add hypopg_create_statistics in order to simulate the existence of an FD.

Example: looking e.g. at this test database from the ATP tennis tour, there is at least one FD, namely between tourney_id and tourney_name. So this is what's usually done:

CREATE STATISTICS atp_matches_2019_tourney_id_tourney_name (DEPENDENCIES) 
    ON tourney_id, tourney_name 
    FROM atp_matches_2019;

ANALYZE atp_matches_2019; -- update extended statistics

SELECT * FROM pg_statistic_ext; -- now shows the entry of "atp_matches_2019_tourney_id_tourney_name"

BTW: CREATE STATISTICS currently also supports ndistinct, which enables n-distinct statistics, and mcv, which enables most-common-value lists.
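
By analogy with hypopg_create_index, the proposed function might be used like this (a sketch only: hypopg_create_statistics does not exist, and its name, signature, and session-local behaviour are assumptions):

-- Hypothetical sketch: this function does not exist in HypoPG today.
-- It would parse a CREATE STATISTICS statement and register session-local,
-- in-memory extended statistics, mirroring hypopg_create_index.
SELECT hypopg_create_statistics(
    'CREATE STATISTICS atp_matches_2019_tourney_id_tourney_name (dependencies)
         ON tourney_id, tourney_name
         FROM atp_matches_2019');

-- A plain EXPLAIN in the same session would then use the simulated FD
-- (predicate values are illustrative):
EXPLAIN SELECT * FROM atp_matches_2019
 WHERE tourney_id = '2019-0451' AND tourney_name = 'Some Tournament';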

rjuju commented 4 years ago

Hi,

I'm not sure what you'd actually expect from this feature, as creating extended statistics shouldn't be a huge problem, even on a busy production server. Do you want to reproduce what CREATE STATISTICS does (so with the requirement to then take some samples to compute the statistics), but without the ShareUpdateExclusiveLock and/or with the ability to do it on a read-only standby? Or do you want to simulate specific fake FD statistics for some values, by telling hypopg what the extended statistic rows should be?

Maybe you'd instead want to detect which FDs are missing, and for that pg_qualstats can help you, since it reports the selectivity estimation error as of version 2.
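
For reference, something along these lines would surface the worst estimates (a sketch: the column names, e.g. mean_err_estimate_ratio, are taken from pg_qualstats 2.x and should be verified against the installed version):

-- Quals whose selectivity estimates were furthest off (pg_qualstats >= 2.0;
-- column names are assumptions to check against your version).
SELECT lrelid::regclass AS tbl, lattnum, opno, mean_err_estimate_ratio
  FROM pg_qualstats
 ORDER BY mean_err_estimate_ratio DESC NULLS LAST
 LIMIT 10;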

sfkeller commented 4 years ago

Thx for pointing me to pg_qualstats. I'll have a look at it.

As for the expectations for this feature: @ankane, can you please help explain, since you suggested here https://github.com/ankane/dexter/issues/32#issuecomment-613167868 that Dexter could use this feature?

ankane commented 4 years ago

Hey @rjuju, it'd be the first case (reproduce what CREATE STATISTICS does without locks and only visible to the current session). For index suggestions, Dexter essentially brute forces hypothetical indexes to see if they improve the query plan, and with a feature like this, I think it could do the same for statistics.
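
For context, that existing loop uses HypoPG's real index API, which the proposal would mirror for statistics (table and predicate values here are illustrative):

-- Create a hypothetical index, visible to this session only.
SELECT * FROM hypopg_create_index(
    'CREATE INDEX ON atp_matches_2019 (tourney_id)');

-- A plain EXPLAIN now considers the hypothetical index: compare the
-- estimated costs with and without it.
EXPLAIN SELECT * FROM atp_matches_2019 WHERE tourney_id = '2019-0451';

-- Discard all hypothetical indexes in this session.
SELECT hypopg_reset();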

rjuju commented 4 years ago

One thing still bothers me. For extended statistics investigation, I'm not sure that having the EXPLAIN without ANALYZE is really that useful. Unlike hypothetical indexes (where you want to know if the index is used, and how), here you probably want to see how far off the estimates are, with and without various extended statistics, compared to the real execution, right? That's clearly doable in hypopg, but it's a radical change in its heuristics.

ankane commented 4 years ago

Hey @rjuju, thanks for the response! I was thinking about it from the perspective of only recommending them if they significantly improved the estimated query plan (rather than because they improved row estimates). Here's an example from this post.

CREATE TABLE pgqs AS SELECT i % 2 AS val1, (i + 1) % 2 AS val2 FROM generate_series(1, 50000) i;

ANALYZE pgqs;

EXPLAIN SELECT * FROM pgqs t1 JOIN pgqs t2 ON t1.val1 = t2.val1 WHERE t1.val1 = 0 AND t1.val2 = 0 ORDER BY t1.val2;

CREATE STATISTICS pgqs_stats ON val1, val2 FROM pgqs;

ANALYZE pgqs;

EXPLAIN SELECT * FROM pgqs t1 JOIN pgqs t2 ON t1.val1 = t2.val1 WHERE t1.val1 = 0 AND t1.val2 = 0 ORDER BY t1.val2;

In this case, the plan goes from:

 Nested Loop  (cost=0.00..3884194.00 rows=310587500 width=20)
   ->  Seq Scan on pgqs t2  (cost=0.00..847.00 rows=24847 width=8)
         Filter: (val1 = 0)
   ->  Materialize  (cost=0.00..1034.50 rows=12500 width=8)
         ->  Seq Scan on pgqs t1  (cost=0.00..972.00 rows=12500 width=8)
               Filter: ((val1 = 0) AND (val2 = 0))

to:

 Nested Loop  (cost=0.00..2069.52 rows=25052 width=20)
   ->  Seq Scan on pgqs t1  (cost=0.00..972.00 rows=1 width=8)
         Filter: ((val1 = 0) AND (val2 = 0))
   ->  Seq Scan on pgqs t2  (cost=0.00..847.00 rows=25052 width=8)
         Filter: (val1 = 0)

So extended statistics would be recommended. However, I haven't used extended statistics in a production environment, so I may be missing something.

Also, if injecting hypothetical statistics into the query planner is significantly different from injecting hypothetical indexes, it's probably not worth the trouble.

rjuju commented 4 years ago

My point was that you maybe can't naively assume that "25052 rows has to be correct while 310587500 has to be wrong, and the only reason it changed was the extended statistics". You don't know the real selectivity, and you may also be comparing outdated statistics with ones you freshly gathered when creating the hypothetical statistics.

ankane commented 4 years ago

Makes sense. I wouldn't expect it to be correct all of the time (same with hypothetical indexes).

rjuju commented 4 years ago

So I looked a little bit at the extended statistics code, and while the definition of the extended statistics seems pluggable (you only need to fill RelOptInfo->statlist), the data part unfortunately doesn't seem to be pluggable. See statext_mcv_load (https://github.com/postgres/postgres/blob/master/src/backend/statistics/mcv.c#L557) called in mcv_clauselist_selectivity, which can't be worked around.
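
For the record, the split between definition and data is visible at the SQL level too (PostgreSQL 12 or later; a sketch for inspection only):

-- The definition lives in pg_statistic_ext; the data built by ANALYZE
-- lives in pg_statistic_ext_data. The planner reads the data through its
-- catalog cache, which 3rd-party code cannot populate with fake rows.
SELECT s.stxname,
       d.stxdndistinct    IS NOT NULL AS has_ndistinct,
       d.stxddependencies IS NOT NULL AS has_dependencies,
       d.stxdmcv          IS NOT NULL AS has_mcv
  FROM pg_statistic_ext s
  JOIN pg_statistic_ext_data d ON d.stxoid = s.oid;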

sfkeller commented 4 years ago

I admittedly don't know the HypoPG and PG planner internals (RelOptInfo?) well enough, so excuse my newbie questions:

(1.) Why is "the data part unfortunately doesn't seem to be pluggable" a showstopper? Because HypoPG can't "fake" it, and because working around that would mean refactoring the PG mcv.c code?

(2.) Any planner should consider as much statistics as it gets, and with CREATE STATISTICS it now gets some more, like multivariate statistics (dependencies, ndistinct and mcv) [1] [2]. Why wouldn't "hypothetical statistics info" help simulate this, similar to "hypothetical indexes"?

[1] https://www.citusdata.com/blog/2018/03/06/postgres-planner-and-its-usage-of-statistics/
[2] https://www.postgresql.org/docs/current/planner-stats-details.html

rjuju commented 4 years ago

Oh sorry, I should have given more details. For the record, RelOptInfo is a C data structure that holds a lot of metadata about the relations being used during planning. For instance, it contains the list of indexes (as a list of IndexOptInfo, which similarly describes an index), which 3rd-party code can manipulate to add hypothetical indexes, or even to prevent the planner from considering some index.

1) hypopg can't fake it because core postgres retrieves the content from its cache, and it's not possible to insert fake data into that cache. So yes, the only way to allow 3rd-party code to provide hypothetical extended statistics would be a new hook that is called, when provided, instead of statext_mcv_load.

2) I'm not sure that I get your question. Do you mean "is there any reason not to support 3rd-party-provided extended statistics, the same as we already support 3rd-party regular statistics"? If that's the case, I'm not sure why it wasn't added when the feature was introduced. I guess no one thought about it, and no one asked until now.

sfkeller commented 4 years ago

Thx.

> If that's the case, I'm not sure why it wasn't added when the feature was introduced. I guess no one thought about it, and no one asked until now.

Do you think it's much work to implement this feature (e.g. for a student who is a good C++ programmer but uninitiated in PG internals)? Or should I first ask over at the PG IRC channel or on the PG dev mailing list?

rjuju commented 4 years ago

The change in postgres should be trivial (more or less: add the hook declaration and call the hook rather than the function when it's not NULL, and there are multiple other hook implementations to take inspiration from), but asking on pgsql-hackers first could be a good idea, to make sure that this is wanted and that there are no caveats.

ankane commented 4 years ago

Thanks for looking into it, and for the explanation, @rjuju.

@sfkeller please let us know what you hear.

sfkeller commented 4 years ago

> @sfkeller please let us know what you hear.

Yes, will do.

And because I think Automated Database Query Optimization is a very hot topic, I dared to involve @s-hironobu sensei and @laurenz in the discussion.

laurenz commented 4 years ago

I somewhat agree with @rjuju. Extended statistics are great, but if you don't know the actual values, you cannot tell if they improve an estimate.

Extended statistics can work both ways. Here is a little example:

CREATE TABLE address (
   city integer NOT NULL,
   zip integer NOT NULL,
   street integer NOT NULL
);

INSERT INTO address
   SELECT i/10000 + 1, i/100 + 1, i
   FROM generate_series(1, 1000000) AS i;

ANALYZE address;

city and zip are correlated. Now I run two queries, one where the ZIP code is in the city and one where it isn't:

EXPLAIN (ANALYZE) SELECT * FROM address WHERE city = 42 AND zip = 4200;

                                                QUERY PLAN                                                
----------------------------------------------------------------------------------------------------------
 Seq Scan on address  (cost=0.00..20406.00 rows=1 width=12) (actual time=28.848..55.878 rows=100 loops=1)
   Filter: ((city = 42) AND (zip = 4200))
   Rows Removed by Filter: 999900
 Planning Time: 0.111 ms
 Execution Time: 55.944 ms
(5 rows)

EXPLAIN (ANALYZE) SELECT * FROM address WHERE city = 42 AND zip = 4300;

                                               QUERY PLAN                                               
--------------------------------------------------------------------------------------------------------
 Seq Scan on address  (cost=0.00..20406.00 rows=1 width=12) (actual time=54.109..54.109 rows=0 loops=1)
   Filter: ((city = 42) AND (zip = 4300))
   Rows Removed by Filter: 1000000
 Planning Time: 0.125 ms
 Execution Time: 54.145 ms
(5 rows)

The first estimate is off, the second is very good.

Now let's calculate extended statistics and try again:

CREATE STATISTICS addr_stats (dependencies) ON city, zip FROM address;

ANALYZE address;

EXPLAIN (ANALYZE) SELECT * FROM address WHERE city = 42 AND zip = 4200;

                                                 QUERY PLAN                                                 
------------------------------------------------------------------------------------------------------------
 Seq Scan on address  (cost=0.00..20406.00 rows=100 width=12) (actual time=22.586..48.996 rows=100 loops=1)
   Filter: ((city = 42) AND (zip = 4200))
   Rows Removed by Filter: 999900
 Planning Time: 0.219 ms
 Execution Time: 49.020 ms
(5 rows)

EXPLAIN (ANALYZE) SELECT * FROM address WHERE city = 42 AND zip = 4300;

                                                QUERY PLAN                                                
----------------------------------------------------------------------------------------------------------
 Seq Scan on address  (cost=0.00..20406.00 rows=100 width=12) (actual time=53.904..53.904 rows=0 loops=1)
   Filter: ((city = 42) AND (zip = 4300))
   Rows Removed by Filter: 1000000
 Planning Time: 0.106 ms
 Execution Time: 53.933 ms
(5 rows)

Now the first estimate is spot on, but the second is off.

Since bad row count estimates will lead to bad cost estimates further down the road, a reduced cost value caused by extended statistics is not proof that things actually got better.

sfkeller commented 4 years ago

Sorry to ask: but why, in the second example, is the second query plan off? Isn't the actual "time=53.904..53.904" much closer to "actual time=22.586..48.996", given "Execution Time: 53.933 ms"?

And more basic: IMHO, if extended statistics are not contributing to better row count estimates in practical situations, then either the statistics are outdated (which means a DB admin can do something about it), or the current PG planner is failing to take e.g. knowledge of functional dependencies into account, isn't it?

laurenz commented 4 years ago

The plan is off because it estimates 100 rows, when really there are 0. The execution time is pretty much the same in all examples, because the database does pretty much the same thing.

As I said, misestimates in row count may lead to misestimates in execution costs later on (that "later on" is not part of my example). I am just trying to show that extended statistics can lead to worse estimates. So without seeing the real values, it is hard to say if extended statistics improve a query plan or not.

The dependency adjustment made with extended statistics relies on the queries using matching values. In this example, extended statistics rely on people always searching for existing city/zip combinations. If you cannot rely on that, you are better off not using extended statistics.
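
To make the "later on" above concrete, here is a hypothetical follow-up (a sketch only; these plans were not run and are not part of the example):

-- Feeding the misestimate into a join: with the FD statistics the planner
-- expects ~100 rows for city = 42 AND zip = 4300 where the true count is 0,
-- which can tip it toward the wrong join strategy.
EXPLAIN SELECT *
  FROM address a
  JOIN address b ON a.street = b.street
 WHERE a.city = 42 AND a.zip = 4300;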

sfkeller commented 4 years ago

So, are you saying: because there is a chance that plans get worse without perfect statistics, let's forget about pg_statistic_ext (or pg_stats) altogether? Or is this just a warning about the current (sampling-based) statistics approach in PG? Or are you just saying that pg_statistic_ext is not yet good enough?

_(Shameless plug: IMHO this is going off topic. Let's take the chance to discuss this on my next online seminar :-) https://www.swisspug.org/wiki/index.php/Agenda#Seminar:_On_Pure_Synthetic_Data_Generation_for_Databases )_

laurenz commented 4 years ago

I don't think this is going off-topic.

@rjuju said:

> One thing still bothers me. For extended statistics investigation, I'm not sure that having the EXPLAIN without ANALYZE is really that useful. Unlike hypothetical indexes (where you want to know if the index is used, and how), here you probably want to see how far off the estimates are, with and without various extended statistics, compared to the real execution, right?

Extended statistics are incredibly cool, but from a plain EXPLAIN (without ANALYZE) you cannot tell if they will improve the query plan.

The whole raison d'être of HypoPG is to create an object (an index) hypothetically, because creating a real index takes a long time, uses disk space, and takes a SHARE lock on the table that blocks concurrent writes. So you can answer the question “can that index be used” without actually creating the index, which is useful.

Extended statistics fall into a different category: they are easy to create, don't use a lot of disk space, and need only a ShareUpdateExclusiveLock on the table. So why create hypothetical extended statistics when you can create real ones? And while you can get useful insights from a hypothetical index with a plain EXPLAIN, that is not so certain for extended statistics (as I tried to explain).

sfkeller commented 4 years ago

Now I understand! Thx for your patience. So, we can close this issue for now?

rjuju commented 4 years ago

Agreed. Thanks all for the feedback!