Hi,
I'm not sure what you'd actually expect from this feature, as creating extended statistics shouldn't be a huge problem, even on a busy production server. Do you want to reproduce what CREATE STATISTICS does (so with the requirement to then take some samples to compute the statistics) but without the ShareUpdateExclusiveLock and/or to be able to do it on a read-only standby, or to simulate specific fake FD statistics for some values by specifying what the extended statistics rows should contain?
Maybe you'd instead want to detect which FDs are missing, and for that pg_qualstats can help you, since it reports the selectivity estimation error as of v2.
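A rough sketch of what such a check could look like; the error-estimation columns only exist in pg_qualstats 2.x, and the exact column names used here are an assumption to verify against the installed version:
-- quals whose selectivity estimates were furthest off, per relation/attribute
SELECT lrelid::regclass AS relation,
       lattnum          AS attnum,
       mean_err_estimate_ratio
FROM pg_qualstats
ORDER BY mean_err_estimate_ratio DESC NULLS LAST
LIMIT 10;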
Thx for pointing me to pg_qualstats. I'll have a look at it.
As for the expectations for this feature, could you @ankane please help explain, since you suggested here https://github.com/ankane/dexter/issues/32#issuecomment-613167868 that Dexter could use this feature?
Hey @rjuju, it'd be the first case (reproduce what CREATE STATISTICS does without locks and only visible to the current session). For index suggestions, Dexter essentially brute forces hypothetical indexes to see if they improve the query plan, and with a feature like this, I think it could do the same for statistics.
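For reference, that brute-force loop boils down to something like the sketch below, using hypopg's existing hypopg_create_index and hypopg_reset functions (the table and columns are just the pgqs example from further down in this thread):
-- hypothetical index, visible only to this session and never written to disk
SELECT * FROM hypopg_create_index('CREATE INDEX ON pgqs (val1, val2)');
-- check whether the planner would use it and how the estimated cost changes
EXPLAIN SELECT * FROM pgqs WHERE val1 = 0 AND val2 = 0;
-- discard all hypothetical indexes again
SELECT hypopg_reset();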
One thing still bothers me. For extended statistics investigation, I'm not sure that having the explain without analyze is really that useful. Unlike hypothetical indexes (where you want to know if the index is used and how), here you probably want to see how far off the estimates are with and without various extended statistics compared to the real execution plan, right? So that's clearly doable in hypopg, but that's a radical change in its heuristics.
Hey @rjuju, thanks for the response! I was thinking about it from the perspective of only recommending them if they significantly improved the estimated query plan (rather than improving row estimates). Here's an example from this post.
CREATE TABLE pgqs AS SELECT i%2 val1, (i+1)%2 val2 FROM generate_series(1, 50000) i;
ANALYZE pgqs;
EXPLAIN SELECT * FROM pgqs t1 JOIN pgqs t2 ON t1.val1 = t2.val1 WHERE t1.val1 = 0 AND t1.val2 = 0 order by t1.val2;
CREATE STATISTICS pgqs_stats ON val1, val2 FROM pgqs;
ANALYZE pgqs;
EXPLAIN SELECT * FROM pgqs t1 JOIN pgqs t2 ON t1.val1 = t2.val1 WHERE t1.val1 = 0 AND t1.val2 = 0 order by t1.val2;
In this case, the plan goes from:
Nested Loop (cost=0.00..3884194.00 rows=310587500 width=20)
-> Seq Scan on pgqs t2 (cost=0.00..847.00 rows=24847 width=8)
Filter: (val1 = 0)
-> Materialize (cost=0.00..1034.50 rows=12500 width=8)
-> Seq Scan on pgqs t1 (cost=0.00..972.00 rows=12500 width=8)
Filter: ((val1 = 0) AND (val2 = 0))
to:
Nested Loop (cost=0.00..2069.52 rows=25052 width=20)
-> Seq Scan on pgqs t1 (cost=0.00..972.00 rows=1 width=8)
Filter: ((val1 = 0) AND (val2 = 0))
-> Seq Scan on pgqs t2 (cost=0.00..847.00 rows=25052 width=8)
Filter: (val1 = 0)
So extended statistics would be recommended. However, I haven't used extended statistics in a production environment, so I may be missing something.
Also, if injecting hypothetical statistics into the query planner is significantly different than hypothetical indexes, it's probably not worth the trouble.
My point was that you maybe can't naively assume that "25052 rows has to be correct while 310587500 has to be wrong, and the only reason it changed was the extended statistics". You don't know the real selectivity, and you may also be comparing outdated statistics with ones you freshly gathered when creating the hypothetical statistics.
Makes sense. I wouldn't expect it to be correct all of the time (same with hypothetical indexes).
So I looked a little bit at the extended statistics code, and while the definition of the extended statistics seems pluggable (you only need to fill RelOptInfo->statlist), the data part unfortunately doesn't seem to be pluggable. See statext_mcv_load (https://github.com/postgres/postgres/blob/master/src/backend/statistics/mcv.c#L557), called in mcv_clauselist_selectivity, which can't be worked around.
I admittedly don't know the HypoPG and PG planner internals (RelOptInfo?) well enough. So excuse my newbie questions:
(1.) Why is "the data part unfortunately doesn't seem to be pluggable" a show stopper? Because HypoPG can't "fake" it, and because this would mean refactoring the PG mcv.c code?
Then (2.): Any planner should consider as much statistics as it gets, and now with CREATE STATISTICS it gets some more, like multivariate statistics (dependencies, ndistinct and mcv) [1] [2]. Why wouldn't "hypothetical statistics info" help simulate this info, similar to "hypothetical indexes"?
[1] https://www.citusdata.com/blog/2018/03/06/postgres-planner-and-its-usage-of-statistics/ [2] https://www.postgresql.org/docs/current/planner-stats-details.html
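For context, these statistic kinds can be requested explicitly when defining a statistics object; the names below are just placeholders, and the mcv kind needs PostgreSQL 12 or later:
-- one statistics object can combine several kinds for the same column group
CREATE STATISTICS some_stats (dependencies, ndistinct, mcv) ON col_a, col_b FROM some_table;
ANALYZE some_table;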
Oh sorry, I should have given more details. For the record, RelOptInfo is a C data structure that holds a lot of metadata about relations that are being used during planning. For instance, it contains the list of indexes (as a list of IndexOptInfo, which similarly describes an index) that 3rd party code can manipulate to add hypothetical indexes, or even to prevent the planner from considering some index.
1) hypopg can't fake it because core postgres will retrieve the content from its own cache, and it's not possible to insert fake data into that cache. So yes, the only way to allow 3rd party code to provide hypothetical extended statistics would be a new hook that is called instead of statext_mcv_load when provided.
2) I'm not sure that I get your question. Do you mean "is there any reason not to support 3rd party provided extended statistics, same as we already support 3rd party regular statistics"? If that's the case, I'm not sure why it wasn't added when the feature was introduced. I guess no one thought about it and no one asked until now.
Thx.
If that's the case, I'm not sure why it wasn't added when the feature was introduced. I guess no one thought about it and no one asked until now.
Do you think it's much work to implement this feature (e.g. for a C++ programming student who is good, but uninitiated in PG internals)? Or should I ask this first over at the PG IRC or on the PG dev mailing list?
The change in postgres should be trivial (more or less add the hook declaration and call it rather than the function if it's not NULL, and there are multiple other hook implementations to get inspiration from), but asking on pgsql-hackers first could be a good idea to make sure that this is wanted and there are no caveats.
Thanks for looking into it and the explanation @rjuju.
@sfkeller please let us know what you hear.
@sfkeller please let us know what you hear.
Yes, will do.
And because I think Automated Database Query Optimization is a very hot topic, I dared to involve @s-hironobu sensei and @laurenz in the discussion.
I somewhat agree with @rjuju. Extended statistics are great, but if you don't know the actual values, you cannot tell if they improve an estimate.
Extended statistics can work both ways. Here is a little example:
CREATE TABLE address (
city integer NOT NULL,
zip integer NOT NULL,
street integer NOT NULL
);
INSERT INTO address
SELECT i/10000 + 1, i/100 + 1, i
FROM generate_series(1, 1000000) AS i;
ANALYZE address;
city and zip are correlated. Now I run two queries, one where the ZIP code is in the city and one where it isn't:
EXPLAIN (ANALYZE) SELECT * FROM address WHERE city = 42 AND zip = 4200;
QUERY PLAN
----------------------------------------------------------------------------------------------------------
Seq Scan on address (cost=0.00..20406.00 rows=1 width=12) (actual time=28.848..55.878 rows=100 loops=1)
Filter: ((city = 42) AND (zip = 4200))
Rows Removed by Filter: 999900
Planning Time: 0.111 ms
Execution Time: 55.944 ms
(5 rows)
EXPLAIN (ANALYZE) SELECT * FROM address WHERE city = 42 AND zip = 4300;
QUERY PLAN
--------------------------------------------------------------------------------------------------------
Seq Scan on address (cost=0.00..20406.00 rows=1 width=12) (actual time=54.109..54.109 rows=0 loops=1)
Filter: ((city = 42) AND (zip = 4300))
Rows Removed by Filter: 1000000
Planning Time: 0.125 ms
Execution Time: 54.145 ms
(5 rows)
The first estimate is off, the second is very good.
Now let's calculate extended statistics and try again:
CREATE STATISTICS addr_stats (dependencies) ON city, zip FROM address;
ANALYZE address;
EXPLAIN (ANALYZE) SELECT * FROM address WHERE city = 42 AND zip = 4200;
QUERY PLAN
------------------------------------------------------------------------------------------------------------
Seq Scan on address (cost=0.00..20406.00 rows=100 width=12) (actual time=22.586..48.996 rows=100 loops=1)
Filter: ((city = 42) AND (zip = 4200))
Rows Removed by Filter: 999900
Planning Time: 0.219 ms
Execution Time: 49.020 ms
(5 rows)
EXPLAIN (ANALYZE) SELECT * FROM address WHERE city = 42 AND zip = 4300;
QUERY PLAN
----------------------------------------------------------------------------------------------------------
Seq Scan on address (cost=0.00..20406.00 rows=100 width=12) (actual time=53.904..53.904 rows=0 loops=1)
Filter: ((city = 42) AND (zip = 4300))
Rows Removed by Filter: 1000000
Planning Time: 0.106 ms
Execution Time: 53.933 ms
(5 rows)
Now the first estimate is spot on, but the second is off.
Since bad row count estimates will lead to bad cost estimates further down the road, a reduced cost value caused by extended statistics is not proof that things actually got better.
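For what it's worth, the dependency coefficients that ANALYZE computed for addr_stats can be inspected directly; the pg_stats_ext view is assumed here, which exists since PostgreSQL 12:
-- show the functional-dependency degrees gathered for addr_stats
SELECT statistics_name, attnames, dependencies
FROM pg_stats_ext
WHERE statistics_name = 'addr_stats';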
Sorry to ask: but why is the second query plan in the second example off? Isn't the actual "time=53.904..53.904" quite close to the "actual time=22.586..48.996" of the first query, considering the "Execution Time: 53.933 ms"?
And more basic: IMHO, if extended statistics are not contributing to better row count estimates in practical situations, then either the statistics are outdated (which means a DB admin can do something about it), or the current PG planner is failing to take e.g. knowledge of functional dependencies into account, isn't it?
The plan is off because it estimates 100 rows, when really there are 0. The execution time is pretty much the same in all examples, because the database does pretty much the same thing.
As I said, misestimates in row count may lead to misestimates in execution costs later on (that "later on" is not part of my example). I am just trying to show that extended statistics can lead to worse estimates. So without seeing the real values it is hard to say if extended statistics improve a query plan or not.
The dependency adjustment made with extended statistics relies on the queries using matching values. In this example, extended statistics rely on people always searching for existing city/zip combinations. If you cannot rely on that, you are better off not using extended statistics.
So, are you saying that because there's a chance plans get worse without perfect statistics, we should forget about pg_statistic_ext (or pg_stats) altogether? Or is this just a warning about the current (sampling) statistics approach in PG? Or just saying that pg_statistic_ext is not yet good enough?
_(Shameless plug: IMHO this is going off topic. Let's take the chance to discuss this on my next online seminar :-) https://www.swisspug.org/wiki/index.php/Agenda#Seminar:_On_Pure_Synthetic_Data_Generation_for_Databases )_
I don't think this is going off-topic.
@rjuju said:
One thing still bothers me. For extended statistics investigation, I'm not sure that having the explain without analyze is really that useful. Unlike hypothetical indexes (where you want to know if the index is used and how), here you probably want to see how far off the estimates are with and without various extended statistics compared to the real execution plan, right?
Extended statistics are incredibly cool, but from a plain EXPLAIN (without ANALYZE) you cannot tell if they will improve the query plan.
The whole raison d'être of HypoPG is to create an object (index) hypothetically (because creating a real index takes a long time, uses disk space and requires a lock on the table that blocks writes). So you can answer the question “can that index be used” without actually creating the index, which is useful.
Extended statistics fall into a different category: they are easy to create, don't use a lot of disk space and don't need a write-blocking lock on the table (only the ShareUpdateExclusiveLock mentioned above). So why create hypothetical extended statistics when you can create real ones? And while you can get useful insights from a hypothetical index with a plain EXPLAIN, that is not so certain for extended statistics (as I tried to explain).
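A minimal sketch of that "just create the real thing" approach, reusing the address example from above (the DROP at the end is only needed if the statistics turn out not to help):
CREATE STATISTICS addr_stats (dependencies) ON city, zip FROM address;
ANALYZE address;
-- compare the new estimates against the real execution
EXPLAIN (ANALYZE) SELECT * FROM address WHERE city = 42 AND zip = 4200;
-- drop the statistics object again if it didn't improve anything
DROP STATISTICS addr_stats;
ANALYZE address;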
Now I understand! Thx for your patience. So, we can close this issue for now?
Agreed. Thanks all for the feedback!
This is a feature suggestion along the lines of functions like hypopg_create_statistics. Optimization assumes that columns are independent, so a query plan can be enhanced not only by an additional index, but also by letting the optimizer know about functional dependencies (FD) between columns. Real-world datasets often have FDs between columns because of denormalization, or simply because the data is like that.
PostgreSQL offers CREATE STATISTICS on FD.
Expected behaviour: Adding hypopg_create_statistics in order to simulate the existence of FD.
Example: Looking e.g. at this test database from the ATP tennis tour, there's at least one FD, namely between tourney_id and tourney_name. So this is what's usually done (see the sketch below).
BTW: CREATE STATISTICS currently also supports ndistinct, which enables n-distinct statistics, and mcv, which enables most-common-values lists.
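A minimal sketch of both the current and the proposed usage; the table name atp_matches is a hypothetical stand-in for whatever the ATP test data is actually loaded into:
-- what's usually done today: a real extended statistics object on the FD columns
CREATE STATISTICS atp_fd_stats (dependencies) ON tourney_id, tourney_name FROM atp_matches;
ANALYZE atp_matches;
-- the suggested counterpart (proposed, does not exist yet): hypothetical
-- statistics visible only to the current session, analogous to hypopg_create_index
SELECT hypopg_create_statistics('CREATE STATISTICS atp_fd_stats (dependencies) ON tourney_id, tourney_name FROM atp_matches');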