Open pdobacz opened 2 years ago
Unfortunately, when trying (3.) manually, I ran into more problems, where the *-bucket was missorted, likely due to an Incremental Sort
node being used:
SET pg_diffix.session_access_level = 'direct'; DELETE FROM pg_seclabel WHERE provider = 'pg_diffix' AND label = 'aid'; SECURITY LABEL FOR pg_diffix ON COLUMN taxi.hack IS 'aid'; SET pg_diffix.strict = off; SET pg_diffix.noise_layer_sd = 0.0;SET pg_diffix.low_count_layer_sd = 0.5 ; SET pg_diffix.low_count_min_threshold = 1;SET pg_diffix.low_count_mean_gap = 0;SET pg_diffix.outlier_count_min = 0;SET pg_diffix.outlier_count_max = 0;SET pg_diffix.top_count_min = 1;SET pg_diffix.top_count_max = 1; SET pg_diffix.salt = "diffix"; SET pg_diffix.session_access_level = 'anonymized_trusted'; \pset pager off
SELECT sf_flag, dropoff_datetime FROM taxi GROUP BY 1, 2 ORDER BY 1 NULLS FIRST, 2 NULLS FIRST, diffix.is_suppress_bin(*) DESC LIMIT 40;
(...SNIP...)
| 2013-01-08 00:24:00
| 2013-01-08 00:24:21
| 2013-01-08 00:25:00
| 2013-01-08 00:26:00
| 2013-01-08 00:27:00
* | *
| 2013-01-08 00:28:00
| 2013-01-08 00:29:00
| 2013-01-08 00:30:00
| 2013-01-08 00:31:00
| 2013-01-08 00:32:00
| 2013-01-08 00:33:00
| 2013-01-08 00:34:00
| 2013-01-08 00:35:00
(40 rows)
As you can see above, the *-bucket row doesn't observe the requested sorting. reference
is getting this right.
The Incremental Sort
node:
prop_test=# explain SELECT sf_flag, dropoff_datetime FROM taxi GROUP BY 1, 2 ORDER BY 1 NULLS FIRST, 2 NULLS FIRST, diffix.is_suppress_bin(*) DESC LIMIT 40;
QUERY PLAN
-------------------------------------------------------------------------------------------------------
Limit
-> Incremental Sort
Sort Key: sf_flag NULLS FIRST, dropoff_datetime NULLS FIRST, (diffix.is_suppress_bin(*)) DESC
Presorted Key: sf_flag, dropoff_datetime
-> Custom Scan (BucketScan)
-> GroupAggregate
Group Key: sf_flag, dropoff_datetime
-> Sort
Sort Key: sf_flag NULLS FIRST, dropoff_datetime NULLS FIRST
-> Seq Scan on taxi
Can we use get_relation_stats hook to give the column a distribution that does not trigger that plan path?
Can we use get_relation_stats hook to give the column a distribution that does not trigger that plan path?
This sounds a little complex and might have some unexpected side-effects - judging just after throwing a first glance. Let's keep it as another possible solution.
Also: SET enable_incremental_sort=off;
seems to fix the Incremental Sort
problem, but might not be the solution we're looking for.
The spot taken by the *-bucket row in the results of a query can vary from query to query (and sometimes for the same query issued before or after
ANALYZE;
!).In a nutshell, the
Sort
plan node may or may not be pushed down by the planner, to a spot before theAgg
/BucketScan
nodes. If it is then theORDER BY
sorting will not impact it, meaning it will come first. In other circumstances, if theSort
node comes last, it is impacted byORDER BY
.Example query where it didn't get pushed down:
And where it did (because
id
has many unique columns perANALYZE;
, and theHashAggregate
has been changed toSort -> GroupAggregate
:Ideas from slack thread:
ORDER BY is_suppress_bin(*)
and let the planner figure it out