Definitely a long-promised feature.
@LorenzHenk I'll think about this and return with some feedback once I'm done with the docs for our latest version https://github.com/PostgREST/postgrest-docs/pull/239.
> To extract the estimated count from an explain call, you can use `EXPLAIN (FORMAT JSON) $query`. `FORMAT JSON` makes extracting the planned row count easier (i.e. `explain_result["QUERY PLAN"][0]["Plan"]["Plan Rows"]`). The explain call would have to be done without `limit` and `offset`.
I agree it makes sense to use EXPLAIN to get an estimate, since it'll handle queries with arbitrary filters.
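For illustration, here is that extraction as a self-contained plpgsql sketch (the table name is just an example):

```sql
DO $$
DECLARE
  expl json;
  planned_rows bigint;
BEGIN
  -- no LIMIT/OFFSET in the explained query, as noted above
  EXECUTE 'EXPLAIN (FORMAT JSON) SELECT * FROM test.clients'
  INTO expl;
  -- FORMAT JSON wraps the plan in a one-element array, hence ->0
  planned_rows := (expl->0->'Plan'->>'Plan Rows')::bigint;
  RAISE NOTICE 'planned rows: %', planned_rows;
END $$;
```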
> When you are interested in the count, the relative error is important. If you have an estimated count of `1000000` and the real count is `1001000`, the error is small enough to be ignored. But with an estimated count of `7`, a real count of `28` would be a huge misprediction.
Have you experimented to see how the size of the error relates to the size of the true count? For instance, when the true count is 28, is it possible for the prediction to be 7, or would it be +/- something smaller?
It may be influenced by how often the statistics collector is run... https://www.postgresql.org/docs/current/monitoring-stats.html
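For what it's worth, the last statistics refresh for a table can be checked directly (a sketch; the table name is an example):

```sql
-- When were the planner statistics last refreshed for this table?
SELECT relname, last_analyze, last_autoanalyze
FROM pg_stat_user_tables
WHERE relname = 'clients';
```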
> How about actually counting as long as the row count is `<= 1000` and using the estimated value from `EXPLAIN` only if the row count exceeds `1000`?
If the error difference is proportional to the true count, then perhaps this two-step process is overkill. But if significant errors can occur at any size then this seems like a good approach.
> `count=estimated-overfetched`
What about supporting just "estimated" and "exact" for count, and always using the same estimation process, whether that's the two-step approach or the simple explain output? That way clients wouldn't need to know as much about our implementation and could leave it up to us to deliver a good estimate.
I agree on using the EXPLAIN `Plan Rows` for this, but how about if we allow it only for HEAD requests? I think obtaining metadata maps pretty well to HEAD (which we currently don't support) and we would maintain our design of translating 1 HTTP request to 1 SQL query. Pagination would involve doing two requests, but I think that would not be a major problem.
I'll propose an idea on how to support this use case. I think we can avoid offering a default estimate that could leave the end user wondering.
Right now we support doing custom logic with headers in Stored Procedures. So we can have an estimated count doing this:
```sql
create or replace function clients_estimated_count() returns void as $$
declare
  expl json;
  estimated int;
begin
  if current_setting('request.header.prefer', true) = 'count=estimated-overfetched' then
    EXECUTE $_$EXPLAIN (FORMAT JSON) SELECT "test"."clients".* FROM "test"."clients"$_$
    INTO expl;
    estimated = expl->0->'Plan'->'Plan Rows';
    -- custom logic to reduce error here
    -- estimated > 1000
    PERFORM set_config('response.headers',
      format('[{"Content-Range": "*/%s"}]', estimated), true);
  end if;
end; $$ language plpgsql;

-- Custom Content-Range can be seen with:
-- curl -v -H "Prefer: count=estimated-overfetched" "localhost:3000/rpc/clients_estimated_count"
-- < Content-Range: */906
```
As can be seen, custom logic for tuning the estimation can be applied. The drawback is that it's only for one table.
My proposal would be to make this solution generic by leveraging the pre-request function. To allow this we would need to:

- Allow adding headers in pre-request to all endpoints: https://github.com/PostgREST/postgrest/issues/1088
- Provide a GUC that contains the SQL query PostgREST generates for an HTTP request. It could be named `request.generated_query` and would be enabled by the user in the config file. This would serve to parametrize the EXPLAIN in the above example and make it generic for all queries.
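To sketch where this could go (everything below is hypothetical: `request.generated_query` is only proposed here, and the generated query would need its LIMIT/OFFSET stripped before explaining):

```sql
CREATE OR REPLACE FUNCTION generic_estimated_count() RETURNS void AS $$
DECLARE
  expl json;
  estimated int;
BEGIN
  IF current_setting('request.header.prefer', true) = 'count=estimated-overfetched' THEN
    -- request.generated_query is the proposed GUC; it does not exist yet
    EXECUTE format('EXPLAIN (FORMAT JSON) %s',
                   current_setting('request.generated_query', true))
    INTO expl;
    estimated := (expl->0->'Plan'->>'Plan Rows')::int;
    PERFORM set_config('response.headers',
                       format('[{"Content-Range": "*/%s"}]', estimated), true);
  END IF;
END; $$ LANGUAGE plpgsql;
```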
What do you guys think?
> Have you experimented to see how the size of the error relates to the size of the true count? For instance, when the true count is 28, is it possible for the prediction to be 7, or would it be +/- something smaller? It may be influenced by how often the statistics collector is run... If the error difference is proportional to the true count, then perhaps this two-step process is overkill. But if significant errors can occur at any size then this seems like a good approach.
Estimation is a huge topic, but to sum it up:

When 10% of a table's data changes, `ANALYZE` is automatically run by the auto-vacuum process.

The estimated row count depends on:

- how up-to-date the statistics are (i.e. the last `ANALYZE`)
- column correlation: filtering on correlated columns (e.g. `WHERE name = 'Blue Jeans' AND color = 'blue'`) leads to underestimations of the row count

In general, more complicated queries give wrong estimates more often. Function calls in the `WHERE` clause are especially tricky - you can set the estimated row count on function creation, but this has to be a static value (e.g. `ROWS 100`, see `ROWS`). At the moment there is no way to improve the estimated row count for functions returning a varying number of rows, though PostgreSQL 12 will have support functions, which can calculate the estimated rows.
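For illustration, this is how such a static estimate is pinned at creation time (names are examples):

```sql
-- The planner will always assume ~100 result rows for this function,
-- no matter what the argument is.
CREATE OR REPLACE FUNCTION heavy_function(other_value int)
RETURNS SETOF int AS $$
  SELECT generate_series(1, other_value);
$$ LANGUAGE sql STABLE ROWS 100;
```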
As PostgREST also works on views, the user can build complicated queries fairly quickly. Views can look like this:
```sql
SELECT *
FROM some_table AS st
WHERE st.some_value IN (
  SELECT value
  FROM heavy_function(st.other_value)
)
AND st.name = 'foo';
-- estimated count could be 230
-- real count could be 19
```
It ultimately depends on the creator of the tables / views - if PostgREST is used for tables and simple views only, the estimated row count should be fine - but for more complicated queries as stated above, it is not reliable.
> What about supporting just "estimated" and "exact" for count, and always using the same estimation process, whether that's the two-step approach or the simple explain output? That way clients wouldn't need to know as much about our implementation and could leave it up to us to deliver a good estimate.
Not knowing about this can cause a huge unexpected performance hit (or have other unexpected implications). Consider the following query:
```sql
SELECT public.fun(id)
FROM generate_series(1, 10) AS id
WHERE id % 3 = 0
ORDER BY id DESC
LIMIT 1;
```
`public.fun` can do anything (e.g. insert into a table, call another function, etc.). This query results in `public.fun` being called exactly 1 time, with `id = 9`. But if we automatically (and unknowingly to the user) set the `LIMIT` to our threshold (let's stay with `1000`), `public.fun` will be called with `9`, `6`, and `3`!
> I agree on using the EXPLAIN Plan Rows for this but how about if we allow it only for HEAD requests?
Moving the count to a `HEAD` request seems fine to me.
> I'll propose an idea on how to support this use case. I think we can avoid offering a default estimate that could leave the end user wondering. My proposal would be to allow this solution to be generic by leveraging the pre-request function.
Note: when doing it through the pre-request function, there is no reason for the user to call it `estimated-overfetched` - it could be `custom-estimated` just as well; that's up to the user.
The question is whether or not this would be repetitive for users, because most will want to have the exact same behavior.
How would this play along with moving the count request to a HEAD request?
@LorenzHenk For the `pre-request` proposal there would not be a restriction on the HTTP method. It could work with GET normally.
And yes, the header name and function body could be up to the user. But I was thinking we could offer a recommended function that does the estimated count with the threshold you suggested. We could have this function in a how-to guide in our docs.

The benefit of having this in a pg function rather than in our codebase is that it'd be easier to inspect/understand and modify (though this may be rare as you mentioned; I'm just trying to consider all the alternatives).
@steve-chavez Using a `pre-request` function seems fine to me.

Would the `pre-request` function take care of the `count=estimated` case as well, or would that one still be "hard-coded" in PostgREST?
Note: When a GUC containing the generated select-statement is added, https://postgrest.org/en/preview/api.html#accessing-request-headers-cookies-and-jwt-claims must be updated / rewritten to include the new GUC.
@LorenzHenk `count=estimated` would be done with the HEAD request as proposed earlier, using the unmodified EXPLAIN `Plan Rows`.

Currently we're missing support for HEAD, and adding it should be simple enough. We could implement that as a first step, and after that give the `estimated-overfetched` design another thought (maybe it'd be simpler to just add it to our codebase).
@LorenzHenk I've implemented `count=estimated` in #1383. It works for GET (not only for HEAD as I initially proposed) and HEAD.

You can build from source to test it. You could also check the pg logs to see the generated EXPLAIN query (it doesn't contain LIMIT/OFFSET, only the WHERE).

Now, I'd like to revisit the `estimated-overfetched` case.
> How about actually counting as long as the row count is <= 1000 and using the estimated value from EXPLAIN only if the row count exceeds 1000?
If we do that, won't we actually need a COUNT(*) before doing the EXPLAIN, and thereby fail to prevent the expensive COUNT?

Also, not sure if you are aware of our max-rows config; it seems related to your threshold proposal.
> If we do that, won't we actually need a COUNT(*) before doing the EXPLAIN, and thereby fail to prevent the expensive COUNT?
The idea is to limit the query by our threshold:
```sql
BEGIN;

CREATE TABLE t_sample AS SELECT * FROM generate_series(1, 2200) AS id;

SELECT jsonb_agg(res) AS query_result, count(*) AS limited_count
FROM (
  SELECT *
  FROM t_sample
  LIMIT 1000
) AS res;

ROLLBACK;
```
If `limited_count` returns `1000`, the total count is at least 1000. At that point, we should move on as in `count=estimated` - check what the `EXPLAIN` estimated. If the estimated rows are below `1000`, then we should stick to `1000`, as our real count told us there are at least `1000` rows.

If `limited_count` is `< 1000`, we have an exact count - no need to do the `EXPLAIN` at all.
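Putting the two steps together, a sketch of that decision rule (using the `t_sample` table from above):

```sql
DO $$
DECLARE
  limited_count int;
  expl json;
  estimated int;
  final_count int;
BEGIN
  -- step 1: count, but only up to the threshold
  SELECT count(*) INTO limited_count
  FROM (SELECT 1 FROM t_sample LIMIT 1000) AS res;

  IF limited_count < 1000 THEN
    final_count := limited_count;  -- exact count, no EXPLAIN needed
  ELSE
    -- step 2: threshold reached, fall back to the planner estimate
    EXECUTE 'EXPLAIN (FORMAT JSON) SELECT * FROM t_sample' INTO expl;
    estimated := (expl->0->'Plan'->>'Plan Rows')::int;
    final_count := greatest(limited_count, estimated);  -- never report less than the counted 1000
  END IF;

  RAISE NOTICE 'reported count: %', final_count;
END $$;
```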
> Also, not sure if you are aware of our max-rows config; it seems related to your threshold proposal.
Didn't know that one; we could extend its purpose instead of introducing a new config property.
@LorenzHenk Here's what we currently do with `max-rows`: on a request with `count=exact` we actually generate two COUNTs:
```
CREATE TABLE t_sample AS SELECT * FROM generate_series(1, 2200) AS id;

-- postgrest.conf
max-rows = 1000
db-schema = "test"

-- request
curl -v -H "Prefer: count=exact" localhost:3000/t_sample

-- generated query
WITH pg_source AS (
  SELECT "test"."t_sample".* FROM "test"."t_sample"
  LIMIT 1000 OFFSET 0
)
SELECT
  (SELECT pg_catalog.count(*) FROM "test"."t_sample") AS total_result_set,
  pg_catalog.count(_postgrest_t) AS page_total,
  array[]::text[] AS header,
  coalesce(json_agg(_postgrest_t), '[]')::character varying AS body
FROM ( SELECT * FROM pg_source) _postgrest_t;

-- generated response; the Content-Range is formed by
-- the page_total/total_result_set COUNTs
HTTP/1.1 206 Partial Content
-- ...
Content-Range: 0-999/2200

[{"id":1}, ... {"id":999}, {"id":1000}]
```
As you can see, the `page_total` is the query result COUNT and the `total_result_set` is the table total COUNT. The `total_result_set` is unaffected by LIMIT/OFFSET.

With `max-rows` this means that the `page_total` will always be <= 1000; even if a limit greater than 1000 is specified in the url (e.g. `/t_sample?limit=2000`), PostgREST will stick to 1000 for the LIMIT. A lower limit can be supplied though (e.g. `/t_sample?limit=400`).

If `count=exact` is omitted, there's no COUNT done for `total_result_set` and it is left empty (`Content-Range: 0-999/*`).
Now with `count=estimated` you avoid the extra expensive COUNT(*); instead we get the EXPLAIN Plan Rows for the last part of the `Content-Range` (`0-999/this one`).

Not sure how to map the SQL logic you propose to our generated query (perhaps you could accommodate it?). But I'm thinking maybe your estimated-overfetched use case could be solved with a combination of `max-rows` and `count=estimated`.
```python
# pseudo python code; query() and get_estimated_rows() are assumed helpers
def do_the_query(threshold, limit, offset, your_query):
    return query(f"""
        WITH pg_source AS (
          {your_query}
        ),
        pg_source_limited AS (
          SELECT *
          FROM pg_source
          LIMIT {limit} OFFSET {offset}
        ),
        pg_source_threshold AS (
          SELECT *
          FROM pg_source
          LIMIT {threshold}
        )
        SELECT
          (SELECT pg_catalog.count(*) FROM pg_source_threshold) AS count_threshold_result,
          pg_catalog.count(_postgrest_t) AS page_total,
          array[]::text[] AS header,
          coalesce(json_agg(_postgrest_t), '[]')::character varying AS body
        FROM ( SELECT * FROM pg_source_limited) _postgrest_t
    """)


max_rows = 1000
threshold = 800
limit = 500   # current limit, coming from GET parameter
offset = 50   # current offset, coming from GET parameter
your_query = """
SELECT "test"."t_sample".*
FROM "test"."t_sample"
"""

actual_limit = limit
if limit > max_rows:
    actual_limit = max_rows

count_threshold_result, page_total, header, body = do_the_query(threshold, actual_limit, offset, your_query)

actual_count = count_threshold_result
if count_threshold_result >= threshold:
    # the threshold was reached (LIMIT caps the count at the threshold),
    # so we check what the optimizer says - as in count=estimated
    estimated_rows = get_estimated_rows(your_query)
    if estimated_rows > count_threshold_result:
        actual_count = estimated_rows
```
@LorenzHenk Much clearer now, thanks. The `max-rows` config doesn't interact with our total COUNT right now, so what you suggest makes sense and is probably what most users would want.

However, how about if we make this the behavior for `count=estimated` instead of having the additional `count=estimated-overfetched`? We could document its interaction with `max-rows` in our docs.
> However, how about if we make this the behavior for count=estimated instead of having the additional count=estimated-overfetched?
The reason I'm talking about these two approaches separately is that you don't know which query is built in the end. The underlying data source can be a table or a view.
Example:

```sql
CREATE VIEW test AS
SELECT long_running_function(id) AS huge_result_set
FROM t_sample;
```
In this case, I would use `count=estimated` to know how many rows are in t_sample, but I would only fetch and show 1 result at a time (`LIMIT 1`). Thus, the `long_running_function` is called once.
If I used `count=estimated-overfetched` instead, the `LIMIT 1` would be ignored, as the `LIMIT {threshold}` is higher (e.g. 1000). Thus, the view would be called for the first `1000` rows, but only the result of the first row would be returned - the `1000` rows are only for counting. The `long_running_function` is called 1000 times.
This is the reason why `count=estimated` and `count=estimated-overfetched` need to be separated. It's a special use case, but I can see it happening.
> If I used `count=estimated-overfetched` instead, the `LIMIT 1` would be ignored, as the `LIMIT {threshold}` is higher (e.g. 1000). Thus, the view would be called for the first `1000` rows, but only the result of the first row would be returned - the `1000` rows are only for counting. The `long_running_function` is called 1000 times.
@LorenzHenk As it is right now, if you have `max-rows=1000` and you specify `/test?limit=1`, the `LIMIT 1` would be applied, not ignored. You can specify LIMITs that are lower than `max-rows`; if higher LIMITs are specified, the `max-rows` LIMIT will prevail.

So, for the case you mention, the `long_running_function` would still be called once if you specified `/test?limit=1`, despite what `max-rows` says.
Also, for cases like `long_running_function`, I think it'd be better to call them through our stored procedures interface (not sure if you're aware of this feature) and make the function return a SETOF. That way you could have additional safeguards in the function and prevent an expensive call.

`count=estimated` could be supported for stored procedures as well, and you could get the function-defined ROWS with `GET /rpc/funcname` plus `Prefer: count=estimated`.
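A sketch of that shape, reusing the `t_sample` table from above (the function name and safeguard are illustrative):

```sql
-- Called via GET /rpc/sample_rows; the internal LIMIT acts as a safeguard
-- and ROWS gives a defined estimate for Prefer: count=estimated.
CREATE OR REPLACE FUNCTION sample_rows(max_results int DEFAULT 10)
RETURNS SETOF t_sample AS $$
  SELECT * FROM t_sample LIMIT least(max_results, 100);
$$ LANGUAGE sql STABLE ROWS 100;
```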
> If you have a `max-rows=1000` and you specify `/test?limit=1`, the `LIMIT 1` would be applied, not ignored. So, for the case you mention, the `long_running_function` would still be called once if you specified `/test?limit=1`, despite what `max-rows` says.
But if you add a `Prefer: count=exact` header, it will still run a `count(*)` over the whole query - without a limit.

You said this is what is generated at the moment:

```sql
(SELECT pg_catalog.count(*) FROM "test"."t_sample") AS total_result_set,
```

Now think of `t_sample` as a complicated view - the statement above would query the whole view. Even though you say `limit=1`, the count still uses the whole data set.
> Also, for cases like long_running_function I think it'd be better to call them through our stored procedures
Right, that was a bad example. Think of it as a long-running view where you want `limit=10` - it's still the same problem.
It's all about expectation... if I used count=estimated, I would think of it as using the optimizer statistics - not actually starting to fetch 1000 rows.
> It's all about expectation... if I used count=estimated, I would think of it as using the optimizer statistics
True, I agree. The special behavior with `max-rows` would be better kept separate.

OK, `estimated-overfetched` is up for implementation now!
@LorenzHenk I've implemented estimated-overfetched in #1386.
The behavior is as we agreed: using `max-rows` as the threshold and avoiding another config option.

Based on your python example, the logic I've added looks like:
```python
max_rows = 1000

# Use LIMIT max_rows + 1 as the threshold so we can determine below
# whether max_rows was surpassed
count_threshold_result, page_total, header, body = do_the_query(max_rows + 1, limit, offset, your_query)

actual_count = count_threshold_result
if count_threshold_result > max_rows:
    estimated_rows = get_estimated_rows(your_query)
    if estimated_rows > count_threshold_result:
        actual_count = estimated_rows
```
I've now realized that this behavior is actually a good default. We could recommend its use.
However, the name `estimated-overfetched` seems more fit for a special use case. How about if we change your proposed names to these:

- `count=estimated` becomes `count=planned`: the raw PostgreSQL query planner guess, no special logic.
- `count=estimated-overfetched` becomes `count=estimated`: PostgREST estimation based on the `max-rows` logic we defined.

What do you think?
> We could recommend its use.
If you want to recommend `count=estimated` (previously called `count=estimated-overfetched`), you should also warn about the caveats mentioned before. `count=planned` can be recommended without worries.
> What do you think?
I'm fine with the name changes, go for it.
@LorenzHenk Regarding the concern you mentioned before.
What I'm doing for the count query with the threshold is something like:
```sql
pg_source_threshold AS (
  SELECT 1
  FROM test -- this is the "test" view in your example that contains "long_running_function"
  LIMIT {threshold}
)
```
That `SELECT 1` will prevent `long_running_function(id)` from getting called, if and only if the `long_running_function` is marked as STABLE or IMMUTABLE. Here's a gist that proves that.
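A minimal sketch along the lines of that gist, reusing the earlier example names:

```sql
CREATE FUNCTION long_running_function(id int) RETURNS int AS $$
BEGIN
  RAISE NOTICE 'evaluated for %', id;  -- only fires if the call is actually executed
  RETURN id;
END; $$ LANGUAGE plpgsql STABLE;

CREATE VIEW test AS
SELECT long_running_function(id) AS huge_result_set
FROM t_sample;

-- Raises no NOTICEs: since the function is STABLE, the unreferenced
-- view column is dropped from the plan.
SELECT 1 FROM test LIMIT 10;
```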
In general, I think we can recommend that it's always best to label functions with the strictest volatility that applies.
One other way would be to use computed/virtual columns and avoid having the expensive function in the view.
@LorenzHenk Finished the implementation in https://github.com/PostgREST/postgrest/pull/1386!
Also, based on our discussion, I added docs for planned count/estimated count (any correction is welcome).
Thank you, this feature is very appreciated!
We would really like to have that feature implemented!

Our ideas

count=estimated

To extract the estimated count from an explain call, you can use `EXPLAIN (FORMAT JSON) $query`. `FORMAT JSON` makes extracting the planned row count easier (i.e. `explain_result["QUERY PLAN"][0]["Plan"]["Plan Rows"]`). The explain call would have to be done without `limit` and `offset`.

count=estimated-overfetched

When you are interested in the count, the relative error is important. If you have an estimated count of `1000000` and the real count is `1001000`, the error is small enough to be ignored. But with an estimated count of `7`, a real count of `28` would be a huge misprediction.

In general, for smaller row counts the estimated count should be as close to the real count as possible. How about actually counting as long as the row count is `<= 1000` and using the estimated value from `EXPLAIN` only if the row count exceeds `1000`? Note: `1000` is just a placeholder threshold - it just has to be small enough to not significantly increase the response time. This number would have to be adapted to the `limit` used for the query.

This approach would have the following implications:

- The query is always fetched up to the threshold (the `1000` from above) to check how many rows actually exist. If the resource you want to query is a view containing a function call, this function will be called more often than expected by the user supplying a limit. Thus, this `estimated-overfetched` operation must only be used by users who know the implications.
- `EXPLAIN` would only have to be called if the threshold is reached.

Related issues: #255 #273