PostgREST / postgrest

REST API for any Postgres database
https://postgrest.org
MIT License

Estimated count #1378

Closed - LorenzHenk closed this issue 5 years ago

LorenzHenk commented 5 years ago

In future versions we will support Prefer: count=estimated to leverage the PostgreSQL statistics tables for a fast (and fairly accurate) result. (Quoted from http://postgrest.org/en/v5.2/admin.html)

We would really like to have that feature implemented!

Our ideas

count=estimated

To extract the estimated count from an EXPLAIN call, you can use EXPLAIN (FORMAT JSON) $query. FORMAT JSON makes extracting the planned row count easier (i.e. explain_result["QUERY PLAN"][0]["Plan"]["Plan Rows"]). The EXPLAIN call would have to be done without limit and offset.
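
For illustration, a minimal sketch of the extraction (the table and filter here are made up, not necessarily the exact query PostgREST would build):

EXPLAIN (FORMAT JSON) SELECT * FROM clients WHERE name = 'foo';

-- The single "QUERY PLAN" value is a JSON array along the lines of
--   [{"Plan": {"Node Type": "Seq Scan", "Relation Name": "clients", "Plan Rows": 906, ...}}]
-- so the planner's estimate can be read with plan -> 0 -> 'Plan' ->> 'Plan Rows'.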

count=estimated-overfetched

When you are interested in the count, the relative error is important. If you have an estimated count of 1000000 and the real count is 1001000, the error is small enough to be ignored. But with an estimated count of 7, a real count of 28 would be a huge misprediction.

In general, for smaller row counts the estimated count should be as close to the real count as possible. How about actually counting as long as the row count is <= 1000 and using the estimated value from EXPLAIN only if the row count exceeds 1000? Note: 1000 is just a placeholder threshold; it just has to be small enough to not significantly increase the response time. This number would have to be adapted to the limit used for the query.

This approach would have the following implications:

Related issues: #255 #273

steve-chavez commented 5 years ago

Definitely a long promised feature.

@LorenzHenk I'll think about this and return with some feedback once I'm done with the docs for our latest version https://github.com/PostgREST/postgrest-docs/pull/239.

begriffs commented 5 years ago

To extract the estimated count from an EXPLAIN call, you can use EXPLAIN (FORMAT JSON) $query. FORMAT JSON makes extracting the planned row count easier (i.e. explain_result["QUERY PLAN"][0]["Plan"]["Plan Rows"]). The EXPLAIN call would have to be done without limit and offset.

I agree it makes sense to use EXPLAIN to get an estimate, since it'll handle queries with arbitrary filters.

When you are interested in the count, the relative error is important. If you have an estimated count of 1000000 and the real count is 1001000, the error is small enough to be ignored. But with an estimated count of 7, a real count of 28 would be a huge misprediction.

Have you experimented to see how the size of the error relates to the size of the true count? For instance, when the true count is 28, is it possible for the prediction to be 7, or would it be +/- something smaller?

It may be influenced by how often the statistics collector is run... https://www.postgresql.org/docs/current/monitoring-stats.html

How about actually counting as long as the row count is <= 1000 and using the estimated value from EXPLAIN only if the row count exceeds 1000?

If the error difference is proportional to the true count, then perhaps this two-step process is overkill. But if significant errors can occur at any size then this seems like a good approach.

count=estimated-overfetched

What about supporting just "estimated" and "exact" for count, and always using the same estimation process, whether that's the two-step approach or the simple explain output? That way clients wouldn't need to know as much about our implementation and could leave it up to us to deliver a good estimate.

steve-chavez commented 5 years ago

count=estimated

I agree on using the EXPLAIN Plan Rows for this but how about if we allow it only for HEAD requests?

I think obtaining metadata maps pretty well to HEAD (which we currently don't support) and we would maintain our design of translating one HTTP request to one SQL query.

Pagination would involve doing two requests, but I think that would not be a major problem.

count=estimated-overfetched

I'll propose an idea on how to support this use case. I think we can avoid offering a default estimate that could leave the end user wondering.

Right now we support doing custom logic with headers in Stored Procedures. So we can have an estimated count doing this:

create or replace function clients_estimated_count() returns void as $$
declare
  expl      json;
  estimated int;
begin

if current_setting('request.header.prefer', true) = 'count=estimated-overfetched' then
  -- run EXPLAIN (FORMAT JSON) on the same select that would be generated for the table
  EXECUTE $_$EXPLAIN (FORMAT JSON) SELECT "test"."clients".* FROM "test"."clients"$_$
    INTO expl;
  -- the output is a json array; Plan Rows holds the planner's row estimate
  estimated = (expl->0->'Plan'->>'Plan Rows')::int;
  -- custom logic to reduce error here
  -- estimated > 1000
  PERFORM set_config('response.headers',
    format('[{"Content-Range": "*/%s"}]', estimated), true);
end if;

end;$$ language plpgsql;

-- Custom Content-Range can be seen with:
-- curl -v -H "Prefer: count=estimated-overfetched" "localhost:3000/rpc/clients_estimated_count"
-- < Content-Range: */906

As can be seen, custom logic for tuning the estimation can be applied. The drawback is that it only works for one table.

My proposal would be to make this solution generic by leveraging the pre-request function. To allow this we would need to:


What do you guys think?

LorenzHenk commented 5 years ago

Have you experimented to see how the size of the error relates to the size of the true count? For instance, when the true count is 28, is it possible for the prediction to be 7, or would it be +/- something smaller? It may be influenced by how often the statistics collector is run... If the error difference is proportional to the true count, then perhaps this two-step process is overkill. But if significant errors can occur at any size then this seems like a good approach.

Estimation is a huge topic, but to sum it up: When 10% of a table's data changes, ANALYZE is automatically run by the auto-vacuum process. The estimated row count depends on:

Function calls in the WHERE clause are especially tricky - you can set the estimated row count on function creation, but this has to be a static value (e.g. ROWS 100, see ROWS). At the moment there is no way to improve the estimated row count for functions returning a varying number of rows, though PostgreSQL 12 will have support functions, which can calculate the estimated rows.
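
A minimal sketch of that static declaration (hypothetical function; the planner will assume 100 result rows for every call, regardless of the argument):

CREATE FUNCTION heavy_function(other_value int) RETURNS SETOF int AS $$
BEGIN
  RETURN QUERY SELECT generate_series(1, other_value);
END;
$$ LANGUAGE plpgsql STABLE ROWS 100;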

As PostgREST also works on views, the user can build complicated queries fairly quickly. Views can look like this:

SELECT *
FROM  some_table as st
WHERE st.some_value IN (
  SELECT value
  FROM   heavy_function(st.other_value)
)
AND   st.name = 'foo'
;

-- estimated count could be 230
-- real count could be 19

It ultimately depends on the creator of the tables / views - if PostgREST is used for tables and simple views only, the estimated row count should be fine - but for more complicated queries as stated above, it is not reliable.

What about supporting just "estimated" and "exact" for count, and always using the same estimation process, whether that's the two-step approach or the simple explain output? That way clients wouldn't need to know as much about our implementation and could leave it up to us to deliver a good estimate.

Not knowing about this can cause a huge unexpected performance hit (or have other unexpected implications); consider the following query:

SELECT public.fun(id)
FROM  generate_series(1,10) AS id
WHERE id % 3 = 0
ORDER BY id DESC
LIMIT 1

public.fun can do anything (e.g. insert into a table, call another function, etc.). This query results in public.fun being called exactly once, with id = 9. But if we automatically (and unknowingly to the user) raise the LIMIT to our threshold (let's stay with 1000), public.fun will be called with 9, 6, and 3!
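
One way to observe this (the function body below is just a hypothetical stand-in):

CREATE FUNCTION public.fun(i int) RETURNS int AS $$
BEGIN
  RAISE NOTICE 'fun called with %', i;  -- makes every call visible in the logs
  RETURN i;
END;
$$ LANGUAGE plpgsql;

-- With LIMIT 1 a single NOTICE appears (id = 9); raising the LIMIT to the
-- threshold produces NOTICEs for 9, 6 and 3, as described above.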

I agree on using the EXPLAIN Plan Rows for this but how about if we allow it only for HEAD requests?

Moving the count to a HEAD request seems fine to me.

I'll propose an idea on how to support this use case. I think we can avoid offering a default estimate that could leave the end user wondering. My proposal would be to allow this solution to be generic by leveraging the pre-request function.

Note: if it's done through the pre-request function, there is no reason for the user to call it estimated-overfetched - it could just as well be custom-estimated; that's up to the user.

The question is whether or not this would be repetitive for users, because most will want to have the exact same behavior.

How would this play along with moving the count request to a HEAD request?

steve-chavez commented 5 years ago

@LorenzHenk For the pre-request proposal there would not be a restriction on the HTTP method. It could work with GET normally.

And yes, the header name and function body could be up to the user. But I was thinking we could offer a recommended function that does the estimated count with the threshold you suggested. We could have this function in a how-to guide in our docs.

The benefit of having this in a pg function rather than in our codebase is that it'd be easier to inspect/understand and modify (though this may be rare, as you mentioned; I'm just trying to consider all the alternatives).

LorenzHenk commented 5 years ago

@steve-chavez Using a pre-request function seems fine to me.

Would the pre-request function take care of the count=estimated case as well or would that one still be "hard-coded" in PostgREST?

Note: When a GUC containing the generated select-statement is added, https://postgrest.org/en/preview/api.html#accessing-request-headers-cookies-and-jwt-claims must be updated / rewritten to include the new GUC.

steve-chavez commented 5 years ago

@LorenzHenk count=estimated would be done with the HEAD request as proposed earlier, using the unmodified EXPLAIN Plan Rows.

Currently we're missing support for HEAD and adding it should be simple enough. We could implement that as a first step, and after that give the estimated-overfetched design another thought (maybe it'd be simpler to just add it to our codebase).

steve-chavez commented 5 years ago

@LorenzHenk I've implemented count=estimated in #1383. It works for GET (not only for HEAD as I initially proposed) and HEAD.

You can build from source to test it. Also, you could check the pg logs to see the generated EXPLAIN query (it doesn't contain LIMIT/OFFSET, only the WHERE).

Now, I'd like to revisit the estimated-overfetched case.

How about actually counting as long as the row count is <= 1000 and using the estimated value from EXPLAIN only if the row count exceeds 1000?

If we do that, won't we actually need a COUNT(*) before doing the EXPLAIN, and thereby fail to prevent the expensive COUNT?

Also, not sure if you are aware of our max-rows config; it seems related to your threshold proposal.

LorenzHenk commented 5 years ago

If we do that, won't we actually need a COUNT(*) before doing the EXPLAIN, and thereby fail to prevent the expensive COUNT?

The idea is to limit the query by our threshold:

BEGIN;

CREATE TABLE t_sample AS SELECT * FROM generate_series(1, 2200) AS id;

SELECT  jsonb_agg(res) AS query_result, count(*) AS limited_count
FROM    (
    SELECT *
    FROM t_sample
    LIMIT 1000
) AS res
;

ROLLBACK;

If limited_count returns 1000, the total count is at least 1000. At that point, we should move on as in count=estimated - check what the EXPLAIN estimated. If the estimated rows are below 1000, then we should stick to 1000, as our real count told us there are at least 1000 rows.

If limited_count is < 1000 we have an exact count, no need to do the EXPLAIN at all.
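
So the reported total boils down to something like this (just a sketch; plan_rows stands in for the EXPLAIN estimate and 1000 for the threshold):

SELECT CASE
         WHEN limited_count < 1000 THEN limited_count   -- exact count, no EXPLAIN needed
         ELSE greatest(plan_rows, 1000)                 -- at least 1000 rows, take the larger value
       END AS reported_total
FROM (VALUES (1000, 906)) AS t(limited_count, plan_rows);  -- hypothetical values; this would report 1000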

Also, not sure if you are aware of our max-rows config; it seems related to your threshold proposal.

Didn't know about that one; we could extend its purpose instead of introducing a new config property.

steve-chavez commented 5 years ago

@LorenzHenk Here's what we currently do with max-rows; on a request with count=exact we actually generate two COUNTs:

CREATE TABLE t_sample AS SELECT * FROM generate_series(1, 2200) AS id;

-- postgrest.conf
max-rows = 1000
db-schema = "test"

-- request
curl -v -H "Prefer: count=exact" localhost:3000/t_sample

-- generated query
WITH pg_source AS (
  SELECT "test"."t_sample".* FROM "test"."t_sample"
  LIMIT 1000 OFFSET 0
)
SELECT
  (SELECT pg_catalog.count(*) FROM  "test"."t_sample" ) AS total_result_set,
  pg_catalog.count(_postgrest_t) AS page_total,
  array[]::text[] AS header,
  coalesce(json_agg(_postgrest_t), '[]')::character varying AS body
FROM ( SELECT * FROM pg_source) _postgrest_t;

-- This is the generated response; the Content-Range is formed by the
-- page_total/total_result_set COUNTs.

HTTP/1.1 206 Partial Content
--...
Content-Range: 0-999/2200

[{"id":1, ... {"id":999}, {"id":1000}]

As you can see the page_total is the query result COUNT and the total_result_set is the table total COUNT. The total_result_set is unaffected by LIMIT/OFFSET.

With max-rows this means that the page_total will always be <= 1000; even if a limit higher than 1000 is specified in the url (e.g. /t_sample?limit=2000), PostgREST will stick to 1000 for the LIMIT. A lower limit can be supplied though (e.g. /t_sample?limit=400).

If count=exact is omitted, no COUNT is done for total_result_set and it is left empty (Content-Range: 0-999/*).

Now with count=estimated you can avoid the extra expensive COUNT(*); instead we get the EXPLAIN Plan Rows for the last part of the Content-Range (0-999/this one).

Not sure how to map the SQL logic you propose to our generated query (perhaps you could accommodate it?). But I'm thinking maybe your estimated-overfetched use case could be solved with a combination of max-rows and count=estimated.

LorenzHenk commented 5 years ago

# pseudo Python code; query() and get_estimated_rows() are placeholders for
# "run this SQL" and "read the Plan Rows from EXPLAIN (FORMAT JSON)"

def do_the_query(threshold, limit, offset, your_query):
  return query(f"""
    WITH pg_source AS (
      {your_query}
    ),
    pg_source_limited AS (
      SELECT *
      FROM pg_source
      LIMIT {limit} OFFSET {offset}
    ),
    pg_source_threshold AS (
      SELECT *
      FROM pg_source
      LIMIT {threshold}
    )
    SELECT
      (SELECT pg_catalog.count(*) FROM pg_source_threshold) AS count_threshold_result,
      pg_catalog.count(_postgrest_t) AS page_total,
      array[]::text[] AS header,
      coalesce(json_agg(_postgrest_t), '[]')::character varying AS body
    FROM (SELECT * FROM pg_source_limited) _postgrest_t
  """)

max_rows = 1000
threshold = 800

limit = 500   # current limit, coming from GET parameter
offset = 50   # current offset, coming from GET parameter

your_query = """
  SELECT "test"."t_sample".*
  FROM "test"."t_sample"
"""

# never exceed max-rows
actual_limit = limit
if limit > max_rows:
  actual_limit = max_rows

count_threshold_result, page_total, header, body = do_the_query(threshold, actual_limit, offset, your_query)

actual_count = count_threshold_result

if count_threshold_result > threshold:
  # the amount we received is higher than the threshold we want to fetch,
  # so we check what the optimizer says - as in count=estimated
  estimated_rows = get_estimated_rows(your_query)
  if estimated_rows > count_threshold_result:
    actual_count = estimated_rows

steve-chavez commented 5 years ago

@LorenzHenk Much clearer now, thanks. The max-rows config doesn't interact with our total COUNT right now. So what you suggest makes sense and is probably what most users would want.

However, how about if we make this the behavior for count=estimated instead of having the additional count=estimated-overfetched? We could document its interaction with max-rows in our docs.

LorenzHenk commented 5 years ago

However, how about if we make this the behavior for count=estimated instead of having the additional count=estimated-overfetched?

The reason I'm talking about these two approaches separately is that you don't know which query is built in the end. The underlying data source can be a table or a view.

Example:

CREATE VIEW test AS
SELECT long_running_function(id) AS huge_result_set
FROM t_sample;

In this case, I would use count=estimated to know how many rows are in t_sample, but I would only fetch and show 1 result at a time (LIMIT 1). Thus, the long_running_function is called once.

If I used count=estimated-overfetched instead, the LIMIT 1 would be ignored, as the LIMIT {threshold} is higher (e.g. 1000). Thus, the view would be called for the first 1000 rows, but only the result of the first row would be returned - the 1000 rows are only for counting. The long_running_function is called 1000 times.

This is the reason why count=estimated and count=estimated-overfetched need to be separated. It's a special use-case, but I can see it happen.

steve-chavez commented 5 years ago

If I used count=estimated-overfetched instead, the LIMIT 1 would be ignored, as the LIMIT {threshold} is higher (e.g. 1000). Thus, the view would be called for the first 1000 rows, but only the result of the first row would be returned - the 1000 rows are only for counting. The long_running_function is called 1000 times.

@LorenzHenk As it is right now, if you have max-rows=1000 and you specify /test?limit=1, the LIMIT 1 would be applied, not ignored. You can specify LIMITs that are lower than max-rows. If higher LIMITs are specified, the max-rows LIMIT will prevail.

So, for the case you mention the long_running_function would still be called once if you specified /test?limit=1 despite what max-rows says.

steve-chavez commented 5 years ago

Also, for cases like long_running_function I think it'd be better to call them through our stored_procedures interface (not sure if you're aware of this feature) and make the function return a SETOF. That way you could have additional safeguards in the function and prevent an expensive call.

count=estimated could be supported for stored procedures as well, and you could get the function's defined ROWS with GET /rpc/funcname Prefer: count=estimated.
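
For example (hypothetical function and endpoint; the declared ROWS 500 is what the planner, and therefore such a request, would report):

CREATE FUNCTION search_clients(term text) RETURNS SETOF clients AS $$
BEGIN
  RETURN QUERY SELECT * FROM clients WHERE name ILIKE '%' || term || '%';
END;
$$ LANGUAGE plpgsql STABLE ROWS 500;

-- GET /rpc/search_clients?term=foo with Prefer: count=estimated would then
-- get 500 as the estimate.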

LorenzHenk commented 5 years ago

If you have max-rows=1000 and you specify /test?limit=1, the LIMIT 1 would be applied, not ignored. So, for the case you mention the long_running_function would still be called once if you specified /test?limit=1 despite what max-rows says.

But if you add a Prefer: count=exact header, it will still run a count(*) over the whole query - without limit. You said this is what is generated at the moment:

(SELECT pg_catalog.count(*) FROM "test"."t_sample" ) AS total_result_set,

Now think of t_sample as a complicated view - the statement above would query the whole view. Even though you say limit=1, the count still uses the whole data set.

LorenzHenk commented 5 years ago

Also, for cases like long_running_function I think it'd be better to call them through our stored_procedures

Right, that was a bad example. Think of it as a long-running view where you want limit=10. It's still the same problem.

It's all about expectations: if I used count=estimated, I would expect it to use the optimizer statistics - not to actually start fetching 1000 rows.

steve-chavez commented 5 years ago

It's all about expectations: if I used count=estimated, I would expect it to use the optimizer statistics

True, I agree. The special behavior with max-rows would be better separated.

Ok, estimated-overfetched is up for implementation now!

steve-chavez commented 5 years ago

@LorenzHenk I've implemented estimated-overfetched in #1386. The behavior is as we agreed, using max-rows as the threshold and avoiding another config option.

Based on your python example, the logic I've added is like:

max_rows = 1000

# Use LIMIT max_rows + 1 as the threshold so we can determine below that max_rows was surpassed
count_threshold_result, page_total, header, body = do_the_query(max_rows + 1, limit, offset, your_query)

actual_count = count_threshold_result

if count_threshold_result > max_rows:
  estimated_rows = get_estimated_rows(your_query)
  if estimated_rows > count_threshold_result:
    actual_count = estimated_rows

I've now realized that this behavior is actually a good default. We could recommend its use.

However, the name estimated-overfetched seems more fit for a special use case. How about if we change your proposed names to these:

- count=estimated → count=planned (the plain EXPLAIN Plan Rows estimate)
- count=estimated-overfetched → count=estimated (exact count up to the max-rows threshold, planner estimate above it)

What do you think?

LorenzHenk commented 5 years ago

We could recommend its use.

If you want to recommend count=estimated (previously called count=estimated-overfetched), you should also warn about the caveats mentioned before. count=planned can be recommended without worries.

What do you think?

I'm fine with the name changes, go for it.

steve-chavez commented 5 years ago

@LorenzHenk Regarding the concern you mentioned before.

What I'm doing for the count query with the threshold is something like:

pg_source_threshold AS (
  SELECT 1
  FROM test -- this is the "test" view in your example that contains "long_running_function"
  LIMIT {threshold}
)

That SELECT 1 will prevent long_running_function(id) from getting called if and only if the long_running_function is marked as STABLE or IMMUTABLE. Here's a gist that proves that.
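
Not the linked gist, but a minimal sketch of that effect using the earlier t_sample/test example (the sleep is only a stand-in for expensive work):

CREATE FUNCTION long_running_function(i int) RETURNS int AS $$
BEGIN
  PERFORM pg_sleep(0.1);  -- stand-in for the expensive work
  RETURN i;
END;
$$ LANGUAGE plpgsql STABLE;

CREATE VIEW test AS
SELECT long_running_function(id) AS huge_result_set
FROM t_sample;

-- The counting query only references the constant 1, so the planner can drop the
-- unused STABLE expression and the function never runs; with VOLATILE it could not.
SELECT count(*) FROM (SELECT 1 FROM test LIMIT 1000) AS sub;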

In general I think we can recommend that it's always best to label the functions with the strictest volatility.

One other way would be to use computed/virtual columns and avoid having the expensive function in the view.

steve-chavez commented 5 years ago

@LorenzHenk Finished the implementation in https://github.com/PostgREST/postgrest/pull/1386!

Also, based on our discussion, I added docs for planned count/estimated count (any correction is welcome).

LorenzHenk commented 5 years ago

Thank you, this feature is very appreciated!