PostgREST / postgrest

REST API for any Postgres database
https://postgrest.org
MIT License

Get rid of count(1) by default #273

Closed NikolayS closed 8 years ago

NikolayS commented 9 years ago

(This is related to #255)

In MANY cases the app is designed in a way where a count is not needed at all. For example, you query items and show them to the user in a twitter/pinterest/facebook-like "infinite" list: at the first step only the top N items are selected; then, for the second portion, again the top N items are selected but with some "shift" (using the id or created column, usually), and this process can be repeated many times.

This is a very common task. Usually you have many millions of items in the table, and the query should be executed in milliseconds.
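
To make that concrete, here is a minimal keyset ("order-and-shift") sketch in SQL. It assumes a hypothetical items table with id and created columns; the literal values are placeholders:

-- first portion: top N items
select * from items
order by created desc, id desc
limit 20;

-- next portion: shift past the last row already shown
select * from items
where (created, id) < ('2015-06-01 12:00:00', 12345)
order by created desc, id desc
limit 20;

With an index on (created, id), each portion is a cheap range scan and no count is ever computed.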

So in such cases count(1) is not needed at all. Moreover, an estimated count is not needed either, since the app doesn't show the user any information about it (no item count, no page number). I implemented solutions with estimated counts in a couple of large projects (dozens of millions of users), and unfortunately such approaches fail very fast – in cases where you have filters for which Postgres has no good statistics and produces wrong estimates (not to mention that it requires many additional milliseconds – see below). In many cases it's better to show the user nothing than to show wrong information. So in my own projects I decided to stop using estimated counts for large datasets and to design apps in a way where this info is not needed.

However, in some cases count(1) is of course useful and should be provided.

The main thing here is that Postgres has had a slow count(1) for years. This is a very serious problem, and people who use Postgres know how to handle it and how to live with it.

What I want to say is that, by default, count(1) is not needed. I propose to skip it by default and compute it only if the user asks for it. This is a separate issue from #255, since we would have 3 options:

1. no count at all
2. exact count (count(1))
3. estimated count

.. and large projects need the first one to be used by default, otherwise performance issues will be a real headache.

"Estimated count by default" is also no good, since on a loaded server it's better to avoid an additional planner execution on every request. Here is a quick example from one real production server with 5k TPS and dozens of millions of rows in the table:

XXXXXX=# explain select * from user where birthday <= '1980-01-01';
                                      QUERY PLAN
---------------------------------------------------------------------------------------
 Seq Scan on "user"  (cost=10000000000.00..10000273915.79 rows=1926654 width=405)
   Filter: (birthday <= '1980-01-01'::date)
(2 rows)

Time: 83.222 ms
XXXXXX=# explain select * from user where birthday <= '1980-01-01';
                                      QUERY PLAN
---------------------------------------------------------------------------------------
 Seq Scan on "user"  (cost=10000000000.00..10000273915.79 rows=1926654 width=405)
   Filter: (birthday <= '1980-01-01'::date)
(2 rows)

Time: 0.295 ms
XXXXXX=# explain select * from user where birthday > '2000-01-01';
                                     QUERY PLAN
-------------------------------------------------------------------------------------
 Seq Scan on "user"  (cost=10000000000.00..10000273915.79 rows=49839 width=405)
   Filter: (birthday > '2000-01-01'::date)
(2 rows)

Time: 24.530 ms
XXXXXX=# explain select * from user where birthday > '2000-01-01';
                                     QUERY PLAN
-------------------------------------------------------------------------------------
 Seq Scan on "user"  (cost=10000000000.00..10000273915.79 rows=49839 width=405)
   Filter: (birthday > '2000-01-01'::date)
(2 rows)

Time: 0.338 ms

– no one would be happy to pay an additional 20-80 ms on every request (yes, the 2nd call is quicker, but in production each request is effectively a first call – and what if you have many possible filters?).
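
For context, an "estimated count" here usually means asking the planner for its row estimate and parsing it out, e.g. with a helper along these lines (this is just the common technique, not something PostgREST ships; the name count_estimate is made up):

create or replace function count_estimate(query text) returns bigint as $$
declare
  plan json;
begin
  -- run only the planner, then read its row estimate from the JSON plan
  execute 'explain (format json) ' || query into plan;
  return (plan->0->'Plan'->>'Plan Rows')::bigint;
end;
$$ language plpgsql;

-- usage: select count_estimate('select * from "user" where birthday <= ''1980-01-01''');

Every call pays exactly the planning cost shown in the timings above.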

So I propose to disable count(1) by default. This will help mid-sized and big apps adopt this solution. The project itself is great and well designed, and, judging by my twitter feed, more and more people are interested in such solutions.

begriffs commented 9 years ago

You make good points. Seems like you've had some real production experience with postgresql-powered apps. I'm fine with getting rid of count by default. When I first wrote it I was unaware of the performance problems and thought I was getting "two queries for the price of one!"

Also can I get your opinion on https://github.com/begriffs/postgrest/issues/254 ?

NikolayS commented 9 years ago

Thank you. The "optimization 2" described in http://leopard.in.ua/2014/10/11/postgresql-paginattion/ is exactly what I talked about in the first paragraph.

This is a widely used technique, with its drawbacks.

I'd think twice about the cursor method for any heavily loaded OLTP project. pgbouncer is the de facto standard for connection pooling right now. I'm really not an expert on Postgres cursors, but AFAICS you're considering keeping a connection open between different requests – I'd expect problems with pgbouncer's "statement" and "transaction" pooling modes (which are usually used under heavy load). So cursors are maybe not the best choice for OLTP apps that use pgbouncer. They might still be a good fit in non-OLTP fields, where a low number of users works with large data sets.

Meanwhile, I wonder why you are thinking about implementing these tasks inside PostgREST at all. Is pagination really a task that should be implemented by the API engine? This can narrow the set of use cases where the project can be applied – I'd stick to being as low-level as possible, leaving this task to the end user, who can choose between true OFFSET and the "order-and-shift" approach for pagination. Both approaches can easily be achieved with the eq and gte/lte abilities that are already provided (see the sketch below). Maybe it's better to be as low-level as possible and think about providing a general ability to work with cursors for those who can afford them in their projects, allowing pagination or whatever else to be implemented on top of it?
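
For example, "order-and-shift" pagination is already expressible with nothing more than the existing filter and order parameters – a rough sketch against a hypothetical items endpoint, with the Range header PostgREST used for limiting at the time:

GET /items?order=id.desc
Range-Unit: items
Range: 0-19

GET /items?id=lt.12345&order=id.desc
Range-Unit: items
Range: 0-19

The second request assumes the lt operator alongside the gte/lte ones mentioned above; 12345 stands for the last id returned by the first request.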

NikolayS commented 9 years ago

Maybe I didn't explain my thoughts clearly enough in the previous message. What I propose regarding the pagination problem is to improve PostgREST's capabilities in more "low-level" areas. So, in the examples mentioned above:

I hope the general idea of being "as low-level as possible" is clear now. It is always better to do one thing with excellent quality, better than anyone else, than to do many things with so-so quality.

NikolayS commented 9 years ago

Edited. Sorry for the mess in my English.

jmealo commented 9 years ago

I generally agree with @NikolayS about keeping things low-level. I'd like to see robust RPC and being able to fully exploit plv8 before we get into features that not everyone wants/needs. Once people start posting examples using plv8 and RPC to do advanced things with PostgREST I think that will get people thinking outside the box and give them the tools they need to work around nearly any limitation.

@begriffs Have you considered using a configuration file and/or request headers to enable/disable controversial features (high performance penalty and/or limited use case)? I personally would prefer anything with such a large performance impact to be an "opt-in" but I can see the value in "sane defaults".

If a client-side dev knows how to send custom HTTP headers, I'd venture to say they know how to implement pagination. As you said before, giving a way to retrieve the count without any results is valuable independently of this discussion.

Additionally, when using RLS, we have to reconsider what we know about query performance (including counts) in PostgreSQL and make sure we re-test our previously held assumptions. I have found that enforcing ACLs in white-label or complex organizational scenarios can be the most expensive part of a query.

Big picture stuff

The beauty of PostgREST is showing what we can accomplish in the database without heavy middleware and/or app layers. We need to see good working examples of table partitions coupled with triggers, RLS, and materialized views (pop redis-fdw in there for the cool kids).

I think it's hard to realize how all of these pieces can fit together without seeing it. To help beginners, it might be worth saying "This works best with write-once tables" or "This doesn't work well with tables that are updated often", etc.

This is why I lean towards low-level and highly flexible. What if I want my counts to come from a materialized view? How can I do that without writing Haskell?
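
For what it's worth, the materialized-view flavour of counting is already possible without touching Haskell – a rough SQL sketch with made-up names (depending on the PostgREST version, a plain view on top of it may be needed to expose it):

-- keep the expensive count precomputed
create materialized view items_count as
  select count(*) as total from items;

-- refresh on whatever schedule or trigger fits the table's write pattern
refresh materialized view items_count;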

NikolayS commented 9 years ago

Any news on this? For myself, I just commented out the count(1) parts of the SQL and recompiled, but for the project itself something different is needed.

begriffs commented 8 years ago

So you can now suppress the count, but by default it is calculated and included. In v0.3 how about we switch the default to no count, while allowing a client to send Prefer: count=exact? Later we can add count=approximate as well.
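
Under that proposal, a client that really wants the total would opt in explicitly – roughly like this, against a hypothetical items endpoint (the exact response shape may vary by version):

curl -i -H "Prefer: count=exact" http://localhost:3000/items

HTTP/1.1 200 OK
Content-Range: 0-19/1926654

Without the preference, the total in Content-Range would simply be unreported (e.g. 0-19/*), so the expensive count(1) never runs.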

begriffs commented 8 years ago

Reviewing this issue, I realize that counts are still included by default in 0.3. Oops, an oversight on my part. :cry:

This change will semantically require a major version bump since it can break clients. I'll add it to the newly created v0.4 milestone.

NikolayS commented 8 years ago

There is a collision between Prefer: count=none and Prefer: return=representation – if you always want "count=none" by default and put it into the code library that talks to the API, and at the same time want to tell the API to return the just-created entities, sooner or later you (or your client-side developer, as in my case) will find yourself with a headache patching the client-side logic (react & react-native in my case).

I'm still 100% sure that "count=none" should be the default logic, since in a big project it's the only way to survive.

We ended up with this (excessive for some queries) solution – we add this to all queries:

Prefer: count=none; return=representation

The bottom line is: the current approach of counts enabled by default, plus the collision in header naming for different instructions, makes PostgREST harder to understand and introduces a possible source of errors.

P.S. Any ideas why adding Prefer: count=none; return=representation to all queries might be a bad thing?

NikolayS commented 8 years ago

My own "oops" ;-) – I've just read that count=none by default is planned for v0.4. 👍

ruslantalpa commented 8 years ago

@NikolayS you can add those headers on the fly in nginx and remove them from the client code. It might even be more secure, since clients won't be able to request counts by mistake.
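
A minimal nginx sketch of that idea, assuming PostgREST is proxied under /api/ and listens on 127.0.0.1:3000 (all names made up):

location /api/ {
    # inject the preference on every proxied request, overriding whatever the client sent
    proxy_set_header Prefer "count=none";
    proxy_pass http://127.0.0.1:3000/;
}

Adding return=representation to that value as well would reproduce the exact header combination discussed above.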

NikolayS commented 8 years ago

so this will be included in v0.4? when is it expected?

begriffs commented 8 years ago

Yep it'll be in 0.4 along with some other things that are breaking changes.