SoftInstigate / restheart

Rapid API Development with MongoDB
https://restheart.org
GNU Affero General Public License v3.0

Poor performance paging through large result sets #1

Closed ajdavis closed 9 years ago

ajdavis commented 9 years ago

MongoDB queries with a large "skip" slow down linearly. As the manual says:

The cursor.skip() method is often expensive because it requires the server to walk from the beginning of the collection or index to get the offset or skip position before beginning to return results. As offset (e.g. pageNumber above) increases, cursor.skip() will become slower and more CPU intensive. With larger collections, cursor.skip() may become IO bound.

I'm concerned that RESTHeart's API encourages users to page through large result sets using large "skip" values, see CollectionDAO line 185. Paging through an entire result set this way is quadratic.
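
(For illustration only, not the actual CollectionDAO code: with the legacy Java driver, serving page N of size S this way amounts to something like the sketch below, so the number of skipped documents, and with it the cost of each page, grows linearly with N.)

import com.mongodb.DBCollection;
import com.mongodb.DBCursor;
import com.mongodb.DBObject;

// hypothetical sketch: page-based paging expressed with skip/limit;
// the server still walks past all (pageNumber - 1) * pagesize documents
DBCursor page(DBCollection coll, DBObject query, DBObject sortBy,
              int pageNumber, int pagesize) {
    return coll.find(query)
               .sort(sortBy)
               .skip((pageNumber - 1) * pagesize)
               .limit(pagesize);
}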

Instead of creating a new DBCursor object with a "skip" for each HTTP request, I suggest keeping the DBCursor object in memory and iterating it to return the next page of data. You can evict a partially-iterated cursor from memory after 10 minutes of inactivity since it will be deleted on the MongoDB server after 10 minutes. This will give optimal (linear) performance to iterate an entire result set.
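
A minimal sketch of that idea (the Guava-based cache and the query-signature key are just assumptions on my part, not RESTHeart code):

import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;
import com.mongodb.DBCursor;
import java.util.concurrent.TimeUnit;

// hypothetical sketch: park partially-iterated cursors keyed by a query
// signature and evict them after 10 minutes of inactivity, matching the
// server-side cursor timeout (evicted cursors should also be closed)
class CursorRegistry {
    private final Cache<String, DBCursor> cursors = CacheBuilder.newBuilder()
            .expireAfterAccess(10, TimeUnit.MINUTES)
            .build();

    DBCursor resume(String querySignature) {
        return cursors.getIfPresent(querySignature); // null if never seen or evicted
    }

    void park(String querySignature, DBCursor cursor) {
        cursors.put(querySignature, cursor);
    }
}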

ujibang commented 9 years ago

Hi ajdavis,

The same MongoDB manual page states:

Consider using range-based pagination for these kinds of tasks. That is, query for a range of objects, using logic within the application to determine the pagination rather than the database itself. This approach features better index utilization, if you do not need to easily jump to a specific page.

You can achieve this using the filter query parameter, thus avoiding making MongoDB walk the entire index.
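
For instance, paging on _id (just a sketch; the placeholder is the last _id returned by the previous page):

GET /db/collection?filter={"_id": {"$gt": <last _id of previous page>}}&sort={"_id": 1}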

In any case, your suggestion to cache data goes in the right direction. On the other hand, it is mandatory to keep RESTHeart completely stateless in order to easily scale it horizontally; consider that a collection keeps changing due to writes, and we would need to invalidate cached objects not only in the RESTHeart instance that actually executed the update, but in all of the cluster's instances.

In order to achieve this, we plan to add a distributed cache between RESTHeart and MongoDB, so that queries don't actually have to be executed every time.

This will achieve a terrific performance improvement.

Any feedback?

Thank you for your help.

ajdavis commented 9 years ago

A cache certainly could help, assuming that a cached query with a large skip is constant-time rather than linear. Depends on all the things that cache performance usually depends on, but you know your expected usage much better than I do. =)

I wonder: if a client asks for the first 10 results, should you cache many more results from the query, anticipating that this client (or some other) is likely to ask for other pages in the same result set?

These are reasonable workarounds, but my general feedback is this: you're creating an API that encourages paging with skip and limit, which MongoDB is especially weak at. Solving this problem isn't this project's raison d'être, but this problem is going to take up a large portion of your project's effort for the rest of your life. Do you want to make this commitment now? I suggest dropping support for paging now, while there's still time, and finding an alternative design instead.

ujibang commented 9 years ago

Hi,

I got your point and I spent some time on this issue.

I just pushed some experimental code in the feature/issue-1 branch: there, the CollectionDAO uses a DBCursor pool, trying to reuse cursors that have already been skipped by previous requests. Preliminary tests show an impressive improvement on page GETs with big skips, despite the simple cursor pre-instantiation policy (I'll share some test data soon).
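
Roughly, the idea looks like this (a much simplified sketch, not the actual branch code):

import com.mongodb.DBCursor;
import java.util.Map;
import java.util.concurrent.ConcurrentSkipListMap;
import java.util.function.Supplier;

// simplified sketch of the pool idea: keep cursors parked at known skip
// positions (per query) and serve a request for skip = S from the parked
// cursor with the greatest position <= S, advancing only the difference
class CursorPoolSketch {
    private final ConcurrentSkipListMap<Integer, DBCursor> parked = new ConcurrentSkipListMap<>();

    DBCursor acquire(int skip, Supplier<DBCursor> newCursor) {
        Map.Entry<Integer, DBCursor> best = parked.floorEntry(skip);

        if (best != null && parked.remove(best.getKey(), best.getValue())) {
            DBCursor cursor = best.getValue();
            for (int pos = best.getKey(); pos < skip && cursor.hasNext(); pos++) {
                cursor.next(); // walk only the remaining gap
            }
            return cursor;
        }

        return newCursor.get().skip(skip); // fall back to a full skip
    }

    void park(int position, DBCursor cursor) {
        parked.put(position, cursor);
    }
}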

However, I need to better understand how the cursors in the pool behave with regard to concurrent inserts.

ajdavis commented 9 years ago

Cool, that sounds great. Two things to note:

  1. Cursor timeouts. MongoDB supports a timeout policy of 10 minutes or never. I suggest sticking with the default 10-minute policy and being prepared to catch a CursorNotFound exception if there's 10 minutes of inactivity on a cursor; a rough sketch of handling this follows the list. (The downside of the "never" policy is that it commits server resources to the cursor until the cursor is fully iterated or manually closed, or the server is rebooted. That's risky if your application server is killed while cursors are open on the server.)
  2. Insertions: if you're iterating a sorted cursor that uses an index for the sort, then a document inserted after the cursor's position will appear in the results. A document inserted before the cursor's position is not included, and it doesn't affect the cursor's position or the meaning of the current "skip". If you're iterating a sorted cursor without an index, then I believe that a document inserted after the cursor begins iterating does not appear in the results, period. If you're iterating an unordered cursor, then it's undefined whether an inserted document is included or not; it depends on the location of the free slot MongoDB chooses to store the new document.
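
For example, the timeout case could be handled something like this (a sketch only; the exception class name is the one from newer Java drivers, and the rebuild logic is just an assumption):

import com.mongodb.DBCollection;
import com.mongodb.DBCursor;
import com.mongodb.DBObject;
import com.mongodb.MongoCursorNotFoundException;

// sketch: if a pooled cursor timed out server-side after 10 minutes of
// inactivity, the next read fails; fall back to a fresh cursor positioned
// with a plain skip to where the old one left off
DBObject nextDocument(DBCursor pooled, DBCollection coll, DBObject query,
                      DBObject sortBy, int alreadyConsumed) {
    try {
        return pooled.hasNext() ? pooled.next() : null;
    } catch (MongoCursorNotFoundException e) {
        DBCursor fresh = coll.find(query).sort(sortBy).skip(alreadyConsumed);
        return fresh.hasNext() ? fresh.next() : null;
    }
}
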
ujibang commented 9 years ago

done!

the eager cursor preallocation engine is now part of the development branch and addresses this issue. some tests show a 10x performance improvement! (we are going to post details about this on our blog soon).

this enhancement will be part of the upcoming 0.9.8 release.

thanks for guiding us in the right direction.

ajdavis commented 9 years ago

Cool, glad to help!

ujibang commented 9 years ago

FYI: we just published some performance test results.

in most of the test scenarios, RESTHeart achieves better performance than direct access with the MongoDB Java driver!

Notably, on finds that involve skipping many documents, where, as you pointed out, MongoDB performs poorly, RESTHeart does very well thanks to its eager DBCursor preallocation engine.

AjitDeswal commented 7 years ago

Instead of skip, we should go with range-based paging. For example:

a) If we want to see records without any sorting, we can use {_id: {$gt: <lastId>}} in the where clause.

b) If we want to see sorted results, that approach fails when the sort key has duplicate values. For example, take these documents:

_id  value
1    a
2    b
3    c
4    a
5    b
6    c

Paging on the sort key alone breaks here, so we have to modify our where clause as below:

db.<collection>.find(
{$or:
    [
        {$and :
            [
                {<sort key>: {$gte: <last sort key value of previous page>}},
                {"_id": {$gt: <last result id of previous page>}}
            ]
        },
        {<sort key>: {$gt: <last sort key value of previous page>}}
    ]
}
).sort({<sort key>: 1}).limit(10)

This will work for you 👍. I have written demo code where I implemented this with an application-level cache to maintain the last page info.
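
For instance, the last-page info could be kept like this (a sketch; the class, field names, and key choice are just assumptions):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// sketch of an application-level "last page" cache for range-based paging:
// after serving a page, remember the last sort-key value and last _id so the
// next-page query can be built from them instead of using skip
class LastPageCache {
    static final class PageMark {
        final Object lastSortKeyValue;
        final Object lastId;
        PageMark(Object lastSortKeyValue, Object lastId) {
            this.lastSortKeyValue = lastSortKeyValue;
            this.lastId = lastId;
        }
    }

    // keyed by something identifying the client and query, e.g. a session id
    private final Map<String, PageMark> marks = new ConcurrentHashMap<>();

    void remember(String key, Object lastSortKeyValue, Object lastId) {
        marks.put(key, new PageMark(lastSortKeyValue, lastId));
    }

    PageMark lookup(String key) {
        return marks.get(key);
    }
}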

ujibang commented 7 years ago

Hi @AjitDeswal

when dealing with large collections, range-based paging is indeed the best approach; this is also suggested in the MongoDB documentation:

Consider using range-based pagination for these kinds of tasks. That is, query for a range of objects, using logic within the application to determine the pagination rather than the database itself. This approach features better index utilization, if you do not need to easily jump to a specific page.

Now, RESTHeart returns paged results when GETting a collection; when a client requests GET /db/collection?page=X, the cursor pool feature discussed here pre-allocates cursors to speed up requests for pages near X, following the "principle of locality".

This happens whether or not the client specifies a range-based query.

So for instance:

GET /db/collection?filter={"$and":[{"a": {"$gt": 100}}, {"a": {"$lt": 200}}]}&sort={"a": 1}

leads to preallocating cursors with the same filter and sort options.

These preallocated cursors will speed up:

GET /db/collection?filter={"$and":[{"a": {"$gt": 100}}, {"a": {"$lt": 200}}]}&sort={"a": 1}
GET /db/collection?page=2&filter={"$and":[{"a": {"$gt": 100}}, {"a": {"$lt": 200}}]}&sort={"a": 1}

But the following request will not be sped up:

GET /db/collection?filter={"$and":[{"a": {"$gt": 200}}, {"a": {"$lt": 300}}]}&sort={"a": 1}

If I understand correctly, you are proposing something like:

GET /db/collection?page=X&rangePaging={"$and":[{"a": {"$gt": "$page*100"}}, {"a": {"$lt": "$page*100+100"}}]}&sort={"a": 1}

The rangePaging parameter would tell RESTHeart how to page the collection, and the cursor pool would preallocate cursors accordingly.

In essence, we don't use skip anymore but pure range based paging...

Have I got your point?