ajdavis closed this issue 9 years ago
Hi ajdavis,
The same mongodb manual page states:
Consider using range-based pagination for these kinds of tasks. That is, query for a range of objects, using logic within the application to determine the pagination rather than the database itself. This approach features better index utilization, if you do not need to easily jump to a specific page.
You can achieve this using the filter query parameter, thus avoiding making MongoDB walk the entire index.
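To make the idea concrete, here is a minimal in-memory sketch (plain JavaScript, names hypothetical) of what range-based paging via a filter looks like, as opposed to skipping:

```javascript
// Range-based ("keyset") paging: each request filters on the last _id
// seen, the in-memory analogue of filter={"_id": {"$gt": <lastId>}}.
const docs = Array.from({ length: 100 }, (_, i) => ({ _id: i + 1 }));

function nextPage(lastId, limit = 10) {
  // MongoDB would satisfy this predicate with an index seek on _id,
  // not a linear walk of skipped documents.
  return docs.filter(d => d._id > lastId).slice(0, limit);
}

const page1 = nextPage(0);                           // _id 1..10
const page2 = nextPage(page1[page1.length - 1]._id); // _id 11..20
console.log(page2[0]._id, page2[page2.length - 1]._id); // 11 20
```

Each page costs the same regardless of how deep into the result set the client is, which is exactly what skip cannot provide.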
In any case, your suggestion to somehow cache data goes in the right direction. On the other hand, it is mandatory to keep RESTHeart completely stateless in order to easily scale it horizontally; consider that a collection keeps changing due to writes, so we would need to invalidate cached objects not only in the RESTHeart instance that actually executed the update, but in all of the cluster's instances.
In order to achieve this, we plan to add a distributed cache between RESTHeart and MongoDB, thus avoiding having queries actually executed every time.
This will achieve a terrific performance improvement.
Any feedback?
Thank you for your help.
A cache certainly could help, assuming that a cached query with a large skip is constant-time rather than linear. Depends on all the things that cache performance usually depends on, but you know your expected usage much better than I do. =)
I wonder: if a client asks for the first 10 results, should you cache many more results from the query, anticipating that this client (or some other) is likely to ask for other pages in the same result set?
These are reasonable workarounds, but my general feedback is this: you're creating an API that encourages paging with skip and limit, which MongoDB is especially weak at. Solving this problem isn't this project's raison d'être, but this problem is going to take up a large portion of your project's effort for the rest of your life. Do you want to make this commitment now? I suggest dropping support for paging now, while there's still time, and finding an alternative design instead.
Hi,
I got your point and I spent some time on this issue.
I just pushed some experimental code to the feature/issue-1 branch: there, the CollectionDAO uses a DBCursor pool, trying to reuse cursors that have already been skipped, based on previous requests. Preliminary tests show an impressive improvement on page GETs with big skips, despite the simple cursor pre-instantiation policy (I'll share some test data soon).
However, I need to better understand how the cursors in the pool behave with regard to concurrent inserts.
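For readers following along, the pool's reuse idea can be sketched like this (a hypothetical simplification in JavaScript, not the actual CollectionDAO code):

```javascript
// A cursor pool: to serve a request with skip = N, reuse a cursor that
// is already positioned at some M <= N, so only N - M documents must be
// walked instead of N.
class FakeCursor {
  constructor() { this.pos = 0; }
  advanceTo(n) {                    // returns how many documents were walked
    const cost = n - this.pos;
    this.pos = n;
    return cost;
  }
}

class CursorPool {                  // in reality, one pool per (filter, sort) key
  constructor() { this.cursors = []; }
  acquire(skip) {
    // Pick the pooled cursor closest to (but not past) the requested skip.
    let best = null;
    for (const c of this.cursors) {
      if (c.pos <= skip && (best === null || c.pos > best.pos)) best = c;
    }
    if (best) this.cursors.splice(this.cursors.indexOf(best), 1);
    else best = new FakeCursor();
    return { cursor: best, cost: best.advanceTo(skip) };
  }
  release(cursor) { this.cursors.push(cursor); }
}

const pool = new CursorPool();
const cold = pool.acquire(1000);    // cold: must walk 1000 documents
pool.release(cold.cursor);
const warm = pool.acquire(1100);    // warm: walks only the 100-document gap
console.log(cold.cost, warm.cost);  // 1000 100
```

The concurrent-inserts concern is visible here too: a pooled cursor's position reflects the collection as it was when the cursor was advanced, so writes between requests can shift what "skip 1100" means.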
Cool, that sounds great. Two things to note:
done!
the eager cursor preallocation engine is now part of the development branch and addresses this issue. some tests show a 10x performance improvement! (we are going to post details about this on our blog soon).
this enhancement will be part of the upcoming 0.9.8 release.
thanks for guiding us in the right direction.
Cool, glad to help!
FYI: we just published some performance test results.
in most of the test scenarios, RESTHeart achieves better performance than direct access with the MongoDB Java driver!
Notably, on finds that involve skipping many documents, as you pointed out, MongoDB performs poorly while RESTHeart, with its eager DBCursor preallocation engine, does very well.
Instead of skip we should go with range-based paging. For example:
a) If we want to see records without any sorting, we should use {_id: {$gt: <lastId>}} in the where clause.
b) If we want sorted results, the simple _id approach fails when the sort key has duplicates. For example, with documents whose (_id, value) pairs are (1, a), (2, b), (3, c), (4, a), (5, b), (6, c), paging on _id alone breaks the sorted order. To handle this we have to modify the where clause as below:
```js
db.<collection>.find(
  { $or: [
      { $and: [
          { <sort key>: { $gte: <last sort key value of previous page> } },
          { _id: { $gt: <last result id of previous page> } }
      ] },
      { <sort key>: { $gt: <last sort key value of previous page> } }
  ] }
).sort({ <sort key>: 1, _id: 1 })  // _id as tiebreaker keeps the order stable
 .limit(10)
```
This will work for you 👍 .... I have written demo code where I implemented this with an application-level cache to maintain the last-page info.
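As a sanity check, the keyset logic above can be exercised in-memory with the duplicate-sort-key data from the example (a hypothetical JavaScript sketch, not the demo code mentioned):

```javascript
// Documents with duplicate sort-key values, as in the (1,a)..(6,c) example.
const docs = [
  { _id: 1, v: "a" }, { _id: 2, v: "b" }, { _id: 3, v: "c" },
  { _id: 4, v: "a" }, { _id: 5, v: "b" }, { _id: 6, v: "c" },
];

// Sort by (v, _id): the _id tiebreaker makes the order deterministic.
const sorted = [...docs].sort(
  (x, y) => (x.v < y.v ? -1 : x.v > y.v ? 1 : x._id - y._id)
);

// In-memory analogue of:
// { $or: [ { $and: [ {v: {$gte: lastV}}, {_id: {$gt: lastId}} ] },
//          { v: {$gt: lastV} } ] }
function nextPage(lastV, lastId, limit) {
  return sorted
    .filter(d => (d.v >= lastV && d._id > lastId) || d.v > lastV)
    .slice(0, limit);
}

// Sorted order is (1,a) (4,a) (2,b) (5,b) (3,c) (6,c).
const page1 = sorted.slice(0, 2);            // (1,a) (4,a)
const last = page1[page1.length - 1];
const page2 = nextPage(last.v, last._id, 2);
console.log(page2.map(d => d._id));          // [ 2, 5 ]
```

Note that the $gte branch paired with the _id condition is what prevents documents sharing the last sort-key value from being skipped or repeated.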
Hi @AjitDeswal
when dealing with large collections, range-based paging is indeed the best approach; this is also suggested in the mongodb documentation:
Consider using range-based pagination for these kinds of tasks. That is, query for a range of objects, using logic within the application to determine the pagination rather than the database itself. This approach features better index utilization, if you do not need to easily jump to a specific page.
Now, RESTHeart returns paged results when GETting a collection; when a client requests GET /db/collection?page=X,
the cursor pool feature discussed here preallocates cursors to speed up requests for pages near X, following the principle of locality.
This happens regardless of whether the client specifies a range-based query.
So for instance:
GET /db/collection?filter={"$and":[{"a": {"$gt": 100}}, {"a": {"$lt": 200}}]}&sort={"a": 1}
leads to preallocating cursors with the same filter and sort options.
These preallocated cursors will speed up:
GET /db/collection?filter={"$and":[{"a": {"$gt": 100}}, {"a": {"$lt": 200}}]}&sort={"a": 1}
GET /db/collection?page=2&filter={"$and":[{"a": {"$gt": 100}}, {"a": {"$lt": 200}}]}&sort={"a": 1}
But the following request will not be sped up:
GET /db/collection?filter={"$and":[{"a": {"$gt": 200}}, {"a": {"$lt": 300}}]}&sort={"a": 1}
As far as I understand, you are proposing something like:
GET /db/collection?page=X&rangePaging={"$and":[{"a": {"$gt": "$page*100"}}, {"a": {"$lt": "$page*100+100"}}]}&sort={"a": 1}
The rangePaging parameter would tell RESTHeart how to page the collection, and the cursor pool would preallocate cursors accordingly.
In essence, we would no longer use skip, but pure range-based paging...
Have I got your point?
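If so, a hypothetical rangePaging template would expand per page roughly like this (parameter and function names are illustrative, not an actual RESTHeart API):

```javascript
// Expand a range-paging template for a given page number: substitute
// $page into the bounds to produce a concrete filter document.
function rangeFilter(page, width = 100) {
  const lo = page * width;
  return { $and: [{ a: { $gt: lo } }, { a: { $lt: lo + width } }] };
}

console.log(JSON.stringify(rangeFilter(2)));
// {"$and":[{"a":{"$gt":200}},{"a":{"$lt":300}}]}
```

The cursor pool could then preallocate cursors for rangeFilter(X-1), rangeFilter(X+1), and so on, instead of cursors at different skip positions.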
MongoDB queries with a large "skip" slow down linearly. As the manual says:
I'm concerned that RESTHeart's API encourages users to page through large result sets using large "skip" values, see CollectionDAO line 185. Paging through an entire result set this way is quadratic.
Instead of creating a new DBCursor object with a "skip" for each HTTP request, I suggest keeping the DBCursor object in memory and iterating it to return the next page of data. You can evict a partially-iterated cursor from memory after 10 minutes of inactivity since it will be deleted on the MongoDB server after 10 minutes. This will give optimal (linear) performance to iterate an entire result set.
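One possible shape for that in-memory cursor cache, with eviction tied to the server's idle-cursor timeout (a sketch with invented names, not RESTHeart code):

```javascript
// Cache partially-iterated cursors keyed by query, and evict entries
// idle longer than the server-side timeout: MongoDB deletes an idle
// cursor on the server after about 10 minutes, so a stale cache entry
// would point at a dead cursor anyway.
const TTL_MS = 10 * 60 * 1000;

class CursorCache {
  constructor(now = Date.now) {   // injectable clock, for testing
    this.map = new Map();
    this.now = now;
  }
  get(queryKey) {
    const entry = this.map.get(queryKey);
    if (!entry) return null;
    if (this.now() - entry.lastUsed > TTL_MS) {
      this.map.delete(queryKey);  // expired along with the server cursor
      return null;
    }
    entry.lastUsed = this.now();
    return entry.cursor;
  }
  put(queryKey, cursor) {
    this.map.set(queryKey, { cursor, lastUsed: this.now() });
  }
}

// Usage with a controllable clock:
let t = 0;
const cache = new CursorCache(() => t);
cache.put("find:{}", { id: 42 });
console.log(cache.get("find:{}")); // { id: 42 }
t += TTL_MS + 1;
console.log(cache.get("find:{}")); // null (evicted after 10 min idle)
```

With this, page N+1 for a known query is a constant-cost "iterate the cached cursor a bit further" rather than a fresh linear skip.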