SoftInstigate / restheart

Rapid API Development with MongoDB
https://restheart.org
GNU Affero General Public License v3.0

Question > How to preload less than 1000 documents? #296

Closed Musikolo closed 6 years ago

Musikolo commented 6 years ago

Hi,

I have a question I would like to get some guidance for. I'm not sure if this is the best place to ask questions, but I haven't found any better place. I'll be happy to ask again wherever I'm instructed, if this is not the place for it. I've read the documentation carefully, but I didn't find the information I'm looking for.

Question: I've got a really big collection with 1.3 billion documents. I filter on a field to get the documents that belong to each user. The number of documents per user ranges from a few thousand up to 80K. More than 98% of calls only need to show the first 25 documents. There is a "More documents" button to get the next page.

I've noticed that RESTHeart preloads 1000 documents all the time, which turns out to be very expensive. I thought that was related to the default use of linear cursors. I've tried changing the value from 1000 to 100, but it didn't work for me.

So, how can I make RESTHeart preload a smaller number of documents, say 100, since in my application most users will be happy with the first 25 documents (page=1)?

Thank you!

mkjsix commented 6 years ago

Note that documentation on Confluence is no longer maintained. Please refer to restheart.org/learn

Cursor pools don't return documents, they serve a different purpose:

RESTHeart speeds up the execution of GET requests to collection resources via its db cursor pre-allocation engine. This applies when many documents need to be read from a big collection, and it moderates the effects of the MongoDB cursor.skip() method, whose cost grows linearly with the number of documents skipped. In common scenarios, RESTHeart’s db cursor pre-allocation engine delivers excellent performance, even up to a 1000% increase over querying MongoDB directly with its Java driver.

What you want is pagination:

Embedded documents are always paginated, i.e. only a subset of the collection’s documents is returned on each request. The number of documents to return is controlled via the pagesize query parameter. Its default value is 100 and the maximum allowed value is 1000. The page to return is specified with the page query parameter. The pagination links (first, last, next, previous) are only returned in HAL full mode (hal=f query parameter); see HAL mode for more information.
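
As a rough illustration of the two query parameters described above, the following Python sketch builds the URL for one page of a collection. The base URL `http://localhost:8080/db/docs` and the helper name `page_url` are hypothetical; only the `page`, `pagesize` and `hal=f` parameters and the 1000 upper bound come from the quoted documentation. Treating `page` as 1-based is an assumption taken from the "page=1" wording in the question.

```python
from urllib.parse import urlencode


def page_url(collection_url, page=1, pagesize=100, hal_full=False):
    """Build a GET URL for one page of a collection (hypothetical helper)."""
    if pagesize > 1000:
        # 1000 is the maximum pagesize quoted in the documentation above
        raise ValueError("pagesize must not exceed 1000")
    if page < 1:
        # assumption: pages are numbered starting from 1
        raise ValueError("page must be >= 1")
    params = {"page": page, "pagesize": pagesize}
    if hal_full:
        # hal=f asks for the pagination links (first, last, next, previous)
        params["hal"] = "f"
    return f"{collection_url}?{urlencode(params)}"


# First page of 25 documents, as in the use case from the question
print(page_url("http://localhost:8080/db/docs", page=1, pagesize=25))
```

The "More documents" button from the question would then simply request the same URL with `page` incremented by one.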

You should have a look at:

ujibang commented 6 years ago

RESTHeart actually uses a mongodb cursor batch size of 1000, which is the maximum pagesize value.

See BATCH_SIZE in https://github.com/SoftInstigate/restheart/blob/master/src/main/java/org/restheart/db/CollectionDAO.java

This is needed to avoid poor performance with requests with high pagesizes, see https://github.com/SoftInstigate/restheart/issues/218

I guess that this value might be problematic with your huge collection.
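
To see why a large batch size can hurt when clients only need a few documents, here is a deliberately simplified toy model in Python. It assumes a cursor that delivers whole batches, has no limit applied, and has at least as many matching documents as are fetched. It ignores everything else a real MongoDB cursor does (for example size caps on a batch), so the numbers are illustrative only. The function name `documents_fetched` is made up for this sketch.

```python
import math


def documents_fetched(needed, batch_size):
    """Toy model: documents transferred to serve `needed` documents
    from a cursor that always delivers full batches of `batch_size`."""
    if needed <= 0:
        return 0
    # the cursor cannot deliver a partial batch in this model,
    # so the count is rounded up to a whole number of batches
    return math.ceil(needed / batch_size) * batch_size


for batch_size in (1000, 100, 25):
    fetched = documents_fetched(25, batch_size)
    print(f"batch size {batch_size}: {fetched} documents fetched for a page of 25")
```

Under these assumptions, a page of 25 documents costs a full batch of 1000 with the hard-coded value, which matches the intuition that the value "might be problematic" for a huge collection.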

What if we allow defining a maximum pagesize via configuration file, and also use this value as cursor batch size?

Musikolo commented 6 years ago

Hi guys,

First of all, thanks for your ultra-quick response. I really appreciate it.

Yes, I think that what I saw in the logs was a batchSize of 1000. So, it would definitely help to have this property configurable. At the very least, it will be useful to fine tune the performance of each application to their specific needs. It could keep the same value by default.

Assuming this change is simple to do, what is your best guess for a release with it? My current MongoDB version is 3.4.14. I hope there are no incompatibilities.

Thank you!

Musikolo commented 6 years ago

I was thinking about this matter, and I want to point out something that might be worth exploring.

Would it be possible to add support for the limit(x) operation? I'm aware of the performance issues when paging with skip() and limit() operations, but for the very first pages of a cursor, it's an option that could be beneficial overall. Thoughts?

ujibang commented 6 years ago

The pagesize query parameter does exactly this, i.e. ?pagesize=100 results in .limit(100).

?page controls the skips.
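
Conceptually, the mapping described here can be sketched as follows. This is only the arithmetic, assuming pages are numbered from 1. It is not RESTHeart's actual code, which may serve the skip from its cursor pool rather than calling `.skip()` directly. The function name `to_skip_limit` is hypothetical.

```python
def to_skip_limit(page, pagesize):
    """Translate ?page and ?pagesize into (skip, limit) values.

    Assumes 1-based page numbers; a sketch of the arithmetic only.
    """
    if page < 1:
        raise ValueError("page must be >= 1")
    if pagesize < 0:
        raise ValueError("pagesize must be >= 0")
    skip = (page - 1) * pagesize  # documents on earlier pages are skipped
    limit = pagesize              # at most one page of documents is returned
    return skip, limit


print(to_skip_limit(page=1, pagesize=100))  # first page: nothing skipped
print(to_skip_limit(page=3, pagesize=25))   # third page: two pages skipped
```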

Musikolo commented 6 years ago

Well, it's not exactly the same, because RESTHeart is using some built-in logic in combination with the default batch size (=1000) to do the paging. It's not using MongoDB native .skip(x).limit(y) operations. Anyway, it was just a thought I wanted to share.

I've cloned the code, changed the BATCH_SIZE constant to 100, and run a test with JMeter in my test environment. This is what I get after 30 minutes running:

With original BATCH_SIZE=1000: summary = 9804 in 00:30:52 = 5.3/s Avg

With custom BATCH_SIZE=100: summary = 30561 in 00:30:25 = 16.7/s Avg

This is more than 3 times faster for the case I tested. So, I definitely think that making this property configurable is really valuable.
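
The throughput figures and the "more than 3 times" ratio can be re-derived from the two JMeter summary lines above. The short Python check below uses only the request counts and durations quoted in the thread; the helper name `throughput` is made up for the example.

```python
def throughput(requests, hours, minutes, seconds):
    """Average requests per second over the given duration."""
    elapsed = hours * 3600 + minutes * 60 + seconds
    return requests / elapsed


before = throughput(9804, 0, 30, 52)   # BATCH_SIZE=1000
after = throughput(30561, 0, 30, 25)   # BATCH_SIZE=100

print(f"BATCH_SIZE=1000: {before:.1f} req/s")
print(f"BATCH_SIZE=100:  {after:.1f} req/s")
print(f"speed-up: {after / before:.2f}x")
```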

Thank you so much for the great support!

mkjsix commented 6 years ago

@Musikolo we agree, we'll put this into the next minor release, in the following days.

ujibang commented 6 years ago

I just added 3 new options to the configuration file (`default-pagesize`, `max-pagesize` and `cursor-batch-size`) that tune the overall read performance according to the expected pagesize.

The default values are 100, 1000 and 1000 respectively. In your case, where you expect 98% of requests with pagesize=25, you could set:

default-pagesize: 25
max-pagesize: 100
cursor-batch-size: 25

see commit 20016ae83ca9749718ca7f63ecf24f299b46d56c

The default configuration "Read Performance" section follows:

## Read Performance

default-pagesize: 100

# default-pagesize is the number of documents returned when the pagesize query 
# parameter is not specified
# see https://restheart.org/learn/query-documents/#paging

max-pagesize: 1000

# max-pagesize sets the maximum allowed value of the pagesize query parameter
# generally, the greater the pagesize, the more json serialization overhead occurs
# the rule of thumb is not exceeding 1000

cursor-batch-size: 1000

# cursor-batch-size sets the mongodb cursor batchSize
# see https://docs.mongodb.com/manual/reference/method/cursor.batchSize/
# cursor-batch-size should be smaller than or equal to the max-pagesize
# the rule of thumb is setting cursor-batch-size equal to max-pagesize
# a small cursor-batch-size (e.g. 101, the default mongodb batchSize)
# speeds up requests with small pagesize
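
The constraints stated in these comments can be checked mechanically before deploying a configuration. The Python sketch below is a hypothetical helper, not part of RESTHeart. It takes the three options as a plain dictionary and reports violations. The check that `cursor-batch-size` must not exceed `max-pagesize` comes from the comments above; the check that `default-pagesize` must not exceed `max-pagesize` is an added assumption. The fallback values mirror the default configuration shown above.

```python
def check_read_performance(cfg):
    """Return a list of problems found in the read-performance options."""
    # fall back to the defaults from the configuration section above
    default_pagesize = cfg.get("default-pagesize", 100)
    max_pagesize = cfg.get("max-pagesize", 1000)
    cursor_batch_size = cfg.get("cursor-batch-size", 1000)

    problems = []
    if cursor_batch_size > max_pagesize:
        # rule stated in the configuration comments
        problems.append("cursor-batch-size should be <= max-pagesize")
    if default_pagesize > max_pagesize:
        # assumption: a default larger than the maximum makes no sense
        problems.append("default-pagesize should be <= max-pagesize")
    return problems


tuned = {"default-pagesize": 25, "max-pagesize": 100, "cursor-batch-size": 100}
print(check_read_performance(tuned))

# lowering only max-pagesize leaves the default batch size of 1000 too large
print(check_read_performance({"max-pagesize": 100}))
```

The second call shows a mistake that is easy to make: lowering `max-pagesize` alone leaves the batch size above the maximum page size.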

Musikolo commented 6 years ago

Thank you so much for the prompt commit!

I understood what you meant in your previous post, but I guess you meant I should use:

default-pagesize: 25
max-pagesize: 100
cursor-batch-size: 100

I'm looking forward to the next release being available so I can test it out. Obviously, I prefer a clean and configurable solution like the one you just implemented, instead of something hard-coded as I did.

To test further how a lower cursor batch size affects performance, this morning I deployed the new JAR I built to our production environment, and the performance boost is even larger than in our test environment. The average throughput has gone from 30 req/s to ~230 req/s. This is more than 7.5 times faster!! To be fair about the performance outcome, we also upgraded the hardware to a cluster with more RAM and CPU.

Thank you so much again for your great support!

mkjsix commented 6 years ago

We have just released RESTHeart 3.4.0.

Docker images are here.

mkjsix commented 6 years ago

@Musikolo In case you have time to write something, we are always looking for blog posts about RESTHeart, how people use it and why.