Getting the number of documents in a collection: slow query with the _size keyword

tsakiridis commented 3 years ago

Thanks for this great project.

In my team, we want to count the number of documents in a collection. To do so, we use the _size keyword e.g. GET /collection/_size, per this. But as the number of documents in the collection increased, the GET request takes increasingly longer to complete.

Expected Behavior

The _size keyword should return relatively quickly the number of documents. Or, there should be a way to quickly get the number of documents in a collection (without any filtering).

Current Behavior

We have approximately ~2500 documents of ~5 MB each, and the GET request to the _size keyword takes approximately 30 to 40 seconds to complete. When connected to the mongo console, db.collection.count() finishes almost instantaneously.

Context

We are trying to get the count of all documents in the collection (without any filtering). The slow response time incurs delays in our processes.

Environment

In our development and production environments we use RESTHeart 5.1.5. We've tested this using RESTHeart 5.4.1 to connect to the same db and the delay is the same. Operating System is Ubuntu 20.04 LTS (with all updates), Mongo v4.4.6.

Steps to Reproduce

Not relevant? Let me know if you need any further info.

Possible Implementation

Are we using the wrong endpoint to get the number of documents in a collection? Is such a large delay normal?

ujibang commented 3 years ago

Can you please try specifying the pagesize=0 query parameter as follows:

GET /coll/_size?pagesize=0

The reason is that the GET /coll/_size request returns also 100 documents (in your case, this is 500Mbyte of data!!!) when using the HAL representation format as in GET /coll/_size?rep=hal

This is why internally RESTHeart also retrieves the documents, that are not included in the response in case of the standard representation.

If you confirm that this works, we will fix this bug by omitting this redundant data retrieval when using the standard representation.

tsakiridis commented 3 years ago

Thanks very much for the prompt reply, I appreciate it.

Unfortunately this does not work. Testing from localhost for example yields the following line in the log file:

13:09:17.349 [XNIO-1 task-2] INFO org.restheart.handlers.RequestLogger - GET http://localhost:8082/collection/_size?pagesize=0 from /127.0.0.1:43250 => status=200 elapsed=38765ms contentLength=14 username=admin roles=[admin]

The request still takes ca. 39 seconds to complete. I can only assume that internally in this request RESTHeart probably queries the data before returning only {"_size":2608}, which is the size of our collection.

I also tried various combinations of _size with projection to only perhaps GET one key from the document (e.g. the id), but as I was just shooting in the dark it was in vain.

Is there anything else I can try?

Thanks again.

ujibang commented 3 years ago

Hi @tsakiridis,

I double checked and the potential problem I was describing (the need of ?pagesize=0) is actually not a problem.

The request GET /coll/_size does not retrieve any document from mongodb, it only executes coll.countDocuments()

This is very weird, counting 2600 docs should not require that much!

Can you confirm that you are not specifying a filter on your request (you want to count all documents)?

tsakiridis commented 3 years ago

Thanks for taking a closer look into this.

Can you confirm that you are not specifying a filter on your request (you want to count all documents)?

Yes, indeed, I confirm this. I test this also by running a curl command on localhost like so:

curl --user admin:secret http://localhost:8082/collection/_size

We only wish to count all documents in this collection without any filtering whatsoever.

On other, smaller, collections this command executes very fast.

As a side note, on another collection with ca. 1500 documents of ~1MB each (on average) this takes roughly 8 seconds to complete.

Let me know if there is anything else I can provide you with to assist you.

ujibang commented 3 years ago

Hi @tsakiridis,

from the mongo shell can you please try:

db.collection.countDocuments()

and compare the results with:

db.collection.count()

RESTHeart uses db.collection.countDocuments() that (from mongodb doc page):

Unlike db.collection.count(), db.collection.countDocuments() does not use the metadata to return the count. Instead, it performs an aggregation of the document to return an accurate count, even after an unclean shutdown or in the presence of orphaned documents in a sharded cluster.

tsakiridis commented 3 years ago

Hello, and thanks again for investigating this. Indeed this seems to be the case causing the delay.

db.collection.count() is instantaneous, whereas db.collection.countDocuments({}) takes approximately ~37 seconds for our collection to return the true size. I also confirm that db.collection.aggregate([ { $count: "myCount" }]) per this also takes ~37 seconds to complete in our collection (even though we don't use a sharded cluster).

Per the documentation of mongodb I understand that it says that db.collection.count() should be avoided for the same reasons you quote above, but at least in our use case the approximate number of count based on the metadata is good enough.

Is it possible to add a query argument to specify if _size should return this approximate count based on the metadata?

ujibang commented 3 years ago

Yes we'll do it (in 6.x e 5.x branches).

something like

GET /coll/_size?estimate

this will use db.collection.estimatedDocumentCount() rather then db.collection.countDocuments(). See count documents java driver doc page

Note: estimatedDocumentCount() returns an estimation of the number of documents in the collection based on the collection metadata. You cannot specify a query when using this method.

ujibang commented 2 years ago

Added to RESTHeart v7 roadmap https://github.com/SoftInstigate/restheart/discussions/425

ujibang commented 2 years ago

After some analysis we found an easier way to get the estimate count that won't require modifying the API.

A simple aggregation using the $collStats operator can be defined:

{
  "aggrs": [
    {
      "stages": [{ "$collStats": { "count": {} } },  { "$project": { "count": 1 } } ],
      "type": "pipeline",
      "uri": "estimateCount"
    }
  ]
}

This allows to get the estimate count, e.g.

$ http -a admin:secret :8080/coll/_aggrs/estimateCount

[
    {
        "count": 94
    }
]

SoftInstigate / restheart