cloudant / python-cloudant

A Python library for Cloudant and CouchDB
Apache License 2.0

Query too slow compared to curl #389

VersusF closed this issue 6 years ago

VersusF commented 6 years ago

I'm trying to execute a query, but it takes more than a minute using the Query class. The same query is very fast (2-3 sec) when executed with curl.

I actually do another query with your library and that one is very fast. I hope you can solve this; thank you for all the brilliant work.
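(For context, the original curl command and Python snippet are not quoted in this thread. A minimal sketch of the kind of comparison being described, assuming a placeholder account, database and selector, might look like this; none of the names below are from the original issue.)

# Hypothetical reconstruction: the slow path using the Query class directly.
# The account, database name, selector and fields are placeholders.
from cloudant.client import Cloudant
from cloudant.query import Query

client = Cloudant("user", "password", url="https://account.cloudant.com", connect=True)
db = client["mydb"]

query = Query(db, selector={"type": "event"}, fields=["_id"])
for doc in query.result:
    pass  # lazily iterates the result, fetching documents page by page

# The fast curl equivalent is a single POST to /mydb/_find with the same
# selector and fields, plus "limit": 5000.
client.disconnect()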

ricellis commented 6 years ago

I notice that your curl query is using a limit of 5000, but that you don't have the same limit applied to your python-cloudant query. I wouldn't expect it to make a difference, but it would be good to know that we are comparing exactly the same query before digging in any deeper.

VersusF commented 6 years ago

Thank you for the fast answer: I added the limit parameter to the query, but in order to iterate over the results I had to call Query(...).result.all(). That brought the total time down to 3 sec (just like the curl version), but maybe this is not the optimal solution. From the profilehooks output there are now 57 calls to {method 'recv' of '_socket.socket' objects}, with an average time of 0.054, so that is probably where the time goes.

Anyway, thank you again for the hint and the support.
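(A sketch of what this change might look like, assuming the same placeholder selector as above; the limit of 5000 matches the curl request, and the profilehooks decorator is just one way to reproduce the timing output mentioned here.)

from profilehooks import profile
from cloudant.query import Query

@profile
def count_ids(db, selector):
    # Set limit to match the curl request, then materialise the whole
    # result set at once with all() instead of iterating lazily.
    query = Query(db, selector=selector, fields=["_id"], limit=5000)
    return len(query.result.all())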

ricellis commented 6 years ago

We normally expect queries to be executed via: db.get_query_result(...)

There are some documented examples of iterating docs returned from it.

Note that the default page_size is 100, and that skip and limit are set internally based on this to page results in a memory-efficient way (at the expense of making more requests). If you are unconcerned about the memory usage of having more docs at once (especially because you are only using the _id field) you can increase the page_size, reducing the number of requests needed.

I suspect that your use of the (internal) Query class above used limit=100 and that explicitly setting the limit higher reduced the number of requests needed to fetch all the query results. It is better to use db.get_query_result as that's really the API for this; for your case something like this should work:

count = 0
docs = db.get_query_result(selector, fields=["_id"], page_size=5000, use_index="78efbd1fbc0663b7953309184e9c6b3b0c1ca965")
for doc in docs:
    count += 1  # count the matching documents

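(As a rough illustration of the trade-off being described: the number of _find round trips needed to drain a result set scales inversely with page_size. The document count below is made up.)

import math

matching_docs = 250000                     # hypothetical result-set size
for page_size in (100, 1000, 5000):
    requests_needed = math.ceil(matching_docs / page_size)
    print(page_size, requests_needed)      # 100 -> 2500 requests, 5000 -> 50
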
VersusF commented 6 years ago

This is probably the solution. I did not know about the get_query_result method. Changing page_size actually changes the query time too. Thank you again for the support; I think this issue can be closed now.