cloudant / python-cloudant

A Python library for Cloudant and CouchDB
Apache License 2.0
163 stars 55 forks source link

proposal: a faster result iteration #436

Closed aogier closed 5 years ago

aogier commented 5 years ago

Please read these guidelines before opening an issue.

Bug Description

Iterating over a large dataset is sometimes slow. This is not a library disfunction but is related to how couchdb works, ie.

# take an acceptable time
SKIP=0
time curl -sH 'Content-Type: application/json' \
    http://localhost:5984/db/_design/dd/_view/vv\?limit\=500\&skip\=$SKIP\&group\=true

# this will take ages
SKIP=820000
time curl -sH 'Content-Type: application/json' \
    http://localhost:5984/db/_design/dd/_view/vv\?limit\=500\&skip\=$SKIP\&group\=true

So this is inherently true for your library when using current Result class logic. I've noticed that the different pagination approach manual propose does not incur in this slowdown.

I drafted a switch implementation that can allow to iterate based on startkey instead of skip and you can see it here, as you see is quite simple and yes is badly coded :grimacing: because is a PoC and I left original implementation under the new alternative.

What do you think? This is maybe somehow related to #389 or not, anyway for me it's a complete game changer that I'd like to have upstream.

Thank you, regards

smithsz commented 5 years ago

Hey, your proposal looks good. Would it be possible to amend the implementation slightly so that you get an extra result each time and use that to set the next startkey?

(Obviously returning all results if it's the last page, i.e. len(result) < limit)

This would mimic what we do in java-cloudant, see here.

aogier commented 5 years ago

ah sure you're right, dicts are passed by reference and user could change it when consuming generator!

Ok I can work on a real implementation, do you think this should become the default (only) behaviour or the limit/skip approach has to stay ? I'd personally go for the latter, since I'm constantly hitting this issue, even if as per this couchdb bug (last comments lead here) problem should not happens in post-1.2 releases... the only downside I can find with a startkeyimplementation is the longer url one have to fetch (more data on the wire etc).

smithsz commented 5 years ago

PR looks good. I think we can make this the default behaviour. It should be a big performance improvement when iterating over a large result set.

aogier commented 5 years ago

great! Thank you!