PagerDuty / pdpyras

Low-level PagerDuty REST/Events API client for Python
MIT License

Adding parallel calls for iter_all #43

Closed scott-shields-github closed 3 years ago

scott-shields-github commented 3 years ago

When attempting to retrieve all records from an endpoint with a large record set, the iter_all function takes a good deal of time because it makes the API calls synchronously. With this change, all calls after the first request are parallelized to speed things up. A new parameter, max_threads, allows callers to tune the concurrency for their environment and needs. To ensure backwards compatibility, max_threads has a sensible default and iter_all still returns an iterator.

Examples Before:

>>> print(timeit.timeit(setup="from pdpyras import APISession;p = APISession(<TOKEN>)", stmt="s = list(p.iter_all('services'));print(len(s))", number=1))
1185
12.433564969999992
>>> print(timeit.timeit(setup="from pdpyras import APISession;p = APISession(<TOKEN>)", stmt="s = list(p.iter_all('services'));print(len(s))", number=5))
1185
1185
1185
1185
1185
59.51973305299998

After:

>>> print(timeit.timeit(setup="from pdpyras import APISession;p = APISession(<TOKEN>)", stmt="list(p.iter_all('services', max_threads=10))", number=1))
3.109743124000005
>>> print(timeit.timeit(setup="from pdpyras import APISession;p = APISession(<TOKEN>)", stmt="s = list(p.iter_all('services', max_threads=8));print(len(s))", number=5))
1185
1185
1185
1185
1185
15.51525755900002
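The speedup comes from fetching pages concurrently instead of one GET at a time. A rough, self-contained sketch of the idea (the dataset and fetch_page below are simulated stand-ins, not the actual pdpyras implementation; only the max_threads parameter mirrors the PR):

```python
# Sketch of the parallel-pagination approach: fetch all page offsets
# concurrently with a bounded thread pool, then yield records in order.
from concurrent.futures import ThreadPoolExecutor

PAGE_SIZE = 100
DATASET = list(range(1185))  # simulated full record set (same size as the example)

def fetch_page(offset):
    """Stand-in for one GET .../services?offset=<offset>&limit=100 call."""
    return DATASET[offset:offset + PAGE_SIZE]

def iter_all_parallel(total, max_threads=10):
    """Fetch remaining pages concurrently; max_threads caps the pool size."""
    offsets = range(0, total, PAGE_SIZE)
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        # pool.map preserves input order, so records come back sorted by offset
        for page in pool.map(fetch_page, offsets):
            for record in page:
                yield record

records = list(iter_all_parallel(len(DATASET), max_threads=10))
print(len(records))  # 1185
```

In the real client the first request is still made synchronously to learn the total record count; the sketch takes that total as a parameter instead.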
Deconstrained commented 3 years ago

Hello @scott-shields-github, this looks brilliant! Sorry I've not gotten around to reviewing and testing. This weekend and in down time before then I plan to devote a few hours to upkeep of this project; more comments to follow.

Deconstrained commented 3 years ago

Hi @scott-shields-github,

I again apologize for not having made more time for this. I've been reviewing it in more detail and there are a few things that might cause issues downstream.

It looks like iter_all's behavior fundamentally changes in a way that may cause issues with certain use cases. Fetching entire datasets may happen faster, but consider, for example, searching for an incident log entry in the recent past that might sit somewhere in the neighborhood of offset=500&limit=100. With the parallelized fetch-all-then-iterate approach, even though the entire dataset may be fetched faster, log entries can easily number in the thousands, which means the call to iter_all could hang for a very long time before yielding anything. Whereas fetching the next page on demand, once the current page's results have all been yielded, means making only six GET requests in this scenario.
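The early-exit scenario can be illustrated with a toy lazy iterator (names and the simulated dataset are illustrative, not the pdpyras API): a consumer that stops after finding a record near offset 500 triggers only the six page fetches it actually needs.

```python
# On-demand pagination: one page per request, fetched only when the
# previous page's records have all been yielded.
PAGE_SIZE = 100
requests_made = 0

def fetch_page(offset):
    """Stand-in for one GET; counts requests so we can see the savings."""
    global requests_made
    requests_made += 1
    return [{"id": i} for i in range(offset, offset + PAGE_SIZE)]

def iter_all_lazy(total):
    for offset in range(0, total, PAGE_SIZE):
        for record in fetch_page(offset):
            yield record

# Search a 5000-record dataset for an entry near offset=500:
for entry in iter_all_lazy(5000):
    if entry["id"] == 537:
        break
print(requests_made)  # 6 -- pages at offsets 0, 100, ..., 500
```

A fetch-everything-first implementation would have made all 50 requests before yielding the first record.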

It would be a safer approach, I believe, to make this a separate method, and then refactor list_all, dict_all, etc. to use the new parallelized index fetch method, since the assumption in those methods is that the full dataset should be fetched anyway. If you have time, the client would greatly benefit from such an improvement.
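One way the suggested split could look, sketched with a toy session class (the method names, Session class, and in-memory dataset are all illustrative assumptions, not the final pdpyras API): iter_all stays lazy, a new method does the parallel fetch, and the full-dataset helpers delegate to it.

```python
# Sketch of the refactor: lazy iter_all unchanged; parallel fetching lives
# in its own method; list_all builds on the parallel path.
from concurrent.futures import ThreadPoolExecutor

class Session:
    PAGE_SIZE = 100

    def __init__(self, records):
        self._records = records  # stands in for the remote dataset

    def _fetch_page(self, offset):
        return self._records[offset:offset + self.PAGE_SIZE]

    def iter_all(self, total):
        """Unchanged lazy iterator: one page per request, on demand."""
        for offset in range(0, total, self.PAGE_SIZE):
            yield from self._fetch_page(offset)

    def iter_all_parallel(self, total, max_threads=10):
        """New method: fetch every page concurrently, yield in order."""
        offsets = range(0, total, self.PAGE_SIZE)
        with ThreadPoolExecutor(max_workers=max_threads) as pool:
            for page in pool.map(self._fetch_page, offsets):
                yield from page

    def list_all(self, total, **kw):
        """Full-dataset helpers can safely take the parallel path."""
        return list(self.iter_all_parallel(total, **kw))

s = Session(list(range(250)))
print(len(s.list_all(250)))  # 250
```

This keeps the early-exit behavior of iter_all intact for search-style consumers while giving list_all and dict_all the speedup.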

Another option is we could try to come up with a way to asynchronously fetch, but it might get complicated, e.g. yielding in the correct order.

scott-shields-github commented 3 years ago

No worries, yeah, that makes sense. I'll work on moving this to its own method for those reasons.