PagerDuty / pdpyras

Low-level PagerDuty REST/Events API client for Python
MIT License
129 stars 29 forks

Connection time-out with list_all('schedules') with thousands of schedules. #64

Closed: JamesKIBM closed this issue 3 years ago

JamesKIBM commented 3 years ago

Using: session.list_all('schedules')

I get:

...
HTTP or network error: ReadTimeout: HTTPSConnectionPool(host='api.pagerduty.com', port=443): Read timed out. (read timeout=5); retrying in 12 seconds.

This repeats until the request fails for exceeding the maximum number of attempts to connect to the API.

This only happens when making the request against a PagerDuty account with thousands of schedules. The same request works fine on an account with only hundreds of schedules, and the same style of request works fine for other endpoints (object types).

The same call using requests.get() with paging works fine. I couldn't replicate the issue after copying headers and so on.
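For reference, a minimal sketch of that requests.get paging workaround (the API key is a placeholder, the 30-second timeout and helper names are illustrative, and none of this is pdpyras API):

```python
import requests

def next_offset(body, offset, limit):
    """Return the next page offset, or None when the response says there are no more pages."""
    return offset + limit if body.get("more") else None

def list_all_schedules(api_key, limit=100, timeout=30):
    """Page through /schedules with plain requests, mirroring what iter_all does."""
    headers = {
        "Authorization": f"Token token={api_key}",
        "Accept": "application/vnd.pagerduty+json;version=2",
    }
    schedules, offset = [], 0
    while offset is not None:
        resp = requests.get(
            "https://api.pagerduty.com/schedules",
            headers=headers,
            params={"limit": limit, "offset": offset},
            timeout=timeout,  # generous read timeout; pdpyras defaults to 5 seconds
        )
        resp.raise_for_status()
        body = resp.json()
        schedules.extend(body["schedules"])
        offset = next_offset(body, offset, limit)
    return schedules
```

The loop stops when the API's `more` flag is false, which is the same termination condition pdpyras uses when iterating index endpoints.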

Thank you.

Deconstrained commented 3 years ago

Hi @JamesKIBM ,

Just to be sure, can you confirm that there's no timeout when explicitly setting the page size to 100 in the parameters (limit=100) during a direct requests.get versus making the request with APISession.iter_all?

The error message was produced by the underlying HTTP client, and pdpyras does not change the read timeout setting, so I am inclined to think that it has more to do with the server taking a much longer time to respond for larger page sizes (the default in iter_all is 100, whereas the default when unspecified is 25).

JamesKIBM commented 3 years ago

Hi @Deconstrained,

My workaround with requests.get does use limit=100 and there is no issue with it.

Deconstrained commented 3 years ago

@JamesKIBM Thank you for confirming! I'm looking into reproducing this issue.

Deconstrained commented 3 years ago

@JamesKIBM I was able to reproduce the timeout issue in a large test account. However, I was not able to reproduce the condition where making a single request to the schedules index via session.get('schedules', params={'limit':100}) does not result in a timeout whereas using iter_all does. Both methods resulted in a network read timeout.

Something I think might help here, and that I'd be glad to implement, is a new property in the session class for specifying a custom timeout value for requests.Session.request, seeing as the default is just five seconds. There is not yet any way of setting a timeout in iter_all, although one can pass it to APISession.get as a keyword argument that is then forwarded to requests.Session.request.
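One way the proposed session-level property could work, sketched as a plain requests.Session subclass rather than actual pdpyras code (the class name and the 30-second default are assumptions):

```python
import requests

class DefaultTimeoutSession(requests.Session):
    """requests.Session that applies a configurable default read timeout,
    similar to the property proposed above (the default here is illustrative)."""

    def __init__(self, timeout=30):
        super().__init__()
        self.timeout = timeout

    def request(self, method, url, **kwargs):
        # Fill in the timeout only when the caller didn't pass one explicitly,
        # so per-call overrides still work.
        kwargs.setdefault("timeout", self.timeout)
        return super().request(method, url, **kwargs)
```

With this in place, every call such as session.get('https://api.pagerduty.com/schedules', params={'limit': 100}) would use the session-wide timeout unless a per-call timeout keyword is supplied.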

JamesKIBM commented 3 years ago

@Deconstrained To clarify, if I use requests.get() with limit=100 it works fine. And if I use session.rget('schedules') with the default of limit=25 it works fine. But I just tested session.rget('schedules', params={'limit': 100}), and it fails like it did for you!

Just tested your proposal, and doing:

pdpyras.TIMEOUT = 30
session.rget('schedules', params={'limit': 100})

Does work! However, this seems like an unusually long time for a request to complete, right? A workaround is to extend the TIMEOUT value, but the real fix may be for PagerDuty to investigate why the /schedules endpoint has a meaningfully longer response time than other endpoints. (A quick read shows me that requests itself does not apply a default timeout while waiting for a response.)

Deconstrained commented 3 years ago

@JamesKIBM Thank you! I recognize this as a scalability issue as well as a limitation in the client. I've documented and brought to the attention of the team owning the Schedules API this need for better scalability in larger accounts.

Final questions: do you use this library with Python 2.7, and is this issue severely impacting any of your workflows? If so, it would be easy to release one final version with the configurable timeout before support for 2.7 ends on June 21st.

JamesKIBM commented 3 years ago

@Deconstrained I haven't used Python 2.7 in a while! No worry there.

Removing or increasing the TIMEOUT in the request works fine for now. Hopefully the team improves that endpoint's response time as well.

I'll go ahead and close the issue - feel free to comment/reopen. Thank you for the support and helpful library!

Deconstrained commented 3 years ago

My goodness, I was obtuse at first about the preexisting TIMEOUT property and hadn't even remembered that the feature was already there. That was @ctrlaltdel's contribution (#48). At any rate, I'm glad there is a working solution!