I’m not convinced requests isn’t good enough. But if there’s interest in other libraries I think httpx is popular at the moment. https://github.com/encode/httpx
It also supports async/await, so it could offer some optimization there (at the cost of added complexity).
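For illustration, here's a rough sketch of what async usage with httpx could look like (the endpoint and credentials are placeholders, not pypco internals, and `http2=True` needs the optional `h2` package installed):

```python
import asyncio

import httpx

async def fetch_people(person_ids):
    # A single AsyncClient reuses connections across requests; with http2=True
    # (requires the optional "h2" extra) it can also multiplex over HTTP/2.
    async with httpx.AsyncClient(
        http2=True, auth=("app_id", "secret")  # placeholder PCO credentials
    ) as client:
        tasks = [
            client.get(f"https://api.planningcenteronline.com/people/v2/people/{pid}")
            for pid in person_ids
        ]
        return await asyncio.gather(*tasks)

# responses = asyncio.run(fetch_people([1, 2, 3]))
```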
I’m not saying requests isn’t good enough per se; we could totally leverage the old HTTP/1.1 Keep-Alive, and the underlying connection pooling capability in requests will work.
I’ll keep it like that for now, add the necessary changes, and test it out. Based on past experience I’d expect about a 2-3x performance improvement.
2-3x sounds pretty sweet.
@mphuff You raise some great points here. I didn't realize that the PCO API was available over an HTTP/2 interface. 🙂
At minimum, I think you're absolutely right that we need to enable connection pooling. Considering we're rate limited to 100 requests/20 seconds, I'm not sure how much of a difference this will make for large batch operations; from my experience, I would think rate limits would generally dwarf any transport-level overhead. However, for small/short bursts of activity that don't actually hit the rate limit, I would definitely think this will make an appreciable difference.
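As an aside, whatever transport we end up on still has to play nicely with that limit. A minimal sketch of backing off on a 429 (assuming the API returns a Retry-After value in seconds, which is how I read the rate limiting docs) could look like:

```python
import time

import requests

def get_with_backoff(session: requests.Session, url: str, **kwargs) -> requests.Response:
    """GET a URL, sleeping and retrying whenever the API answers 429."""
    while True:
        resp = session.get(url, **kwargs)
        if resp.status_code != 429:
            return resp
        # Assumes a Retry-After header with seconds; fall back to a short sleep.
        time.sleep(int(resp.headers.get("Retry-After", 5)))
```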
Regarding HTTP/2, I don't think there's a library for Python that is out of alpha or beta status yet. With that in mind, I think I'd like to stay with requests and HTTP/1.1 until we have an HTTP/2 library available that's stable. Then it probably makes sense to move (and that might involve breaking changes, which would push us to a v2 release of the wrapper library).
The requests session object supports connection pooling using urllib3 under the hood, so I went ahead and pushed up a branch that's using this: https://github.com/billdeitrick/pypco/tree/keepalive-testing
I need to do some more testing to see what sort of performance gains this gives us and make sure it's actually pooling connections as expected. It would be great if either or both of you wanted to test this too and let me know what you find out.
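If you want to kick the tires before pulling the branch, the change is essentially just routing calls through a shared Session instead of module-level requests calls; a minimal sketch (URL and credentials below are placeholders, not the actual pypco internals):

```python
import requests

# One Session keeps TCP/TLS connections alive between requests
# (urllib3 connection pooling under the hood), instead of opening
# a fresh connection for every call.
session = requests.Session()
session.auth = ("app_id", "secret")  # placeholder PCO credentials

for pid in range(1, 100):
    resp = session.get(f"https://api.planningcenteronline.com/people/v2/people/{pid}")
    resp.raise_for_status()
```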
Ok, so here are some initial results. For these tests, I'm making 99 GET requests against a single person object in People (each result is the avg. of 5 runs):
- Current release: 18.02s
- HTTP/1.1 connection pooling (requests session): 8.05s
- HTTP/2 via hyper's requests adapter: 9.36s

(For the HTTP/2 test I got one result that was significantly higher; I threw that one out because it seemed like there may have been something else at play.)
Tomorrow I'm going to test with a larger number of requests (I'm expecting we won't see as much of a difference once rate limiting comes into play).
TL;DR: enabling connection pooling was trivial, and we need to do it.
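For anyone who wants to reproduce the comparison, a timing harness along these lines should show the same shape of results (a sketch, not the exact script; the endpoint and credentials are placeholders):

```python
import time

import requests

URL = "https://api.planningcenteronline.com/people/v2/people/1"  # placeholder person ID
AUTH = ("app_id", "secret")  # placeholder PCO credentials

def run_without_pooling(n: int = 99) -> float:
    start = time.perf_counter()
    for _ in range(n):
        requests.get(URL, auth=AUTH)  # new connection on every request
    return time.perf_counter() - start

def run_with_pooling(n: int = 99) -> float:
    session = requests.Session()
    session.auth = AUTH
    start = time.perf_counter()
    for _ in range(n):
        session.get(URL)  # reuses the pooled, kept-alive connection
    return time.perf_counter() - start

print(f"no pooling: {run_without_pooling():.2f}s / pooling: {run_with_pooling():.2f}s")
```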
@billdeitrick I came to realize that the rate limiting on the PCO API was actually the bulk of my performance issue, but nonetheless I am seeing results similar to yours. My ETL was running ~190s w/o connection keep-alive and ~115s w/ connection keep-alive.
To your point, it isn't as drastic as your results, but it's similar enough that it's still worth the effort.
I made the exact same change you did when I had it up and running locally, so at least we know the results are consistent between the two of us.
Is the rate limit tied to the IP address or the API key?
Just released v1.1.0 which includes this! Thanks @mphuff!
@pastorhudson I think the docs say that rate limiting is per user: https://developer.planning.center/docs/#/introduction/rate-limiting
The Planning Center API is exposed over an HTTP/2 interface. The requests library is designed to take advantage of HTTP/1.1 technologies.
As such, I see two options for taking advantage of the requests session:
(2) is a less ideal option given that HTTP/2 is built for multiplexing and the majority of use cases for this library will involve a series of requests. (https://blog.newrelic.com/engineering/http2-best-practices-web-performance/)
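For reference, a rough sketch of what the HTTP/2 route could look like today, using hyper's requests transport adapter (hyper is still pre-1.0/alpha quality, and the credentials below are placeholders):

```python
import requests
from hyper.contrib import HTTP20Adapter

# Mount hyper's HTTP/2 transport adapter for the PCO host; everything else
# goes through the session exactly as it would with plain requests.
session = requests.Session()
session.mount("https://api.planningcenteronline.com", HTTP20Adapter())
session.auth = ("app_id", "secret")  # placeholder PCO credentials

resp = session.get("https://api.planningcenteronline.com/people/v2/people/1")
print(resp.status_code, resp.headers.get("content-type"))
```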