celiao / tmdbsimple

A wrapper for The Movie Database API v3.
GNU General Public License v3.0
579 stars 122 forks source link

HTTP requests concurrency problems v2.7.0 #77

Closed Archmonger closed 3 years ago

Archmonger commented 3 years ago

urllib3 by default limits concurrent requests to 10.

So for tmdbsiple, if you have multiple threads executing, you'll get WARNING urllib3.connectionpool: Connection pool is full, discarding connection.

I'd assume that it's related to the new global REQUESTS_SESSION, as this wasn't an issue on v2.6.0.

p0psicles commented 3 years ago

tmdbsimple was already using a session from requests. The change now only made it possible to optionally provide your own session object

Archmonger commented 3 years ago

Just looked over the code for base.py, you're right. I'll look over a diff between 2.6.0 and 2.7.0 to trace down why those urlib3 warnings don't occur on 2.6.0 for me.

Is there any documentation on how to provide my own requests session?

Archmonger commented 3 years ago

I've gone ahead and confirmed my initial theory.

On the old version of def _requests(x), a new requests.request instance was being created for every single call to _request(). So one thread gets one requests instance.

The new version uses a shared requests.Session (stored as a class variable). Every single thread has to communicate through this one session. This can become a major bottleneck in multithreaded applications since a requests.Session object can only have 10 concurrent connections by default.

I've highlighted the issue below with a comment.

Old version: TMDB Simple 2.6.0

    def _request(self, method, path, params=None, payload=None):
        url = self._get_complete_url(path)
        params = self._get_params(params)

        # Creates a new requests session for every connection, great for multithreaded applications
        response = requests.request(
            method, url, params=params,
            data=json.dumps(payload) if payload else payload,
            headers=self.headers)

        response.raise_for_status()
        response.encoding = 'utf-8'
        return response.json()

New version: TMDB Simple 2.7.0

    def _request(self, method, path, params=None, payload=None):
        url = self._get_complete_url(path)
        params = self._get_params(params)

        # This line can be problematic since it's using the same connection pool for every thread
        response = self.session.request(
            method, url, params=params,
            data=json.dumps(payload) if payload else payload,
            headers=self.headers)

        response.raise_for_status()
        response.encoding = 'utf-8'
        return response.json()

The following solutions are available

1) If no requests session object is provided at tmdbsimple __init__, default to the old way of handling request instances. Comment: Best implementation to allow for developer flexibility. 2) Revert to the previous method of handling requests. Comment: Cleanest implementation, but prevents the developer from providing their own requests session. 3) Have a way (perhaps a class argument) for the user to easily increase the connection pool size for the default request session. Comment: Would be a minor annoyance to developers, but would work. 4) Note this as a limitation and provide documentation to change the requests session object with one that contains a larger connection pool. Comment: Would be an annoyance to developers, but would work. 5) Recommend all users of tmdbsimple to create a new tmdbsimple class upon starting every single thread. Comment: This would have a performance hit, but would work.

Let me know which solution is preferred, if development is needed I can do a PR.

p0psicles commented 3 years ago
  1. Provide your own request session object. And configure your desired connection pool size?

Then no changes required? I think that provides same flexibility as 1.

Archmonger commented 3 years ago

Your number 6 is my number 4 option. As of right now, your request session logic is an undocumented change that developers have no way of knowing about unless they ask, or manually inspect the source code. If they stumble across this issue in the future they're sure to open a GH issue just like me.

But number 4 does not provide the option of scalability. It will require the developer to continuously update their source every time their app gets more popular and needs a larger connection pool.

Archmonger commented 3 years ago

I just thought of another option...

  1. Instead of allowing the user to pass in a requests object, allow them to pass in requests session configuration parameters. A new sessions object would be created every call of def _request(x) and these configuration parameters (as **kwargs) will be applied every time.

Would that help in your use case where you needed to pass in a specific requests object?

p0psicles commented 3 years ago

I'm using the session to configure a proxy. So in user space I create the session object and optionally hook the proxy configuration to it. Then I pass the session object.

I don't think your solution would align well in my app. I think 1. Would be the best concession for both of us.

Archmonger commented 3 years ago

Glad we could reach a consensus 👍. I'll develop that tomorrow and open a PR.

Archmonger commented 3 years ago

PR has been made.

Implements option 1, and within readme.md I've documented your REQUESTS_SESSION change made in v2.7.0

celiao commented 3 years ago

@Archmonger @p0psicles Thanks for your discussion and examination of options! I agree option 1. is the best and most flexible for all users. Option 1. also keeps tmdbsimple simple, which is a guiding principle of the design.

I will look over the PR, give it a test, and merge, if ready to go.

Archmonger commented 3 years ago

@celiao Can you update the revision to 2.7.1 for the sake of pypi?

Thanks!

celiao commented 3 years ago

Ah yes. I updated the version to 2.8.0. 2.7 always bugged me because it looked like a Python version. ;)