betamaxpy / betamax

A VCR imitation designed only for python-requests.
https://betamax.readthedocs.io/en/latest/
Other
567 stars 62 forks source link

gethostbyname slowing things down on OS X #96

Closed bboe closed 7 years ago

bboe commented 8 years ago

I finally got annoyed that my integration tests would periodically slow down. After running a profiler I saw that socket.gethostbyname was the bottleneck. I adapted the betamax example to show the stacktrace that shows that socket.gethostbyname is indeed being called. I'm guessing that it probably shouldn't be.

../.venv/pc/lib/python2.7/site-packages/requests-2.9.1-py2.7.egg/requests/sessions.py:480: in get
    return self.request('GET', url, **kwargs)
../.venv/pc/lib/python2.7/site-packages/requests-2.9.1-py2.7.egg/requests/sessions.py:459: in request
    prep.url, proxies, stream, verify, cert
../.venv/pc/lib/python2.7/site-packages/requests-2.9.1-py2.7.egg/requests/sessions.py:617: in merge_environment_settings
    env_proxies = get_environ_proxies(url) or {}
../.venv/pc/lib/python2.7/site-packages/requests-2.9.1-py2.7.egg/requests/utils.py:562: in get_environ_proxies
    if should_bypass_proxies(url):
../.venv/pc/lib/python2.7/site-packages/requests-2.9.1-py2.7.egg/requests/utils.py:551: in should_bypass_proxies
    bypass = proxy_bypass(netloc)
/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py:1487: in proxy_bypass
    return proxy_bypass_macosx_sysconf(host)
/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py:1454: in proxy_bypass_macosx_sysconf
    hostIP = ip2num(hostIP)

Here's the example

from betamax import Betamax
from requests import Session
from unittest import TestCase

with Betamax.configure() as config:
    config.cassette_library_dir = '.'

import socket
# comment the next line out when creating the cassette
socket.gethostbyname = lambda x: None

class TestGitHubAPI(TestCase):
    def setUp(self):
        self.session = Session()

    # Set the cassette in line with the context declaration
    def test_repo(self):
        with Betamax(self.session).use_cassette('repo'):
            resp = self.session.get(
                'https://api.github.com/repos/sigmavirus24/github3.py'
                )
            assert resp.json()['owner'] != {}

socket.gethostbyname does not get called when I tested from a linux machine.

--- Want to back this issue? **[Post a bounty on it!](https://www.bountysource.com/issues/31935842-gethostbyname-slowing-things-down-on-os-x?utm_campaign=plugin&utm_content=tracker%2F198445&utm_medium=issues&utm_source=github)** We accept bounties via [Bountysource](https://www.bountysource.com/?utm_campaign=plugin&utm_content=tracker%2F198445&utm_medium=issues&utm_source=github).
sigmavirus24 commented 8 years ago

Note this line in the traceback:

    return proxy_bypass_macosx_sysconf(host)

Also note that this is code that happens in requests prior to betamax taking control over the request/response flow.

Short of monkey-patching the standard library on a very specific operating system only when a cassette has already been recorded seems like overkill and out of the scope of this project to me.

Do you agree?

bboe commented 8 years ago

Also note that this is code that happens in requests prior to betamax taking control over the request/response flow.

I hadn't realized that. From my perspective, I'm calling get after associating betamax with the session. Thus it seems reasonable that there would be no network side effects when a cassette is present.

Rather than monkey patching the standard library, perhaps betamax could augment requests to bypass the check for proxies when a cassette is present.

With that said, I understand if you feel it's out-of-scope. Perhaps it's worth mentioning the issue in the docs because it's fairly significant. Usually my test suite runs in fewer than half a second, but in the cases, for whatever reason, when it starts checking this proxy setting it takes > 30 seconds because for every request there is a DNS lookup. Maybe there's a separate issue here which is preventing those DNS responses from being cached; they're all for the same host.

If I wanted to do the latter, do you have understanding of what is actually happening here? I'd be happy to write something up for the documentation if I knew why these DNS requests were always happening.

sigmavirus24 commented 8 years ago

Rather than monkey patching the standard library, perhaps betamax could augment requests to bypass the check for proxies when a cassette is present.

So the point of Betamax working the way it does is to avoid monkey-patching anything. That said, if we want to prevent requests from using this, we need to monkey-patch requests (but, again, only when the cassette is already recorded).

I understand the frustration with this.

Maybe there's a separate issue here which is preventing those DNS responses from being cached; they're all for the same host.

That would be a requests specific question IMO. I can look into it tonight maybe.

sigmavirus24 commented 8 years ago

Hm, so this only ever happens when handling redirects. The proxy_bypass stdlib code is used by requests.utils.should_bypass_proxies which is used elsewhere in requests.utils.get_environ_proxies.

get_environ_proxies is used in Session#merge_environment_settings (which is called from Session#request)

And both get_environ_proxies and should_bypass_proxies are used in SessionRedirectMixin#rebuild_proxies.

In both cases, they are guarded by the trust_env setting on the Session object. In other words, if we want to avoid this without monkey-patching, we could force that to be False when we attach our adapters if we're not planning on recording new interactions. I'm a little hesitant to do that though. Would a sufficient way of addressing this be to document it?

bboe commented 8 years ago

Would a sufficient way of addressing this be to document it?

Yes, I think explaining why this occurs and a potential work around (my fix below) would be suitable.

# Temporarily work around issue with gethostbyname on OS X
import socket
socket.gethostbyname = lambda x: '127.0.0.1'

I do, however, like your solution of forcing trust_env to be False when not recording. I'm curious as to what your hesitation is. Alternatively, perhaps there could be a way for the user to hook betamax in such cases to change that setting when recording is not to take place? That way betamax needn't always do it, but it can be configured (without monkey patching). Thoughts?

sigmavirus24 commented 8 years ago

I'm curious as to what your hesitation is.

We don't change anything else about the Session besides its adapters. I can't think of a case where someone might be subclassing a Session and looking at trust_env but I think I'd like a way to let the user say "Do not override my setting for trust_env" and I'd be happy. That said, we won't know if we should override trust_env until the user calls Betamax#use_cassette. That said, use_cassette does not start recording anything. It just loads a cassette. So we would actually need to do this in Betamax#start but the problem with that is that someone can do:

with Betamax(session) as recorder:
    recorder.use_cassette('cassette_name')

So we'll have to track extra state on the Betamax object which makes me a sad :panda_face:

sigmavirus24 commented 8 years ago

107 added documentation about this for the 0.7.0 milestone. I'm going to remove that milestone now. When there is a good solution, I'll re-milestone this.

sigmavirus24 commented 8 years ago

See also http://betamax.readthedocs.org/en/latest/long_term_usage.html#known-issues

bboe commented 8 years ago

Awesome. Thanks for the update.

bboe commented 7 years ago

Closing this issue as I feel the documentation addition is sufficient.