betamaxpy / betamax

A VCR imitation designed only for python-requests.
https://betamax.readthedocs.io/en/latest/
Other
565 stars 62 forks source link

Discrepancy in original headers and playback headers (bytes vs. strs) #122

Closed wimglenn closed 2 years ago

wimglenn commented 7 years ago

I think there is something whack in the way non-ascii data is handled. I'm getting discrepancies in the response objects from live server and playback, and this is causing some issues in my app. I've tried to make a minimal example to reproduce the bug in a script below:

# coding: utf-8
from __future__ import unicode_literals

import os

import betamax, requests

headers = {'Beer': '🍺'.encode('punycode')}
url = 'http://www.example.com/'
session = requests.Session()
recorder = betamax.Betamax(session, cassette_library_dir='./example')

# I'm assuming subdir ./example exists and is empty at this point
#os.remove('./example/dump.json')   # <-- uncomment if you don't want to delete it manually
if os.path.exists('./example/dump.json'):
    raise Exception('not clean state')

with recorder.use_cassette('dump'):
    response1 = session.get(url, headers=headers)

with recorder.use_cassette('dump'):
    response2 = session.get(url, headers=headers)

h1 = response1.request.headers['Beer']
h2 = response2.request.headers['Beer']

if type(h1) != type(h2):
    print('not ok :(')
    print('header from recording: {!r}\nheader from playback: {!r}'.format(h1, h2))
else:
    print('ok :)')

It fails in both python2 and python3, but with different failure modes.
In python2 it fails because the response1 and response2 have mutated.
In python3 it fails earlier, during the recording, whilst attempting to serialise a bytestring - an action which is a TypeError in python3 (and this gives a big hint at why the python2 version fails).

--- Want to back this issue? **[Post a bounty on it!](https://www.bountysource.com/issues/38125389-discrepancy-in-original-headers-and-playback-headers?utm_campaign=plugin&utm_content=tracker%2F198445&utm_medium=issues&utm_source=github)** We accept bounties via [Bountysource](https://www.bountysource.com/?utm_campaign=plugin&utm_content=tracker%2F198445&utm_medium=issues&utm_source=github).
sigmavirus24 commented 7 years ago

Hi @wimglenn,

Sorry for the delayed response. I'm a bit sick so my response times will be kind of bad. I also haven't had a chance to toy with your reproduction test case but I sincerely appreciate you putting together a tiny one like that.

Could you also post the tracebacks you're seeing? Those would help me narrow this down and maybe guide you towards sending a pull request that fixes this. Would you be interested in doing that?

wimglenn commented 7 years ago

Sure, I can submit a PR. I just cloned and ran pytest, all 132 existing tests are passing here. I'm a bit busy with other things now but I'll see if I can write a proper test and the fix when I get the chance to look into it more closely.

Traceback below:

TypeError                                 Traceback (most recent call last)
<ipython-input-1-8769d5594677> in <module>()
     17 
     18 with recorder.use_cassette('dump'):
---> 19     response1 = session.get(url, headers=headers)
     20 
     21 with recorder.use_cassette('dump'):

/Users/wim/src/betamax/betamax/recorder.py in __exit__(self, *ex_args)
     70 
     71     def __exit__(self, *ex_args):
---> 72         self.stop()
     73         # ex_args comes through as the exception type, exception value and
     74         # exception traceback. If any of them are not None, we should probably

/Users/wim/src/betamax/betamax/recorder.py in stop(self)
    129         """Stop recording or replaying interactions."""
    130         # No need to keep the cassette in memory any longer.
--> 131         self.betamax_adapter.eject_cassette()
    132         # On exit, we no longer wish to use our adapter and we want the
    133         # session to behave normally! Woooo!

/Users/wim/src/betamax/betamax/adapter.py in eject_cassette(self)
     48         """Eject currently loaded cassette."""
     49         if self.cassette:
---> 50             self.cassette.eject()
     51         self.cassette = None  # Allow self.cassette to be garbage-collected
     52 

/Users/wim/src/betamax/betamax/cassette/cassette.py in eject(self)
    108 
    109     def eject(self):
--> 110         self._save_cassette()
    111 
    112     def find_match(self, request):

/Users/wim/src/betamax/betamax/cassette/cassette.py in _save_cassette(self)
    209             'recorded_with': 'betamax/{0}'.format(__version__)
    210         }
--> 211         self.serializer.serialize(cassette_data)
    212 
    213 

/Users/wim/src/betamax/betamax/serializers/proxy.py in serialize(self, cassette_data)
     66 
     67         with open(self.cassette_path, 'w') as fd:
---> 68             fd.write(self.proxied_serializer.serialize(cassette_data))
     69 
     70     def deserialize(self):

/Users/wim/src/betamax/betamax/serializers/json_serializer.py in serialize(self, cassette_data)
     15 
     16     def serialize(self, cassette_data):
---> 17         return json.dumps(cassette_data)
     18 
     19     def deserialize(self, cassette_data):

/usr/local/Cellar/python3/3.5.2_1/Frameworks/Python.framework/Versions/3.5/lib/python3.5/json/__init__.py in dumps(obj, skipkeys, ensure_ascii, check_circular, allow_nan, cls, indent, separators, default, sort_keys, **kw)
    228         cls is None and indent is None and separators is None and
    229         default is None and not sort_keys and not kw):
--> 230         return _default_encoder.encode(obj)
    231     if cls is None:
    232         cls = JSONEncoder

/usr/local/Cellar/python3/3.5.2_1/Frameworks/Python.framework/Versions/3.5/lib/python3.5/json/encoder.py in encode(self, o)
    196         # exceptions aren't as detailed.  The list call should be roughly
    197         # equivalent to the PySequence_Fast that ''.join() would do.
--> 198         chunks = self.iterencode(o, _one_shot=True)
    199         if not isinstance(chunks, (list, tuple)):
    200             chunks = list(chunks)

/usr/local/Cellar/python3/3.5.2_1/Frameworks/Python.framework/Versions/3.5/lib/python3.5/json/encoder.py in iterencode(self, o, _one_shot)
    254                 self.key_separator, self.item_separator, self.sort_keys,
    255                 self.skipkeys, _one_shot)
--> 256         return _iterencode(o, 0)
    257 
    258 def _make_iterencode(markers, _default, _encoder, _indent, _floatstr,

/usr/local/Cellar/python3/3.5.2_1/Frameworks/Python.framework/Versions/3.5/lib/python3.5/json/encoder.py in default(self, o)
    177 
    178         """
--> 179         raise TypeError(repr(o) + " is not JSON serializable")
    180 
    181     def encode(self, o):

TypeError: b'xj8h' is not JSON serializable
wimglenn commented 7 years ago

I'm working on it in my fork

https://github.com/wimglenn/betamax/tree/bytes_vs_text_handling

Unfortunately the fix will have to be quite intrusive, because I think what needs to be done simply can't be done with JSONSerializer. That's because it's a format which doesn't correctly preserve the types of the strings inside a python dict.

on python2, it's data destructive:

>>> d = {'key1': 'val1', 'key2': u'val2'}
>>> d
{'key1': 'val1', 'key2': u'val2'}
>>> json.loads(json.dumps(d))
{u'key1': u'val1', u'key2': u'val2'}

on python3, it bails out (more correct, but makes betamax fail hard):

>>> d = {'key1': 'val1', 'key2': b'val2'}
>>> d
{'key1': 'val1', 'key2': b'val2'}
>>> json.loads(json.dumps(d))
TypeError: b'val2' is not JSON serializable
wimglenn commented 7 years ago

https://github.com/sigmavirus24/betamax/pull/123

christianmlong commented 5 years ago

I ran in to a similar problem when using Requests, Betamax, and some third library such as suds-jurko or requests-oauthlib. Those extra libraries can introduce byte strings in to the flow, and the currently-available Betamax serializers can't handle bytes.

I wrote up some documentation about the problem, and a possible fix.

sigmavirus24 commented 5 years ago

I think we have a partial fix but it's not yet been released. @hroncok would you have time to help prepare a release?

hroncok commented 5 years ago

Sure. Not today, but sometimes during the weekend should be fine. Anything special, or just changelog + version bump + tag?

sigmavirus24 commented 5 years ago

I don't think we need anything beyond what you mentioned. Thank you in advance. I'm traveling next week and won't have time for this any time soon

kfcaio commented 2 years ago

@sigmavirus24 and @christianmlong I wrote a test for one function that downloads a large zip file using requests module. I've found discrepancy in Content-Length when comparing test execution with betamax and without it. Using Betamax, the Content-Length of the binary string extracted is way larger. Besides that, I need to pass that binary string to BytesIO and then to zipfile.ZipFile, but got zipfile.BadZipFile: Bad magic number for central directory exception. Let me know if my problem is not related to this discussion.

My test setup:

import betamax
from betamax.fixtures import unittest

mode = os.getenv('BETAMAX_RECORD_MODE')
with betamax.Betamax.configure() as config:
    config.cassette_library_dir = 'tests/test_funcs/cassettes'
    config.default_cassette_options['record_mode'] = mode
    print(f'Using record mode <{mode}>')

def the_function(session):
    # session = requests.Session()
    from io import BytesIO
    from zipfile import ZipFile

    response = session.get("https://ww2.stj.jus.br/docs_internet/processo/dje/xml/stj_dje_20211011_xml.zip")

    zip_in_memory = BytesIO(response.content)

    try:
        my_zip = ZipFile(zip_in_memory, 'r')
        my_zip.testzip()
        result = True
    except Exception:
        result = False

    return result

class BaseTest(unittest.BetamaxTestCase):
    custom_headers = None
    custom_proxies = None
    _path_to_ignore = None
    _no_generator_return_search = False

    def setUp(self):
        super(BaseTest, self).setUp()
        if self.custom_headers:
            self.session.headers.update(self.custom_headers)
        if self.custom_proxies:
            self.session.proxies.update(self.custom_proxies)
        self.worker_under_test = self.worker_class()
        self.worker_under_test._session = self.session

    def test_search(self):
        result = the_function(self.session)
        assert result

I pass the self.session to function under test and use it to get a endpoint. Through that endpoint, I get the zip file in the form of bytes string (response.content). I found that test runs without errors if I don't use the Betamax session.

Related question: https://stackoverflow.com/questions/69653406/how-to-mock-a-function-that-downloads-a-large-binary-content-using-betamax

sigmavirus24 commented 2 years ago

I don't believe your issue is related to this issue (and likely that we should close this one at this point). Please open an issue and include the headers from the response that you're recording.