betamaxpy / betamax

A VCR imitation designed only for python-requests.
https://betamax.readthedocs.io/en/latest/

Discrepancy between playback and actual headers when dealing with binary data #194

Open kfcaio opened 2 years ago

kfcaio commented 2 years ago

@sigmavirus24 I wrote a test for a function that downloads a large zip file using the requests module. The Content-Length header is the same with and without Betamax, but the actual length of response.content is not: with Betamax, the extracted byte string is much larger. I also need to pass that byte string to BytesIO and then to zipfile.ZipFile, but doing so raises zipfile.BadZipFile: Bad magic number for central directory.

My test setup:

import betamax
from betamax.fixtures import unittest
import os

# Fall back to 'once' (Betamax's default record mode) when the variable is unset.
mode = os.getenv('BETAMAX_RECORD_MODE', 'once')
with betamax.Betamax.configure() as config:
    config.cassette_library_dir = 'tests/test_funcs/cassettes'
    config.default_cassette_options['record_mode'] = mode
    print(f'Using record mode <{mode}>')

def the_function(session):
    # session = requests.Session()
    from io import BytesIO
    from zipfile import ZipFile

    response = session.get("https://ww2.stj.jus.br/docs_internet/processo/dje/xml/stj_dje_20211011_xml.zip")

    zip_in_memory = BytesIO(response.content)

    try:
        my_zip = ZipFile(zip_in_memory, 'r')
        my_zip.testzip()
        result = True
    except Exception:
        result = False

    return result

class BaseTest(unittest.BetamaxTestCase):
    custom_headers = None
    custom_proxies = None
    _path_to_ignore = None
    _no_generator_return_search = False

    def setUp(self):
        super(BaseTest, self).setUp()
        if self.custom_headers:
            self.session.headers.update(self.custom_headers)
        if self.custom_proxies:
            self.session.proxies.update(self.custom_proxies)
        self.worker_under_test = self.worker_class()
        self.worker_under_test._session = self.session

    def test_search(self):
        result = the_function(self.session)
        assert result

I pass self.session to the function under test, which uses it to request an endpoint. That endpoint returns the zip file as a byte string (response.content). The test runs without errors if I don't use the Betamax session.

Test

Session headers

{'User-Agent': 'python-requests/2.25.1', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}

Response headers

{'Accept-Ranges': 'bytes', 'ETag': 'W/"159406-1633990217000"', 'Last-Modified': 'Mon, 11 Oct 2021 22:10:17 GMT', 'Content-Type': 'application/zip', 'Content-Length': '159406', 'Date': 'Thu, 21 Oct 2021 14:37:27 GMT', 'Set-Cookie': 'BIGipServerpool_wserv=973081866.20480.0000; path=/; Httponly, TS01dc523b=016a5b383346ca02628a7c1dd47ef26e8cadf4a1b22fa9261c6b9ac1de8ac5665e99bd4a42c5b1d0af72b97105f57020b5e0f78fa7452df6080bf5ea3ee7a85d2de98968a2; Path=/; Domain=.www.stj.jus.br', 'Strict-Transport-Security': 'max-age=604800; includeSubDomains', 'Content-Security-Policy': "upgrade-insecure-requests; frame-ancestors 'self' https://*.stj.jus.br https://*.web.stj.jus.br https://stjjus.sharepoint.com/"}

Actual content length

len(response.content) == 288055

Script execution

Session headers

{'User-Agent': 'python-requests/2.25.1', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}

Response headers

{'Accept-Ranges': 'bytes', 'ETag': 'W/"159406-1633990217000"', 'Last-Modified': 'Mon, 11 Oct 2021 22:10:17 GMT', 'Content-Type': 'application/zip', 'Content-Length': '159406', 'Date': 'Thu, 21 Oct 2021 14:39:24 GMT', 'Set-Cookie': 'BIGipServerpool_wserv=973081866.20480.0000; path=/; Httponly, TS01dc523b=016a5b3833746a54a2d1276a2b3de87f48f672e9cd7c18c4dad842ddddeac244bcbcf1a470b59eecf83bd6a3bdeffc7c7017210981de929d01df6c054118625399d2b04ad2; Path=/; Domain=.www.stj.jus.br', 'Strict-Transport-Security': 'max-age=604800; includeSubDomains', 'Content-Security-Policy': "upgrade-insecure-requests; frame-ancestors 'self' https://*.stj.jus.br https://*.web.stj.jus.br https://stjjus.sharepoint.com/"}

Actual content length

len(response.content) == 159406
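
The jump from 159406 to 288055 bytes is consistent with the body having been decoded as text and re-encoded somewhere along the way, though nothing in this thread confirms that this is what actually happened. The sketch below only illustrates that failure mode with the standard library: it builds a small zip in memory, round-trips it through a lossy UTF-8 decode (an assumption about the mechanism, not Betamax's actual code path), and checks the result. The names make_zip and is_valid_zip are made up for this example.

```python
from io import BytesIO
from zipfile import ZIP_STORED, ZipFile


def make_zip():
    """Build a small in-memory zip whose payload contains non-UTF-8 bytes."""
    buffer = BytesIO()
    with ZipFile(buffer, 'w', compression=ZIP_STORED) as archive:
        archive.writestr('data.bin', bytes(range(256)) * 16)
    return buffer.getvalue()


def is_valid_zip(raw):
    """Return True only if raw opens as a zip and every member passes testzip()."""
    try:
        with ZipFile(BytesIO(raw), 'r') as archive:
            return archive.testzip() is None
    except Exception:
        # Same broad catch as the_function above: any failure means "not usable".
        return False


original = make_zip()

# Assumed failure mode: the bytes are treated as UTF-8 text and written back out.
# Every invalid byte becomes U+FFFD, which takes three bytes once re-encoded.
mangled = original.decode('utf-8', errors='replace').encode('utf-8')

print('original length:', len(original), 'valid:', is_valid_zip(original))
print('mangled length: ', len(mangled), 'valid:', is_valid_zip(mangled))
```

The mangled copy is longer than the original and no longer opens as a zip, which matches the two symptoms reported here: an inflated len(response.content) and a BadZipFile error.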

I'm using Python 3.8.2, Betamax 0.8.1, and Requests 2.25.1, with Pytest 5.4.1 to run the tests.

Related question: https://stackoverflow.com/questions/69653406/how-to-mock-a-function-that-downloads-a-large-binary-content-using-betamax

Related issue: https://github.com/betamaxpy/betamax/issues/122

sigmavirus24 commented 2 years ago

Can you try setting preserve_exact_body_bytes=True on your config? See https://betamax.readthedocs.io/en/latest/api.html?highlight=bytes#forcing-bytes-to-be-preserved. I wonder if we need a heuristic around Content-Type: application/zip.
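
For anyone landing here later, one way to apply that suggestion to the setup from the original post might look like the fragment below. It is a minimal sketch that assumes the option is set globally through default_cassette_options, the same dictionary the original setup uses for record_mode; the linked documentation is the authoritative description of the option.

```python
import betamax

with betamax.Betamax.configure() as config:
    # Ask Betamax to store response bodies byte-for-byte instead of as text.
    # Assumption: set globally here; it applies to cassettes recorded afterwards.
    config.default_cassette_options['preserve_exact_body_bytes'] = True
```

A cassette recorded before the option was enabled still holds the old body, so it would presumably need to be deleted and re-recorded for the change to show up.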

kfcaio commented 2 years ago

Thank you for your quick response. It worked, but no HTTP interactions were recorded when using BETAMAX_RECORD_MODE=all:

{"http_interactions": [], "recorded_with": "betamax/0.8.1"}

Is it expected?

sigmavirus24 commented 2 years ago

No, but all is not generally advisable. Why are you using all?

kfcaio commented 2 years ago

@sigmavirus24 my bad, I was creating a new session somewhere in my actual script. It worked as expected, thank you! I think you may close this one

sigmavirus24 commented 2 years ago

Would you want to add a heuristic via PR for that content-type to automatically preserve the exact body bytes? I think that is a reasonable feature request and PR, and it should be small-ish in effort.

kfcaio commented 2 years ago

Sure : )

sigmavirus24 commented 2 years ago

If it helps to get started, https://github.com/betamaxpy/betamax/blob/2c12cee59ac365f39497a3718eed04ab9c6ce988/src/betamax/util.py#L58-L59 is where I'm thinking we need a change. I suspect, however, that we want to keep that from becoming too complicated to read, so if you want to make the condition a separate function I'm :+1: on that.
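
As a starting point for that PR, the condition could be pulled out into a helper along these lines. This is only a sketch: should_preserve_exact_bytes and BINARY_CONTENT_TYPES are hypothetical names that do not exist in Betamax, the gzip check reflects an assumption about what the existing condition already covers, and the real change would have to fit whatever util.py does at the linked lines.

```python
# Hypothetical helper, not part of Betamax's API.
# Media types whose bodies should never be coerced to text (assumed list;
# only application/zip was discussed in this thread).
BINARY_CONTENT_TYPES = frozenset({'application/zip'})


def should_preserve_exact_bytes(headers, preserve_exact_body_bytes=False):
    """Decide whether a response body should be stored byte-for-byte.

    headers is any mapping of response header names to values. With requests
    this would be a case-insensitive mapping; a plain dict is used here, so
    header names must match exactly.
    """
    if preserve_exact_body_bytes:
        # The user asked for it explicitly.
        return True
    if 'gzip' in headers.get('Content-Encoding', ''):
        # Assumption: compressed bodies are already treated as binary.
        return True
    # Drop parameters such as "; charset=..." and normalise case.
    media_type = headers.get('Content-Type', '').split(';', 1)[0].strip().lower()
    return media_type in BINARY_CONTENT_TYPES


print(should_preserve_exact_bytes({'Content-Type': 'application/zip'}))
print(should_preserve_exact_bytes({'Content-Type': 'text/html; charset=utf-8'}))
```

Keeping the decision in one small function leaves the calling code as a single readable condition, and further content types can be added to the set without touching the logic.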