Large cassettes cause memory errors

kratsg commented 4 years ago

 _____________________________ test_get_image_model _____________________________
 auth_client = <itkdb.client.Client object at 0x7fc9cba70970>
 tmpdir = local('/tmp/pytest-of-root/pytest-0/test_get_image_model0')
     def test_get_image_model(auth_client, tmpdir):
         with betamax.Betamax(
             auth_client, cassette_library_dir=itkdb.settings.ITKDB_CASSETTE_LIBRARY_DIR
         ) as recorder:
             recorder.use_cassette('test_images.test_get_image', record='none')
             image = auth_client.get(
                 'uu-app-binarystore/getBinaryData',
                 json={
                     'code': 'bc2eccc58366655352582970d3f81bf46f15a48cf0cb98d74e21463f1dc4dcb9'
                 },
             )
             assert isinstance(image, itkdb.models.Image)
             assert image.filename == 'PB6.CR2'
             assert image.format == 'cr2'
             temp = tmpdir.join("saved_image.cr2")
             nbytes = image.save(filename=temp.strpath)
 >           assert nbytes == 64603006
 tests/integration/test_image.py:34: 
 _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
 /usr/local/lib/python3.8/site-packages/betamax/recorder.py:72: in __exit__
     self.stop()
 /usr/local/lib/python3.8/site-packages/betamax/recorder.py:131: in stop
     self.betamax_adapter.eject_cassette()
 /usr/local/lib/python3.8/site-packages/betamax/adapter.py:50: in eject_cassette
     self.cassette.eject()
 /usr/local/lib/python3.8/site-packages/betamax/cassette/cassette.py:110: in eject
     self._save_cassette()
 /usr/local/lib/python3.8/site-packages/betamax/cassette/cassette.py:205: in _save_cassette
     self.sanitize_interactions()
 /usr/local/lib/python3.8/site-packages/betamax/cassette/cassette.py:179: in sanitize_interactions
     i.replace_all(self.placeholders, True)
 /usr/local/lib/python3.8/site-packages/betamax/cassette/interaction.py:69: in replace_all
     self.replace(*placeholder.unpack(serializing))
 /usr/local/lib/python3.8/site-packages/betamax/cassette/interaction.py:63: in replace
     self.replace_in_body(text_to_replace, placeholder)
 _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
 self = <betamax.cassette.interaction.Interaction object at 0x7fc9cbc0dee0>
 text_to_replace = '', placeholder = '<ACCESSCODE2>'
     def replace_in_body(self, text_to_replace, placeholder):
         for obj in ('request', 'response'):
             body = self.data[obj]['body']
             old_style = hasattr(body, 'replace')
             if not old_style:
                 body = body.get('string', '')

             if text_to_replace in body:
 >               body = body.replace(text_to_replace, placeholder)
 E               MemoryError
 /usr/local/lib/python3.8/site-packages/betamax/cassette/interaction.py:89: MemoryError

When a previously recorded cassette is very large, trying to replace causes an error. Any ideas how to get around this?

kratsg commented 4 years ago

Personally, it would be helpful that cassettes that get ejected don't need to be saved. I think that's what is happening, even if it's meant to be read-only.

sigmavirus24 commented 4 years ago

             if text_to_replace in body:
 >               body = body.replace(text_to_replace, placeholder)

This is treating the body as a string so we're doing str.replace here because that's worked for what has seemed to be every other use-case. We'd need a way to not have a giant string in memory (which we need to do to build the cassette unfortunately) and we'd need a way to in-place replace the string (which python doesn't allow because strings are immutable).

I'd be open to discussing how we can fix this, but I haven't spent much time on this library in a long time and I don't know of a quick solution here aside from loading the data into something like a StringIO and manipulating that (and then figuring out how to manipulate that with a JSONEncoder/Decoder)

sigmavirus24 commented 4 years ago

Personally, it would be helpful that cassettes that get ejected don't need to be saved. I think that's what is happening, even if it's meant to be read-only.

In this case we're trying to look for and replace the placeholders you've defined. We do it every time because one most likely wants to have tests that appear to work and so a placeholder being substituted back in works well for pretty much everyone.

kratsg commented 4 years ago

In this case we're trying to look for and replace the placeholders you've defined. We do it every time because one most likely wants to have tests that appear to work and so a placeholder being substituted back in works well for pretty much everyone.

So the only replacement I do here for placeholders is for bearer token. I would be open to a way to disable bearer token substitution when running these tests on a previously recorded cassette (assuming I do not want to create it again). [Since it's already replaced in the header as part of recording in the first place].

kratsg commented 4 years ago

So there are two short-term options on the table here:

1) edit the cassette to reduce the size (truncate it). This might work in the case where you don't actually care about the body contents, but just need to check workflow/integration tests. 2) allow betamax to be configured to skip replacing placeholders manually

Here's how option 2 would work for pytest using pytest --no-placeholders from command line:

def pytest_addoption(parser):
    parser.addoption(
        "--no-placeholders", action="store_true", help="skip making placeholders"
    )   

def configure_betamax(no_placeholders):
    placeholders = { 
        'accessCode1': itkdb.settings.ITKDB_ACCESS_CODE1,
        'accessCode2': itkdb.settings.ITKDB_ACCESS_CODE2,
    }   

    def filter_bearer_token(interaction, current_cassette):
        # Exit early if the request did not return 200 OK because that's the
        # only time we want to look for Authorization-Token headers
        if interaction.data['response']['status']['code'] != 200:
            return

        headers = interaction.data['request']['headers']
        token = headers.get('Authorization')
        # If there was no token header in the request, exit
        if token is None:
            return

        # Otherwise, create a new placeholder so that when cassette is saved,
        # Betamax will replace the token with our placeholder.
        current_cassette.placeholders.append(
            betamax.cassette.cassette.Placeholder(
                placeholder='Bearer <ACCESS_TOKEN>', replace=token[0]
            )   
        )   

    betamax.Betamax.register_serializer(pretty_json.PrettyJSONSerializer)
    with betamax.Betamax.configure() as config:
        config.default_cassette_options['serialize_with'] = 'prettyjson'
        if not no_placeholders:
            config.before_record(callback=filter_bearer_token)
            for key, value in placeholders.items():
                config.define_cassette_placeholder(
                    '<{}>'.format(key.upper()), replace=value
                )   

def pytest_configure(config):
    configure_betamax(config.option.no_placeholders)

The long-term option would be to work with StringIO/BytesIO and iterate over line-by-line (or skip parsing the body if possible).

kratsg commented 4 years ago

I have a happy workaround for now. In this case, I know when I'm getting binary data from the API I use, so i can do the following:

    # In cases where we get a large amount of binary data from the server, we'll truncate this.
    if 'uu-app-binarystore/getBinaryData' in interaction.data['request']['uri']:
        interaction.data['response']['body']['string'] = interaction.data['response'][
            'body'
        ]['string'][:1000]

to truncate the cassette automatically. This works pretty well -- since I don't care about the actual binary data content.

sigmavirus24 commented 4 years ago

I'm re-opening this since I think it's a legitimate problem betamax might want to solve for.

betamaxpy / betamax

Large cassettes cause memory errors #187