betamaxpy / betamax

A VCR imitation designed only for python-requests.
https://betamax.readthedocs.io/en/latest/

Recording cassettes in production? #133

Closed smallnamespace closed 6 years ago

smallnamespace commented 7 years ago

Do you have any advice or tips for recording cassettes in a production setting?

My use case is pretty simple, and is probably something others have encountered: the API I'm hitting is quite unreliable and there are a lot of exceptional cases to handle (and to test!).

It'd be nice to be able to catch exceptional cases as they are happening in the wild, get them onto a cassette, and then move the cassette into version control as a reference point for future development. Has anyone tried this that you know of?

sigmavirus24 commented 7 years ago

Do you have any advice or tips for recording cassettes in a production setting?

Don't do it. 😄

the API I'm hitting is quite unreliable and there are a lot of exceptional cases to handle (and to test!).

That is possible with Betamax, but you might also want to try to log as much information about these as possible and use something like requests-mock for some of them. That's especially true for timeouts, connection errors, etc.

It'd be nice to be able to catch exceptional cases as they are happening in the wild, get them onto a cassette, and then move the cassette into version control as a reference point for future development. Has anyone tried this that you know of?

No, and I can't imagine a reliable way of doing this that wouldn't harm your production deployment.

smallnamespace commented 7 years ago

Is your concern mainly about performance, or something else?

I know it sounds crazy, but the API I'm hitting has a very low throttling threshold, so I'm mostly trying to squeeze out as much information as I can per request that I make.

If not Betamax, I'd probably end up serializing a rolling history of the whole session and deleting it as I go along... which sounds a lot like just making cassettes and deleting them if nothing exceptional occurs (in fact googling around for requests serialization is how I ended up here).

So I guess my real question is, is there anything special Betamax does that would be much more unreliable or slower than that approach?

sigmavirus24 commented 7 years ago

Is your concern mainly about performance, or something else?

It's about correctness. Betamax has a few record modes (specifically all) that would do what it sounds like you want, but you need to close the connection after each request or set of requests to ensure that it's actually written to disk. And if you record everything, nothing will be played back. Having Betamax between your session and your service won't cause issues, but closing and reopening cassettes means that Betamax will try to replay the existing interactions before recording new ones. That means that if you made a similar request an hour earlier and make it again, you'll get old data instead of the newest, correct data.

Keep in mind that serializing/deserializing large amounts of JSON is also not fast. You'd probably want to write a custom serializer class for Betamax that uses a better serialization format.

If you want to cache things, I'd strongly suggest you look at CacheControl but it sounds like you also want to record exception responses to be able to test for them and then build workarounds?

smallnamespace commented 7 years ago

If you want to cache things, I'd strongly suggest you look at CacheControl but it sounds like you also want to record exception responses to be able to test for them and then build workarounds?

Yes exactly, I don't care about caching, just about ease of identifying exception cases and then testing or developing against them. Right now I'm doing everything in an ad-hoc way by just looking at logs, but I really need to log server responses as well.

Luckily, my interaction with the API can be logically grouped into separate sessions. Does this sound slightly less crazy?

  1. Create a new cassette for every session, with record mode set to all, and close the connection after each session
  2. If nothing exceptional happened (as defined by my code), delete the cassette, otherwise store it. Is there a better way to do this than just deleting the file from the cassette directory?
  3. Repeat
  4. Later, I can grab the new cassettes from the workers and select which ones make it into the test cassette library. Here we can make use of betamax again by seeing if any of our existing test cassettes already cause the exception that we just recorded.
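
The steps above can be sketched as a small context manager. This is only a sketch under assumptions: `SessionOutcome`, `keep_if_exceptional` and `fake_record` are hypothetical names, and the recording itself is replaced by a stand-in that writes a file so the example runs with the standard library alone. In real use the inner block would be a Betamax cassette opened with record mode all, nested inside the helper so that the cassette is already on disk when the helper decides whether to keep it.

```python
import contextlib
import os
import tempfile


class SessionOutcome:
    """Flag the calling code sets when a session hit an exceptional case."""

    def __init__(self):
        self.exceptional = False


@contextlib.contextmanager
def keep_if_exceptional(cassette_path):
    """Delete the cassette on exit unless the session was exceptional.

    Hypothetical helper: the recorder is expected to have written
    cassette_path by the time this context manager exits.
    """
    outcome = SessionOutcome()
    try:
        yield outcome
    except Exception:
        # An uncaught error counts as exceptional, so keep the cassette.
        outcome.exceptional = True
        raise
    finally:
        if not outcome.exceptional and os.path.exists(cassette_path):
            os.remove(cassette_path)


def fake_record(cassette_path):
    """Stand-in for a recording session: writes an empty cassette file."""
    with open(cassette_path, "w") as handle:
        handle.write('{"http_interactions": []}')


library_dir = tempfile.mkdtemp()
boring = os.path.join(library_dir, "session-1.json")
odd = os.path.join(library_dir, "session-2.json")

with keep_if_exceptional(boring):
    fake_record(boring)  # nothing unusual happened, file is removed on exit

with keep_if_exceptional(odd) as outcome:
    fake_record(odd)
    outcome.exceptional = True  # e.g. the API returned a malformed body
```

After this runs, only `session-2.json` should remain in the directory. What counts as "exceptional" stays entirely in the calling code, which matches step 2.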

As for custom serializing, does just running everything through pickle work? Probably for further debugging we'd want some way to re-serialize it to JSON for inspection, is this as simple as calling deserialize on my custom thing and then reserialize through the pretty JSON one?

sigmavirus24 commented 7 years ago

Does this sound slightly less crazy?

Slightly. Yes. 😄

Is there a better way to do this than just deleting the file from the cassette directory?

That's really the best way, unless you include something akin to a datetime in the cassette name (which avoids having to remove existing ones immediately afterwards). In that sense you could have a cron job which might be more reliable for collecting/removing non-exceptional cassettes.
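
A rough sketch of that idea, using only the standard library. The function names, the timestamp format and the age-based rule are all assumptions made for illustration; in particular, it assumes exceptional cassettes have already been moved out of the directory before the cleanup job runs.

```python
import os
import tempfile
import time
from datetime import datetime, timezone


def timestamped_cassette_name(prefix, now=None):
    """Build a sortable cassette name with a UTC timestamp suffix."""
    now = now or datetime.now(timezone.utc)
    return "{}-{}".format(prefix, now.strftime("%Y%m%dT%H%M%SZ"))


def purge_old_cassettes(directory, max_age_seconds, now=None):
    """Remove files older than max_age_seconds and return their names.

    Assumption: anything still in `directory` after that long was not
    worth keeping, because exceptional cassettes were moved elsewhere.
    """
    now = time.time() if now is None else now
    removed = []
    for entry in sorted(os.listdir(directory)):
        path = os.path.join(directory, entry)
        if os.path.isfile(path) and now - os.path.getmtime(path) > max_age_seconds:
            os.remove(path)
            removed.append(entry)
    return removed


# Small demonstration in a throwaway directory.
demo_dir = tempfile.mkdtemp()
stale = os.path.join(demo_dir, timestamped_cassette_name("stale") + ".json")
fresh = os.path.join(demo_dir, timestamped_cassette_name("fresh") + ".json")
for path in (stale, fresh):
    open(path, "w").close()
two_hours_ago = time.time() - 7200
os.utime(stale, (two_hours_ago, two_hours_ago))  # pretend it is old

removed = purge_old_cassettes(demo_dir, max_age_seconds=3600)
```

A cron job could call `purge_old_cassettes` on each worker. Note that a one-second timestamp can collide if two sessions start within the same second, so a worker id or counter in the prefix may be needed.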

As for custom serializing, does just running everything through pickle work?

I mean we give the serializers a dictionary with a very particular structure. As long as your serializer plugin can return that to us that's fine. You can pick whatever is fastest/best for you. I'm just pointing out that simplejson/built-in-json are not fast modules.

Probably for further debugging we'd want some way to re-serialize it to JSON for inspection, is this as simple as calling deserialize on my custom thing and then reserialize through the pretty JSON one?

You can just use the PrettyJSON one to start with and if you notice performance problems switch to something more performant.
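
If a binary format does turn out to be necessary, the round trip asked about above could look roughly like the following. This is a standard-library sketch, not Betamax's serializer plugin API: the function names are hypothetical, the sample dictionary is only a placeholder for whatever Betamax hands to a serializer, and it assumes that dictionary contains only JSON-compatible values.

```python
import json
import pickle


def pickle_serialize(cassette_data):
    """Dump the cassette dictionary to bytes with pickle."""
    return pickle.dumps(cassette_data, protocol=pickle.HIGHEST_PROTOCOL)


def pickle_deserialize(raw_bytes):
    """Load the cassette dictionary back from pickled bytes."""
    return pickle.loads(raw_bytes)


def to_pretty_json(cassette_data):
    """Re-serialize the same dictionary as indented JSON for inspection."""
    return json.dumps(cassette_data, sort_keys=True, indent=2)


# Placeholder cassette-like dictionary; the keys are illustrative only.
cassette = {
    "http_interactions": [
        {
            "request": {"method": "GET", "uri": "https://httpbin.org/get"},
            "response": {"status": {"code": 200}},
        }
    ],
    "recorded_with": "example",
}

blob = pickle_serialize(cassette)
pretty = to_pretty_json(pickle_deserialize(blob))
```

One caveat with pickle: unpickling data from an untrusted source can execute arbitrary code, so cassettes collected from workers should be treated accordingly.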

smallnamespace commented 7 years ago

Cool, I'll let you know how this pans out :) For step (1), is there a way to force writing to disk even before the connection is closed?

sigmavirus24 commented 7 years ago

Cool, I'll let you know how this pans out :) For step (1), is there a way to force writing to disk even before the connection is closed?

As in you're using stream=True a lot? No.

sigmavirus24 commented 7 years ago

That's a limitation of how we serialize data about a request/response cycle

smallnamespace commented 7 years ago

I meant the urllib3 connection -- we can keep reusing that right? E.g. if I do:

import requests
from betamax import Betamax

s = requests.Session()
with Betamax(s).use_cassette('example') as vcr:
    r = s.get('https://httpbin.org/get')

# Cassette will be written at this point, even though the HTTP connection
# could still be in the adapter's pool ready for re-use
r2 = s.get('https://httpbin.org/get')

From browsing through the code (it's really clean btw), it looks like Betamax.__exit__() guarantees the write?

UPDATE: Looks like BetamaxAdapter only closes its own default adapter, but leaves up connections that old adapters may have set up?

sigmavirus24 commented 7 years ago

UPDATE: Looks like BetamaxAdapter only closes its own default adapter, but leaves up connections that old adapters may have set up?

Correct, we use whatever adapters were already on the session if we need to actually talk to the internet. We don't mess with those connections and we don't use our own connections from urllib3. We just implement the TransportAdapter API for Requests.

elnuno commented 7 years ago

To add to the craziness level, you can try auto-generating cassette names and creating a new one for each request. Something like this:

    def request(self, *args, **kwargs):
        try:
            if args[0] in self.cacheable_methods:
                cassette_name = self.get_cassette_name_from_url(args[1])
                with self.betamax.use_cassette(cassette_name):
                    response = self._http.request(*args, timeout=TIMEOUT,
                                                  **kwargs)
            else:
                response = self._http.request(*args, timeout=TIMEOUT, **kwargs)
        except Exception as exc:
            raise RequestException(exc, args, kwargs)

        return response

Sorry about that.

Anyway, I think something like this (but well written) could add granularity so you can figure which API endpoints are flakiest.
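
The snippet above calls `get_cassette_name_from_url` without showing it. One possible shape for such a helper, purely as an assumption about what it might do, is to keep the host and path and drop the query string so that all requests to one endpoint share a name:

```python
import re
from urllib.parse import urlsplit


def get_cassette_name_from_url(url):
    """Turn a URL into a filesystem-safe name, ignoring query and fragment.

    Hypothetical helper; the real one in the snippet above is not shown.
    """
    parts = urlsplit(url)
    raw = "{}{}".format(parts.netloc, parts.path)
    # Collapse each run of characters outside [A-Za-z0-9.-] into one underscore.
    name = re.sub(r"[^A-Za-z0-9.-]+", "_", raw).strip("_")
    return name or "root"


name = get_cassette_name_from_url("https://httpbin.org/get?x=1")
```

Dropping the query string is a design choice: it groups cassettes per endpoint, which fits the goal of spotting flaky endpoints, but it also means different queries to the same endpoint would map to the same cassette name.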

hroncok commented 6 years ago

@smallnamespace Any progress with this?

sigmavirus24 commented 6 years ago

@hroncok due to the lack of activity, I think it makes sense to close this out. I wasn't expecting documentation to come from this anyway