betamaxpy / betamax

A VCR imitation designed only for python-requests.
https://betamax.readthedocs.io/en/latest/
Other
566 stars 62 forks source link

POST data match fails #65

Closed Glueon closed 9 years ago

Glueon commented 9 years ago

Doing requests I rely on POST fields. So I am using Betamax with:

match_requests_on=['host', 'path', 'method', 'body']

Unfortunetly betamax did not take any requests from cassettes. After diging a little bit I found out that the problem appears if I use more than one POST field.

Default body matcher does:

    recorded_request = deserialize_prepared_request(recorded_request)
    return recorded_request.body == (request.body or b'')

But in fact the order of post fields in recorded_request.body and request.body is different. That's why it didn't match anything.

So it is either a body serialization issue or a matcher one. It does not preserve the order in which fields appear, but takes the order into account while matching bodies.

As a workaround I used:

class OrderlessBodyMatcher(BaseMatcher):
    name = 'orderless_body'

    def match(self, request, recorded_request):
        from urllib.parse import parse_qs
        return parse_qs(request.body) == parse_qs(recorded_request['body']['string'])

Maybe default implementation should fixed in such a way?

sigmavirus24 commented 9 years ago

If you install betamax-matchers you can use the URLEncodedBodyMatcher by doing:

from betamax import Betamax
from betamax_matchers.form_urlencoded import URLEncodedBodyMatcher

Betamax.register_request_matcher(URLEncodedBodyMatcher)

def test_urlencoded_body():
    s = requests.Session()
    m = ['method', 'uri', 'form-urlencoded-body']
    with Betamax(s).use_cassette('urlencoded', match_requests_on=m):
        r = s.post('https://httpbin.org/post', data={
            'form-field-0': 'value-0',
            'form-field-1': 'value-1',
            'form-field-2': 'value-2',
        })
Glueon commented 9 years ago

Thanks, I'll try it out. But isn't such behaviour of the default 'body' matcher a littble bit misleading? Because request bodies are the same. Is there a purpose for?

Glueon commented 9 years ago

I tried to use form-urlencoded-body but it still misses the cache. The code I use is alsmost the same as yours. I think that if fails on Content-Type check. Because in cassettes json files Content-Type there is text/html.

sigmavirus24 commented 9 years ago

But isn't such behaviour of the default 'body' matcher a littble bit misleading?

No it isn't. Most of the default checks are, by design, simplistic. They follow the rules of their equivalent checks in VCR. The body matcher by default simply checks equality, because that's all that is necessary if you do:

import requests
import betamax

s = requests.Session()
recorder = betamax.Betamax(s, cassette_library_path='.')
with recorder.use_cassette('example_body_matcher'):
    s.post(url, data=b'my data to send to the server', headers={'Content-Type': 'text/plain'})

The person using betamax knows what they need far better than betamax can guess. In this very specific case, if I had built form-encoding parsing into the body matcher, I would only have done it for bodies that have Content-Types of application/x-www-form-urlencoded and it would still be failing because your Content-Type apparently is text/html so it would be broken in this case anyway.

Glueon commented 9 years ago

I agree that form-urlencoded-body does what it does and it should fail in my case because content-type does not match whatevet it should to be.

But in my opinion if I do POST request with b=1 and a=1 params, so the request body is b=1&a=1 I expect the same request with the same parametes to be taken from a disk, because body is the same.

I would agree that body matcher can skip the request with body b=1&a=1 if earlier I did a=1&b=1. Conceptually request is the same, but the order is different and I am the one who changed that order explicitly.

As I see betamax implicitly changes the order of the parameters while serializing the body. So even I always put POST args in the same order that choosen order will affect the result. And I do not see such implicit behaviour to be ok and I doubt ruby version does that.

# This will be always fetched
r = s.post('https://httpbin.org/post', data={
'b': 'foo',
'a': 'foo'})
# This will be fetched only once and then take it from disk
r = s.post('https://httpbin.org/post', data={
'a': 'foo',
'b': 'foo'})

The body matcher by default simply checks equality, because that's all that is necessary

Yes and they should be equal if deseriazes(serialize(body)) == body. Which in your case is false.

sigmavirus24 commented 9 years ago

Please research how dictionaries work in Python. They have no deterministic order. When requests sees that dictionary, the order of the post body will be one of two things everytime. That's because it is a dictionary not because requests or betamax are doing this implicitly

Glueon commented 9 years ago

Sorry, that was a bad example. But my point is that inside match recorded_request['body']['string'] and request.body are the same except for the order when I do the same POST request. I do not understand when that happens, but it causes eqallity comparison to fail, because you compare strings while probably you should compare dictionaries to ignore the order. I do not understand why it shouldn't.

Are the bodies of two the same post requests the same? Yes. Should the body matcher find that request? I think so. Where am I wrong?

sigmavirus24 commented 9 years ago

Are the bodies of two the same post requests the same?

They're semantically the same. a=foo&b=foo and b=foo&a=foo are semantically the same. Take a look at this example:

>>> r = requests.post(url, data={'a': 'b', 'c': 'd', 'e': 'f'})
>>> r.request.body
'a=b&c=d&e=f'

That value (a=b&c=d&e=f) is exactly the value that betamax stores in a cassette. The next time I make that request this happens:

>>> r = requests.post(url, data={'a': 'b', 'c': 'd', 'e': 'f'})
>>> r.request.body
'c=d&e=f&a=b'

Here are some reasons why we don't compare dictionaries:

requests.post(url, json={'foo': 'bar'})
with open('afile.txt', 'rb') as fd:
    contents = fd.read()
requests.post(url, data=contents)
requests.post(url, data={'foo': 'bar'}, files={'biz': 'baz', 'bar': 'abcde'})

We can't turn each of those into dictionaries. And in your case, the Content-Type would be misleading as to whether or not we could (which is what we would have to inspect in order to determine how to parse a body into a dictionary, and a dictionary also isn't the only data structure that a body could be parsed into).

As I see betamax implicitly changes the order of the parameters while serializing the body.

I assume you haven't read the code. If you had, you'd see we do not modify the body. We will store it converted as a base64 encoded blob when asked (or in certain other special cases) but we never modify the body. So, no, we're not implicitly changing the order of anything when we serialize the body.

But my point is that inside match recorded_request['body']['string'] and request.body are the same except for the order when I do the same POST request.

Correct. That happens anyway, with or without betamax. This happens in requests. Inspect the response.request.body attribute to see it. If you're on Python 2, you'll have a harder time seeing it inside the same interpreter session, so you'll want to make a script to print that information and run it twice.

If you don't want to maintain a special matcher for your code, you can do the following

r = requests.post(url, data=[('foo', 'bar'), ('biz', 'baz')])

As long as the data is always in that order, it will always match because requests will always serialize it in that exact order:

>>> r = requests.post(url, data=[('a', 'b'), ('c', 'd'), ('e', 'f')])
>>> r.request.body
'a=b&c=d&e=f'
>>> r = requests.post(url, data=[('a', 'b'), ('c', 'd'), ('e', 'f')])
>>> r.request.body
'a=b&c=d&e=f'
>>>
Glueon commented 9 years ago

Thank you for detailed explanation. Lot to learn. Haven't thought it's requests which creates a query string with params in different order.

Though I still think body matcher should check if queries are semantically equal but I accept your point that default matchers should be as simple as possible.

P.S. Thank you for your talk on PyCon.

sigmavirus24 commented 9 years ago

@Glueon I'm glad you found the talk useful. In that case, do you mind if I close this issue?

Glueon commented 9 years ago

No, thank you.