agronholm / cbor2

Python CBOR (de)serializer with extensive tag support
MIT License
215 stars 57 forks source link

Pure Python vs. C inconsistency with QueryDict #121

Open felipeochoa opened 2 years ago

felipeochoa commented 2 years ago

Serializing a QueryDict from Django gives different results when using the optimized and pure python versions of the libary. Here's a small test case that shows the different values produced.

import cbor2
from django.http import QueryDict

def test_query_dict(self):
    assert cbor2.loads(cbor2.dumps(QueryDict('a=1&b=2')) == {'a': '1', 'b': '2'} # Pure python
    assert cbor2.loads(cbor2.dumps(QueryDict('a=1&b=2')) == {'a': ['1'], 'b': ['2']} # C

QueryDict is a bit of an odd class, but I think supporting the invariant loads(dumps(d))[key] == loads(dumps(d[key])) should guide the implementation, which makes the pure Python version the correct one here

Sekenre commented 2 years ago

From the looks of things, a QueryDict is a kind of multi-map, so the second version looks correct. Can you post the hex encoded output of the dumps call?

Sekenre commented 2 years ago

Also, I can't get your code to run. Could you provide a working test script that demonstrates the problem?

felipeochoa commented 2 years ago

Here's a close to minimal repro:

from _cbor2 import loads as cloads, dumps as cdumps
from cbor2.decoder import loads as pyloads
from cbor2.encoder import dumps as pydumps

class MVD(dict):

    def __getitem__(self, key):
        return super().__getitem__(key)[-1]

    def __setitem__(self, key, value):
        super().__setitem__(key, [value])

    def items(self):
        for key in self:
            yield key, self[key]

data = MVD()
data['a'] = '1'
data['b'] = '2'

assert cloads(cdumps(data)) == {'a': ['1'], 'b': ['2']}
assert pyloads(pydumps(data)) == {'a': '1', 'b': '2'}
Sekenre commented 2 years ago

Thanks for the clear example, that's very helpful. Here's what I think is going on.

Since the QueryDict is a subclass of dict the pure-python code is calling data.items() which has the special last value behaviour. The c-module is using https://docs.python.org/3/c-api/mapping.html#c.PyMapping_Items which does not call the python method, it reaches directly into the dict data structure.

If you want the same behaviour for both I would suggest calling the QueryDict.dict() method on your data before you pass it to the encoder.

If you want to preserve all the values with the py module; pass dict(QueryDict._iterlists()) to the encoder.

felipeochoa commented 2 years ago

Thanks. Yes -- I worked around it already with the QueryDict.dict method. (Arguably what I should have done to begin with)

I just filed this issue since I thought it exposed inconsistencies worth fixing or at least documenting. Maybe the answer is "don't use dict subclasses with cbor2"?