agronholm / cbor2

Python CBOR (de)serializer with extensive tag support
MIT License
226 stars 59 forks source link

Wrong/misleading documentation for `default=` argument of encoder, which leads to corrupt CBOR when followed #228

Open burnpanck opened 7 months ago

burnpanck commented 7 months ago

Things to check first

cbor2 version

5.6.2

Python version

3.12.2

What happened?

The documentation on "Customizing encoding and decoding" currently states the follows in the leading paragraph (emphasis mine):

On the encoder side, this is accomplished by passing a callback as the default constructor argument. This callback will receive an object that the encoder could not serialize on its own. The callback should then return a value that the encoder can serialize on its own, ...

This description seems to roughly match the api provided by the default= argument of the json module. Based on this description, I would have expected the following code to work:

from types import SimpleNamespace
import cbor2

def default(encoder, o):
    if isinstance(o, SimpleNamespace):
        return vars(ret)
    raise TypeError(type(o).__name__)

orig = SimpleNamespace(a=1)
encoded = cbor2.dumps(orig,default=default)
print(encoded)
decoded = cbor2.loads(encoded) # <--- throws CBORDecodeEOF
assert vars(orig) == decoded

However, in reality, the encoder expects the default callback to instead call encoder.encode(...) to serialise exactly one value. If you forget to do that, you silently end up with an invalid CBOR stream.

Thus, I believe this manifests two related but separate issues:

The fix for the first one is obvious (in fact, further down the documentation, there is an example that shows the right use of the API - otherwise I wouldn't have figured this out at all).

I argue the second issue should be fixed as-well; it seems too easy to accidentally serialise either none or multiple values to the encoder, and the consequences are too dire and too difficult to debug. Because of the misuse potential for the current API, I further believe that the json-like API is in fact better, but it may be too late to change the API at this point for backwards compatibility reasons. In this case, I suggest that the encoder actively count the number of objects serialised within a call to default and raise an exception if that number is not one. Care must be taken when default= gets called recursively though.

How can we reproduce the bug?

See MWE above.