agronholm / cbor2

Python CBOR (de)serializer with extensive tag support
MIT License

N^2 Performance on decoding large definite bytestrings #240

Open kpwelsh opened 3 weeks ago

kpwelsh commented 3 weeks ago

Things to check first

cbor2 version: 5.6.4

Python version: 3.9, 3.10, 3.12.1, 3.12.3

What happened?

I work with a lab that receives fairly high-frequency, large, compressed images in CBOR format from a piece of hardware. Frequencies range from 10 Hz to 200 Hz, and the compressed images range from 1 MB to 24 MB in size.

After failing to read messages quickly enough, we discovered that cbor2.loads was the culprit and that >90% of the time was spent in decode.c:613/614.

I understand that this N^2 memcopy was introduced in #204 to address memory-crash issues caused by malicious inputs. However, this implementation is unacceptable for us, and we have switched to a different Python CBOR implementation.

I would prefer to continue using this one, as it is clearly more active/supported :).

How can we reproduce the bug?

Easy to reproduce, and easy to see from the code. Here is a script that makes a plot comparing it to the cbor package, though :).

import time
import cbor2

MB = 1024 * 1024
# Each message is a single definite-length bytestring of i megabytes.
messages = [cbor2.dumps(b"0" * (i * MB)) for i in range(1, 20)]
sizes = [len(message) / MB for message in messages]  # sizes in MB, to match the axis label

# Time cbor2.loads on each message.
cbor2_times = []
for message in messages:
    start = time.time()
    cbor2.loads(message)
    end = time.time()
    cbor2_times.append(end - start)

import cbor

# Time cbor.loads on the same messages for comparison.
cbor_times = []
for message in messages:
    start = time.time()
    cbor.loads(message)
    end = time.time()
    cbor_times.append(end - start)

import matplotlib.pyplot as plt

plt.plot(sizes, cbor2_times, label="cbor2")
plt.plot(sizes, cbor_times, label="cbor")
plt.xlabel("Message size (MB)")
plt.ylabel("Time (s)")
plt.legend()
plt.show()
agronholm commented 3 weeks ago

I can see why this would have N^2 performance. But if I were to preallocate the entire thing, it would allow easy attacks using malicious data that specifies a gigantic bytestring without actually providing that data. On the other hand, if I just store an array of buffers, I would eventually need to copy them into a larger one, briefly requiring 2x the amount of memory.
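
For illustration, the chunk-list approach described above amounts to something like the following in pure Python; read here stands in for the decoder's internal read method, and the whole function is a sketch rather than the library's actual code:

def read_definite_bytestring_chunked(read, length, chunk_size=65536):
    # Accumulate fixed-size chunks in a list and join once at the end.
    # Total copying is O(n), but the final join briefly needs roughly
    # 2x the bytestring's size in memory.
    chunks = []
    left = length
    while left:
        chunk = read(min(left, chunk_size))
        if not chunk:
            raise ValueError("premature end of stream")
        chunks.append(chunk)
        left -= len(chunk)
    return b"".join(chunks)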

Suggestions?

kpwelsh commented 3 weeks ago

Briefly requiring 2x the memory is not an issue for us.

I appreciate the guard against the attacks, and the desire not to regress there.

Could we:

  1. Allow a "trusted" version of loads that doesn't worry about it? In our case at least, we are communicating on a secure LAN between our own managed processes, so malicious attacks are not a concern.
  2. Check to see if we might run out of memory and fall back to the safer load if it might be an issue? (A rough sketch of this idea follows below.)

Disclaimer: no idea how to do 2.
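
A rough sketch of what idea 2 could look like in the bytestring path, assuming psutil is acceptable for checking free memory; the 2x headroom factor, the function itself, and the fallback parameter (standing in for the existing chunked implementation) are all illustrative assumptions, not cbor2 API:

import psutil  # assumption: used only to query available memory


def read_definite_bytestring_checked(read, length, fallback, chunk_size=65536):
    # Preallocate only when the declared length clearly fits in available
    # memory; otherwise defer to fallback, i.e. the current safe behaviour.
    if length * 2 >= psutil.virtual_memory().available:
        return fallback()
    buffer = bytearray(length)                      # single upfront allocation
    offset = 0
    while offset < length:
        chunk = read(min(length - offset, chunk_size))
        if not chunk:
            raise ValueError("premature end of stream")
        buffer[offset:offset + len(chunk)] = chunk  # fill in place, no reallocation
        offset += len(chunk)
    return bytes(buffer)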

kpwelsh commented 3 weeks ago

Note:

After realizing that the other cbor library doesn't support RFC 8949 and only supports RFC 7049, we switched back to this one using the pure-Python implementation, but made our own decoder subclass that doesn't use a chunked buffer copy.

This should be an acceptable solution for a while (perhaps forever), but it would be nice to use a C implementation here :). The difference does not completely break our workflow, though.
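
The core of such a workaround is simply replacing the chunked accumulation with a single read of the declared length; how it gets wired into a CBORDecoder subclass depends on the pure-Python decoder's internal dispatch, which varies between versions, so only the replacement logic is sketched here (read again stands in for the decoder's read method):

def read_definite_bytestring_trusted(read, length):
    # Trusted-input variant: read the whole declared length in one call
    # instead of accumulating 64 KiB chunks.  Only safe when the input is
    # known to actually contain `length` bytes (e.g. fully buffered messages
    # from our own hardware); a malicious declared length would otherwise
    # trigger a huge allocation inside read().
    data = read(length)
    if len(data) != length:
        raise ValueError("premature end of stream")
    return data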

Changaco commented 3 weeks ago

I suggest using an mmap instead of a bytearray. Something like this:

                 left = length
-                buffer = bytearray()
+                try:
+                    buffer = mmap(-1, left)
+                except (OSError, OverflowError):
+                    raise CBORDecodeValueError("invalid length for bytestring 0x%x" % left)
                 while left:
-                    chunk_size = min(left, 65536)
-                    buffer.extend(self.read(chunk_size))
-                    left -= chunk_size
+                    chunk = self.read(min(left, 65536))
+                    buffer.write(chunk)
+                    left -= len(chunk)

                 result = bytes(buffer)

and its equivalent in C.

Changaco commented 3 weeks ago

In pure Python the “conversion” to a bytestring at the end is actually a copy, so both the current code and the one I suggested require 2× the memory. In C it should be possible to create a bytestring pointing directly to the mmapped data, thus avoiding the copy.
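
For reference, a small pure-Python illustration of that final copy; the sizes are arbitrary, and memoryview is shown only to contrast zero-copy access with the bytes conversion, since returning a memoryview would change the decoder's API:

from mmap import mmap

SIZE = 4 * 1024 * 1024                  # 4 MiB anonymous mapping (arbitrary)
buffer = mmap(-1, SIZE)
buffer.write(b"\x00" * SIZE)            # stand-in for the chunked self.read() loop

as_bytes = bytes(buffer)                # full copy: ~4 MiB of extra memory
as_view = memoryview(buffer)            # zero-copy view into the mapping
print(len(as_bytes), len(as_view))      # both report 4194304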