Combine cboar and cbor2?

Not so much an issue this one as a place for discussion of the possibility, raised in #44, of eventually combining the cboar and cbor2 projects (assuming I get cboar finished and released :).

The case for merging:

All cbor2 users get a decent, free speed boost
There's very little sacrifice of flexibility (though there is some - I've already locked down the initial set of encoders in cboar as an optimization because it led to a substantial speed boost)
There's no sacrifice of functionality (it's basically the same API, right down to the constructor arguments for CBOREncoder and CBORDecoder, though at the moment I haven't bothered with separate exception classes because I thought that bit of cbor2 was a bit odd - naturally I'd change that if the projects were to merge because backwards compatibility would be a concern)

The case against merging:

A C-based library would be less approachable for people to hack on
A C-based implementation doesn't have the fairly strong guarantee against memory leaks that a pure Python version has
My original thoughts for cboar were that it would remain an "optimized cbor2", i.e. cbor2 would be its "upstream" defining the functionality and API, and that for 99% of users it would be the library of choice while cboar would exist for the 1% running on weird platforms or with weird requirements who needed the speed and were prepared to sacrifice a little flexibility and the memory leak guarantees

Anyway, I'm happy to be persuaded either way but before anybody at the cbor2 end makes a choice I should probably outline some of the changes I've made in cboar (there's not many and they're fairly subtle but they might be important). I'll try and find the time to document more of those tonight (or if you're happy perusing C, go have a look at the code!).

Discuss!

I guess we could add memory leak tests to the test suite? Also, if the code was written in Cython instead, it would be much more approachable to hackers.

I think we should explore this possibility before cboar is released, to avoid clutter on PyPI.

I guess we could add memory leak tests to the test suite? Also, if the code was written in Cython instead, it would be much more approachable to hackers.

Quite possibly - I can't say I know much about Cython (beyond vaguely "what it does") so I couldn't comment whether similar performance gains could be made with that. That said, cboar's code is heavily reliant on the CPython API rather than on "pure C" (if that makes any sense) so I could certainly believe a Cython implementation would be similarly speedy.

It may depend on how things like lists, tuples and dicts are constructed in Cython (my hunch, unverified, is that a fair chunk of cboar's speed comes from the ability to pre-allocate a container and fill its slots afterwards which can't be done at the Python level).

I think we should explore this possibility before cboar is released, to avoid clutter on PyPI.

Working on piwheels, I'm rather more aware than most of just how much clutter there is on PyPI :) That said, I don't think another package makes much difference (my original intent was to release cboar with a big notice at the front stating "Go use cbor2! No, really - you don't need this unless you've actually used cbor2, profiled your system and found that it specifically is a performance issue"). Still, I'm happy to hold off on that - we've no need to have cboar on PyPI just to use it in piwheels.

Quite possibly - I can't say I know much about Cython (beyond vaguely "what it does") so I couldn't comment whether similar performance gains could be made with that.

Since Cython allows you to write native code that looks almost like Python, it would most certainly be a good idea if the performance benefits could be transferred to Cython code. Doing this would also help avoid memory leaks and other nastiness that can happen on the C level.

That said, cboar's code is heavily reliant on the CPython API rather than on "pure C" (if that makes any sense) so I could certainly believe a Cython implementation would be similarly speedy.

That's perfectly fine. On PyPy, cbor and cbor2 have almost the same performance anyway. CPython is where the speedups are really needed.

Since Cython allows you to write native code that looks almost like Python, it would most certainly be a good idea if the performance benefits could be transferred to Cython code. Doing this would also help avoid memory leaks and other nastiness that can happen on the C level.

I'm intrigued - I should have a look at Cython tonight (see whether I've been wasting my time with C ;)

Have had a good play with Cython now, including a rudimentary port of cbor2 to it (which I've named "cybor" for testing purposes), and I'm close to finishing the cboar C-based port too. A quick summary of the results so far, and my current thoughts regarding them:

The Cython port was ludicrously easy to get started with and, at the small scale, it performs quite impressively. In fact, for single values cybor manages to beat cboar in performance! This appears to be because Cython takes advantage of a fast-call mechanism so it "binds tightly" with pure Python code, rather more so than things written directly against the CPython API which jump through all sorts of hoops to make calls to Python-implemented things.
However, while its performance at the small scale is impressive, the C-based cboar implementation rapidly overtakes it as encoded/decoded values become more complex. Even a simple three-element list is enough to show this difference (see table below).
I've spent a little time optimizing cybor (the Cython port) in the obvious places and, although it's made some difference (I've managed to eek out about a 20% boost in some areas), the returns rapidly diminish. It seems that Cython is best suited to cases where a large chunk of data can be dealt with - uninterrupted - in code that translates directly to C. Sadly, that doesn't really apply to cbor2's case where there's (of necessity) a lot of jumping back and forth to Python to construct values, append them to lists, dicts, etc.

My current thinking is as follows:

The Cython port is interesting ... possibly interesting enough to influence cbor2 in some ways especially given the (fairly) minimal code changes (I need to write a bit more explaining what I've changed and why but it's too late tonight!)
However, it's nowhere near cboar in terms of performance (at larger scales), and I don't get the impression it realistically can be thus it's probably not worth pursuing further at this time.

Anyway, some links for those interested:

cybor - the Cython port - probably best to look at this first as the code should be "recognizable" to anyone familiar with cbor2's guts
cboar - the pure C port - the internal structure mirrors that of the Cython port (well, actually it was the other way around, but never mind!)

And here's some rudimentary speed-test results of cbor (the other C implementation), cboar, cybor, and cbor2. I should mention I only include cbor as a lower-bound of what is possible - I don't realistically expect to get cboar anywhere near its figures because the internal designs are quite different (and I'm of the opinion that cbor2's design is considerably "cleaner" and indeed safer in several areas):

                 +---------------------------------------+---------------------------------------+
                 |               Encoding                |               Decoding                |
+----------------+---------+---------+---------+---------+---------+----------+---------+--------+
| Test           | cbor    | cboar   | cybor   | cbor2   | cbor    | cboar    | cybor   | cbor2  |
+----------------+---------+---------+---------+---------+---------+----------+---------+--------+
| None           | 334.3ns | 5.6µs   | 5.3µs   | 5.6µs   | 198.6ns | 1.7µs    | 1.5µs   | 1.7µs  |
| 10e0           | 360.5ns | 5.6µs   | 6.1µs   | 6.4µs   | 200.9ns | 1.8µs    | 1.4µs   | 1.7µs  |
| 10e12          | 409.0ns | 5.8µs   | 6.4µs   | 6.6µs   | 234.4ns | 1.8µs    | 1.9µs   | 2.5µs  |
| 10e29          | 1.6µs   | 7.5µs   | 8.8µs   | 9.7µs   | 913.3ns | 2.8µs    | 2.5µs   | 6.2µs  |
| -10e0          | 359.0ns | 5.8µs   | 6.3µs   | 6.9µs   | 199.9ns | 1.8µs    | 1.3µs   | 1.8µs  |
| -10e12         | 409.8ns | 5.9µs   | 6.7µs   | 7.1µs   | 239.1ns | 2.0µs    | 2.0µs   | 2.7µs  |
| -10e29         | 1.7µs   | 7.7µs   | 9.0µs   | 9.9µs   | 946.3ns | 3.0µs    | 2.8µs   | 6.5µs  |
| float1         | 343.5ns | 5.8µs   | 6.4µs   | 7.3µs   | 223.8ns | 1.8µs    | 2.1µs   | 2.5µs  |
| float2         | 344.1ns | 5.9µs   | 6.3µs   | 7.3µs   | 224.5ns | 1.8µs    | 2.0µs   | 2.5µs  |
| str            | 402.5ns | 5.9µs   | 6.5µs   | 7.1µs   | 301.2ns | 1.9µs    | 1.9µs   | 2.7µs  |
| bigstr         | 1.6µs   | 6.7µs   | 7.5µs   | 8.3µs   | 1.9µs   | 3.9µs    | 3.9µs   | 4.7µs  |
| bytes          | 356.2ns | 5.6µs   | 5.9µs   | 6.4µs   | 217.6ns | 1.8µs    | 1.5µs   | 2.1µs  |
| bigbytes       | 1.0µs   | 6.8µs   | 6.7µs   | 7.5µs   | 575.1ns | 3.0µs    | 2.6µs   | 3.3µs  |
| datetime       | -       | 9.2µs   | 14.8µs  | 15.3µs  | 2.5µs   | 4.0µs    | 6.1µs   | 8.3µs  |
| decimal        | -       | 14.1µs  | 27.1µs  | 29.5µs  | 2.5µs   | 3.7µs    | 6.5µs   | 9.5µs  |
| fraction       | -       | 16.3µs  | 27.0µs  | 29.4µs  | 2.5µs   | 4.5µs    | 8.7µs   | 11.1µs |
| intlist        | 432.9ns | 6.1µs   | 9.7µs   | 10.8µs  | 312.9ns | 2.1µs    | 3.1µs   | 4.7µs  |
| bigintlist     | 62.2µs  | 253.0µs | 1.7ms   | 2.8ms   | 41.6µs  | 209.4µs  | 643.5µs | 1.8ms  |
| strlist        | 563.2ns | 6.4µs   | 10.4µs  | 12.0µs  | 483.0ns | 2.6µs    | 4.4µs   | 7.6µs  |
| bigstrlist     | 172.3µs | 411.8µs | 2.2ms   | 3.4ms   | 185.0µs | 630.3µs  | 1.4ms   | 4.1ms  |
| dict           | 572.5ns | 6.7µs   | 12.6µs  | 15.8µs  | 406.5ns | 2.9µs    | 4.5µs   | 9.6µs  |
| bigdict        | 437.9µs | 492.6µs | 1.8ms   | 2.8ms   | 584.1µs | 1006.2µs | 1.9ms   | 3.8ms  |
| set            | -       | 6.4µs   | 11.7µs  | 13.3µs  | 2.6µs   | 2.5µs    | -       | 7.9µs  |
| bigset         | -       | 96.3µs  | 616.8µs | 995.2µs | 27.5µs  | 154.3µs  | -       | 1.1ms  |
| bigdictlist    | 1.1ms   | 2.4ms   | 14.1ms  | 23.2ms  | 1.6ms   | 5.0ms    | 9.2ms   | 30.8ms |
| objectdict     | -       | 11.3µs  | 24.4µs  | 30.1µs  | 3.2µs   | 5.8µs    | 11.7µs  | 21.0µs |
| objectdictlist | -       | 360.9µs | 1.6ms   | 2.2ms   | 276.5µs | 310.6µs  | 928.6µs | 1.9ms  |
+----------------+---------+---------+---------+---------+---------+----------+---------+--------+

Time to document the changes I made to cbor2's design in making cboar, and the experimental cybor port. It's probably easiest to look at cybor to see these changes as it's much closer to the original Python (in some respects), but rest assured that what I've done in cybor is basically what cboar does too (mostly). In the sections below I've noted which improvements could be ported to cbor2, if desired:

Functions to Methods

The first major change was to move all the encoding and decoding routines into methods on the CBOREncorder and CBORDecoder classes.

I should stress at this point this wasn't out of some OO-purist desire (I like cbor2's "they're just functions" design), but more because I knew this would be rather more efficient in the C translation if I could make all encoding and decoding methods use the "METH_O" calling convention; in CPython there's several prototypes for calling methods and functions, and METH_O is the most efficient by some margin as it doesn't involve (much) messing around with tuple and dict construction just to make a function call. METH_O allows two parameters: the self parameter (to pass around the state of the encoder or decoder) and a single other parameter (in the encoder case, the value to encode). All the other prototypes allow multi-positional args and keywords, but have considerably higher calling overhead (well, except METH_NOARGS but obviously that's even more limited!). If I'd stuck with functions, the "self" parameter (in METH_O, which also applies to functions) becomes the module owning the function and I wouldn't be able to carry the encoder / decoder state into the call, hence the move to methods.

This change could be ported to cbor2 but firstly I've no idea if it would make any performance difference there (it might - I haven't investigated) and secondly it might constitute a major API change (I don't think it's common for users to call the encoding/decoding functions directly ... but it might be?).

Storing read/write instead of fp

When I was profiling cbor2 in my original attempts to optimize it, one thing that stood out was that an awful lot of method lookups were being performed. Specifically every time the encoder writes or the decoder reads, they look up the write or read method of their "fp" object. By contrast, cboar (and cybor) have a calculated "fp" property (which acts in the same fashion as cbor2's attribute) but internally they store the method to call saving all those lookups.

This change could be implemented in cbor2 with no major implications, just a small performance boost (I've measured it at nothing for the small scale of encoding a single value, and about 10% improvement for large lists and dicts).

No concatenation of bytes and strs

In encode_bytestring and encode_string the current algorithm concatenates the length prefix and the bytes to write. This looks fine, but it's potentially extremely costly when dealing with a huge byte-string (like an embedded binary file) as it requires allocating an even larger (byte-)string to pass to the write method. Better to just write the length, then separately write the content.

This can be implemented in cbor2 with no implications other than better performance at the large scale.

Pre-allocation of lists and dicts

In decode_array and decode_map, when dealing with definite length objects, cbor2 currently constructs an empty list and appends values to it. In cboar, we take advantage of some CPython's low-level APIs to pre-allocate the lists or dicts at the desired size. At the small scale (single values) this makes almost no difference, but rapidly becomes a major source of gains at larger scales (e.g. lists of 1000+ values). Unfortunately, in cybor (the Cython port) and potentially in cbor2 it isn't possible to replicate this completely (without directly calling the CPython API, but at that point one arguably has even uglier code than cboar).

However, there is one trick that can be played to improve list decoding speed: construct the list as [None] * n (effectively pre-allocating the desired number of slots) then overwrite the slots with decoded values. This trick can be seen in cybor's code and could be implemented easily in cbor2 with no side-effects. That said, while this makes a difference in cboar, I've no idea if it does in cybor or would in cbor2 (I haven't profiled it on these implementations).

Unfortunately I don't know of any trick that can efficiently pre-allocate a dict in pure Python.

Elimination of shareable_index

The one and only bit of cbor2's design I'm not so keen on is the value-sharing logic. Specifically, the passing of shareable_index around as an optional parameter on all the decoding methods is unnecessary; this can be kept as an internal instance variable and as a result all the special cases for sharing in decode_semantic can be moved into a dedicated semantic decoder for the type. This can be seen in cybor's decode_shareable (also have a look at the much cleaner decode_semantic implementation which has no special cases for sharing or sets).

I'd like to port these changes to cbor2, but I can't claim it has any performance benefits (it might do by simplifying the calling conventions - a surprising amount of Python's time is spent on calling functions - but I haven't measured it in this case)! That said, I do think the code is a bit cleaner and more maintainable as a result.

Simpler decimal encoding

The current algorithm used in cbor2 for encoding Decimal numbers is a little expensive, involving a sum of powers. This can be simplified to a loop repeatedly multiplying by 10.

This can be ported to cbor2 with no side effects.

Hard-coded type lookup

One of the major optimizations in cboar (responsible for halving encoding times at larger scales), but probably a more controversial one, is the hard-coded type lookup in the encode method. Instead of immediately going to the encoders dict, cboar first runs through some "exact" type checks (which become simple equality comparisons once expanded).

While very fast, these obviously reduce the flexibility of the encoder. One might think that it could still be customized by merely overriding the encoding methods but that won't work either as they're directly calling from the encode method without going back to CPython to find out if they've been overridden (this is a significant part of the speed-up). That said, I felt it was an acceptable trade-off partly on the basis the entire point of the port was speed, and partly because I don't think it's that common a desire to override the base encoders.

In order to still allow the same level of customization that cbor2 provides, there's a bit of a dirty hack in cboar's API. The "canonical" parameter plays a dual role:

When it is False or 0 (the default), the hard-coded lookup table is used for speed, falling back to the encoders dict if the type can't be matched.
When it is True or 1, a modified hard-coded lookup table is used, substituting float, map, and set encoding for their canonical counter-parts. Again, the encoders dict is used as a fall-back in case the type cannot be matched.
When it is another other value (e.g. 2), the hard-coded lookup tables are ignored, and only the encoders dict is used, permitting the same level of customization as in cbor2.

Obviously this is an API change, and I'm not sure if it would be acceptable to cbor2. Still, something similar could be ported to cbor2 and might improve performance (while it certainly did in the cboar port, I haven't tested anything similar in cbor2 so again I'm not sure).

Conclusions

As might be obvious from the above, at the moment my thinking is that the two projects shouldn't be merged, but that cboar might influence a little of cbor2's design. My reasons are:

The Cython-based cybor port, while very interesting (Cython is impressively easy to learn!), doesn't look likely to produce anything like the speed gains of the pure-C cboar port. If it had, I would happily have supported that route as it would mean fairly minimal code changes, and a library supporting both CPython and potentially PyPy (my understanding is that Cython has the ability to produce PyPy compatible output, though I haven't tried it).
While I think the cboar code is reasonably clean, that's a far cry from being as readable and maintainable as cbor2 (there's almost no Python left in cboar - it's almost entirely C at this point). Furthermore, the lack of compatibility with PyPy (which is a great and simple way to get a good performance boost for those cases where it's sufficiently mature), makes cboar an unsuitable replacement for cbor2.
I've now built some rudimentary leak tests into cboar, and killed off all the obvious ref-leaks but there's probably some still lurking in esoteric corners of the code. I've no idea (yet!) how someone can guarantee there're no ref-leaks in a C-based python library.

Basically I'd like to see cbor2 as the "primary" project: the "upstream design source" for cboar if you will. To that end, if it would be acceptable to @agronholm and @Sekenre I'd like to put together a series of PRs covering the changes detailed in the sections above to see if I can draw the designs of the two projects (cbor2 and cboar) as close together as possible, then release cboar (with the great big "don't use this 'til you've tried cbor2" caveat in the README) and continue to maintain it as a downstream variant of cbor2.

Thoughts?

This is great stuff. I've put my thoughts below each of your headings.

Functions to Methods

I have no objection to this. It would be worth testing the overall impact on performance.

Storing read/write instead of fp

This is a good idea, I had done some experiments with writing to a bytearray buffer but while it was good for the decoder, it made the encoder significantly more complex.

No concatenation of bytes and strs

I like this one.

Pre-allocation of lists and dicts

Seems sensible for lists except when it's an indefinite length

Unfortunately I don't know of any trick that can efficiently pre-allocate a dict in pure Python.

I had a look at the CPython code, and it seems that the dict.fromkeys() class method pre-allocates the dictionary. It might also make a difference to decode all of the key, value pairs into a pre-allocated list and then call dict(iterable) on the result:

def decode_map(self):
  length = self._decode_length(subtype, allow_indefinite=True)
  pairs = [ None ] * length
  for n in range(length):
    with self.immutable_value, self.disable_value_sharing:
      key = self._decode()
    with self.disable_value_sharing:
      value = self._decode()
    pairs[n] = (key, value)
  return dict(pairs)

I also found in testing that using a context manager for managing a couple of binary flags was really slowing things down for maps with many keys.

Elimination of shareable_index

I have no opinion here, I don't tend to use the shareable index.

Simpler decimal encoding

I like this one. I plan to use Decimal quite a bit.

Hard-coded type lookup

I would always prefer fast encoding of basic data types over customization, since every python object that you might want to serialize can be represented as a tagged collection of basic items.

Canonical encodings are meant to provide a simple way to standardize the byte-level representation of the protocol you are designing with CBOR. For example, you could say that your protocol must only consist of a header of 5 integers in a list followed by a list of 16-bit floats. On the encoding side it would be almost entirely hardcoded to your specific use case, but it could be parsed by a generic CBOR decoder. Eventually I would like to support this with a Schema style interface.

The recommended ordering of keys in a Canonical map from the RFC has already changed once and seems to be intended to allow fast validation and comparison at the byte level. This is not really something that comes up in Python except when you are testing some externally specified protocol.

Conclusions

Would it be possible to import cboar as an optional dependency in cbor2 and fall back on pure Python when it's not installed or we're running on PyPy?

By the looks of things, most of the changes are acceptable - I'll start prepping PRs to bring cbor2 as close as possible to cboar's API starting with the simplest / least-invasive bits. We can decide whether the more invasive bits (I'm thinking particularly functions->methods) are worthwhile and/or merit a major version bump as a non-backwards-compatible change.

Would it be possible to import cboar as an optional dependency in cbor2 and fall back on pure Python when it's not installed or we're running on PyPy?

An intriguing idea; slightly embarrassed I didn't think of that! As long as the APIs are kept sufficiently close that only extremely esoteric cases trip over the difference, it could be feasible.

I had a look at the CPython code, and it seems that the dict.fromkeys() class method pre-allocates the dictionary. It might also make a difference to decode all of the key, value pairs into a pre-allocated list and then call dict(iterable) on the result:

Nice idea! I'll incorporate that into the changes.

I also found in testing that using a context manager for managing a couple of binary flags was really slowing things down for maps with many keys.

Interesting - in cboar obviously I don't have context managers in C, so it's just done with a couple of arguments to an internal function. I'll run some tests and see if that's a better approach here too (my use of context managers here was largely to keep things similar to the disable_value_sharing context manager in the encoder).

An intriguing idea; slightly embarrassed I didn't think of that! As long as the APIs are kept sufficiently close that only extremely esoteric cases trip over the difference, it could be feasible.

Standard practice (as of Python 3) is to name the module _cbor2 and bring it in as a conditional import.

All in all, these ideas sound interesting. Be careful with value sharing though – I could not recall what reason I had to add the index as an argument to all decoders, but I remember grinding my teeth about it.

All in all, these ideas sound interesting. Be careful with value sharing though – I could not recall what reason I had to add the index as an argument to all decoders, but I remember grinding my teeth about it.

Yes, I think I ran into what you're alluding to when I started messing with this stuff. The shareable_index argument works as it effectively builds a stack of shareable indexes through the call-stack (for nested shareables) and allows suppression of sharing during decoding (by setting it to None rather than passing the caller's value along, as shown in cbor2's current decode_semantic function: only the first decode() call within that function passes the shareable_index along).

The alternative arrangement in cboar (and cybor) required an additional method (or an additional parameter during decoding) to temporarily suppress the value of the internal _share_index attribute but I think it achieves the same level of functionality without burdening all decoders with an additional parameter, and without giving decoders the ability to specify an arbitrary index when specifying a shareable (which was the bit that didn't sit well with me).

Anyway, just a quick heads up: I'm very nearly done with my branch merging the two projects. The result is (I think) quite nice: the API doesn't change, everything remains compatible with all the current versions of Python supported, but the C-module is only built for CPython implementations of version 3.3 or later. The test suite has been updated to handle testing both the C and the pure Python implementations where possible and the result runs happily with all versions under tox. The alterations to the pure Python implementation mentioned above are included; in my tests they do improve performance but not by as much as I'd hoped (overall there's <10% improvement, with the exception of decimal handling which is about 20% quicker). Still, that's better than nothing!

I'd originally intended to submit this as a series of PRs but as I've been putting everything together it quickly became obvious this wasn't going to be a simple 4 or 5 commits, but more like 15 or so. The resulting mess of PRs would be difficult to work with in the required order (commit one before another and the result would require rebasing). As a result I've decided to put together one mammoth PR, but consisting of a series of (hopefully) coherent simple commits which can be looked at and reviewed individually (I'm happy to rebase and edit / remove individual commits as required after review).

Nearly there!

Oh, before I forget, the current branch is here if you want a sneak preview: https://github.com/waveform80/cbor2/tree/c_module - I've got one more bit to add to that then I'll do a final tidy-up-rebase and submit it.

Ok I had a skim of your preview. It was pretty easy to follow (that's why it helps to keep the same API!) so I'm happy to review it as a giant PR.

Thanks again.

I'm going to close this off once I've updated all of the release info.

This has now been released as 4.2.0 :tada:

(minus some documentation updates)

agronholm / cbor2