Support uint16/uint32/uint64 markers and ND array header (BJData draft 1)

hi team, as indicated in this UBJSON's issue tracker, I am interested in continuing the stalled UBJSON development and its future maintenance.

I derived a MarkDown version of the Draft 12 of the UBJSON spec from your documentation (thank you for that wonderful summary!), and added a few long-desired new features, i.e.

new markers to uniquely map uint16 (u), uint32 (m), uint64 (M) and half/float16 (h) data
an optimized ND-array count construct (# followed by a 1D integer array construct)

The improved UBJSON format, now named as "Binary JData Specification" (BJData) will be officially maintained at

https://github.com/OpenJData/bjdata

The BJData spec is largely backward compatible (i.e. support all existing UBJSON markers), with the only exception that it no longer maps NaN/+infinity/-infinity to Z.

In the past few days, I read through your py-ubjson encoder/decoder, and added the initial support to the new markers and count construct.

I am wondering if you can

help me review the patch and suggest if any additional change I should make
let me know if you are interested in accepting this patch (or revised versions) to py-ubjson so that all existing users is able to gain these functionalities.

the updated C parser already contains a macro USE__BJDATA to enable all added features. If desired, a similar option flag can be added to the pure python encoder/decoder to support UBJSON and BJData explicitly.

the support to float16 (h) marker is incomplete - the e marker to pack/unpack float16 data in python is only supported since Python 3.6. I am not sure what is the preferred way to add this while maintaining backward compatibility. Happy to hear your suggestions.

Dear @fangq, thank your for your submitted PR and apologies for the delay in responding - I'm pretty busy at the moment.

It's nice to hear that you've taken on evolving what started out as UBJSON and I'm glad that my condensed specification document in this repository was useful. However, based on changes in the new BJData 0.5 Draft 1 specification that you propose, I would recommend having a separate package which handles BJData (e.g. by forking this repo as a starting point). If it also happens to be able to handle (older) UBJSON specifications, then even better. I've listed some reasons below for why I think that is the best course of action.

Backwards compatibility

Most people install py-ubjson without specifying a version requirement, i.e. they will download the latest available version. Introducing new markers means that objects encoded with the new proposed version will not be decodable with older/existing/stable versions, nor will they be decodable with other ubjson libraries of the same (Draft 12) level. At this point in time I don't think this would be acceptable given UBJSON and py-ubjson have been around for a while and (most) established libraries provide the same version support. Additionally, even though Python 2 is EOL, there definitely are still some people who use it.

Publishing this as a new independent package on the other hand would signal that BJData is not strictly backwards compatible but a real format in its own right. If you also offered a mode to encode/decode against the (now old) UBJSON spec, that would be very useful and enable people to more easily transition to BJData, if they wanted. Also, you could then choose to only support Python 3, which would make the codebase considerably cleaner.

Ongoing BJData evolution

We currently use ubjson in a production environment where stability & maturity are key. Given BJData is currently versioned at 0.5 Draft 1 and has only appeared last month on GitHub, I assume that further iterations on the spec might happen. (E.g. are your expecting it to be peer-reviewed? Might there be additional changes worth adding?)

Licensing

As I'm sure you've seen, py-ubjson has an Apache 2.0 license and thus you are free to take all resources herein and use them as you see fit (e.g. to create a BJData package), preserving the copyright/licenses notices to show historic origins.

If you'd still like detailed feedback on the changes in this PR, let me know. I'm very happy to have a look through it. A few high-level suggestions I have from the brief look I've taken:

Maybe it's worth considering for the public methods to accept a "version" parameter, which could default to a particular BJData version but let users also specify other version levels, such as the old UBJSON Draft 12 level or other BJData versions in the future. That way it would be easier to remain backwards compatible if you evolve the spec further. (Having a build-time flag I think isn't that useful because typically end-users do not change such flags when they install a package, nor do PyPI packages have a concept of "same version, different behaviour", where two packages releases are made with different build flags, for the same target architecture.)
As per CONTRIBUTING.md, it'd be best to ensure the unit test suite passes and that any new functionality has extra test cases added. (See also coverage_test.sh on how to get both Python & C code coverage results.)

@vtermanis, thanks for the detailed feedback. I will create a py-bjdata package from my fork of your py-ubjson, and maintain it from there (thanks again for making this package in the first place).

Ongoing BJData evolution - We currently use ubjson in a production environment where stability & maturity are key. Given BJData is currently versioned at 0.5 Draft 1 and has only appeared last month on GitHub, I assume that further iterations on the spec might happen. (E.g. are your expecting it to be peer-reviewed? Might there be additional changes worth adding?)

there is an interest to incorporate these new features to the next draft of the UBJSON spec, but the discussion is still on-going, see https://github.com/ubjson/universal-binary-json/issues/109

regardless, I will be happy to maintain the bjdata spec as well as the derived python package separately, until there is sufficient interest in the community/users to adopt these new additions in the future.

If you'd still like detailed feedback on the changes in this PR, let me know. I'm very happy to have a look through it. A few high-level suggestions I have from the brief look I've taken:

these are very good suggestions, and I will try to make the package backward compatible by introducing new flags. May come back to you for questions if any specific function that I am not sure.

closing this PR for now.

Iotic-Labs / py-ubjson