cognitect / transit-format

A data interchange format.

Use MsgPack bin instead of base64 #2

Open drewcrawford opened 10 years ago

drewcrawford commented 10 years ago

I maintain what is probably the only reasonable MsgPack library for ObjC (and, soon, Swift). I also support a large number of what you call "extension types", but in my case they go well beyond primitive types like URLs and dates into things like custom classes. What I'm saying is, I have spent a lot of time in this problem space.

To me the advantage of a non-JSON scheme is performance. Sure, you could define a set of extensions just for JSON, but why do that when alternate encoders like MsgPack are so much more efficient for non-JS implementations? We're on the same page there.

However, the decision to base64 the bytes is a complete non-starter for me. Base64 inflates the payload by about a third, so it takes longer in transit, and it adds an extra encode/decode pass on each end. MsgPack v2 has a perfectly adequate, binary, non-string type for you to target. Efficient transport of byte arrays is actually the thing that motivated me to write a MsgPack library in the first place.
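To make the size argument concrete, here is a minimal sketch that hand-encodes the same byte array two ways: as a MessagePack bin value, and as a MessagePack string holding base64 text. The header bytes follow the MessagePack spec; the `~b` prefix on the base64 string is an assumption about how Transit tags byte arrays, and the function names are hypothetical, not taken from any Transit or MsgPack library.

```python
import base64
import struct


def msgpack_bin(payload: bytes) -> bytes:
    """Encode payload using the MessagePack bin 8/16/32 family."""
    n = len(payload)
    if n < 2**8:
        header = struct.pack(">BB", 0xC4, n)   # bin 8
    elif n < 2**16:
        header = struct.pack(">BH", 0xC5, n)   # bin 16
    elif n < 2**32:
        header = struct.pack(">BI", 0xC6, n)   # bin 32
    else:
        raise ValueError("payload too large for bin 32")
    return header + payload


def msgpack_str(text: str) -> bytes:
    """Encode text using the MessagePack fixstr / str 8/16/32 family."""
    data = text.encode("utf-8")
    n = len(data)
    if n < 32:
        header = bytes([0xA0 | n])             # fixstr
    elif n < 2**8:
        header = struct.pack(">BB", 0xD9, n)   # str 8
    elif n < 2**16:
        header = struct.pack(">BH", 0xDA, n)   # str 16
    elif n < 2**32:
        header = struct.pack(">BI", 0xDB, n)   # str 32
    else:
        raise ValueError("text too large for str 32")
    return header + data


def base64_tagged_str(payload: bytes) -> bytes:
    """Assumption: byte arrays travel as a "~b"-prefixed base64 string."""
    return msgpack_str("~b" + base64.b64encode(payload).decode("ascii"))


payload = bytes(range(256)) * 4096             # 1 MiB of sample data
as_bin = msgpack_bin(payload)
as_b64 = base64_tagged_str(payload)
print("bin bytes:   ", len(as_bin))
print("base64 bytes:", len(as_b64))
print("ratio:       ", round(len(as_b64) / len(as_bin), 3))
```

The bin form costs a fixed header of at most five bytes, while the base64 form grows in proportion to the payload. The sketch only measures size; it says nothing about the extra encode and decode time.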

Sure, it means the JSON and MsgPack representations differ, but that's already the case for other types.

timewald commented 9 years ago

The original plan was to use binary in msgpack, and we will return to that if we can. We moved away from it because some of the msgpack libs we're using do not (yet) distinguish binary and string types while reading, presumably aligned with (one interpretation of) the implementation guidance here: https://github.com/msgpack/msgpack/blob/master/spec.md#impl-upgrade. Once we get msgpack libs across the required platforms that implement the full string vs binary split, we will move back to binary data in msgpack in place of base64.
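For readers unfamiliar with the split being discussed, the following sketch shows what "distinguishing binary and string types while reading" comes down to at the byte level. The type codes for the current spec are taken from the MessagePack spec; the description of the older layout (a single "raw" family, with the bin and str 8 codes still reserved) is my understanding of the pre-split spec and should be treated as an assumption. Both function names are hypothetical.

```python
def classify(first_byte: int) -> str:
    """Classify a value by its leading byte under the current spec."""
    if 0xA0 <= first_byte <= 0xBF or first_byte in (0xD9, 0xDA, 0xDB):
        return "str"      # fixstr, str 8, str 16, str 32
    if first_byte in (0xC4, 0xC5, 0xC6):
        return "bin"      # bin 8, bin 16, bin 32
    return "other"


def classify_legacy(first_byte: int) -> str:
    """Assumed behavior of a reader written against the older spec."""
    if 0xA0 <= first_byte <= 0xBF or first_byte in (0xDA, 0xDB):
        return "raw"      # one family for both text and bytes
    if first_byte in (0xC4, 0xC5, 0xC6, 0xD9):
        return "unknown"  # codes that were reserved before the split
    return "other"


for code in (0xA5, 0xD9, 0xDA, 0xC4, 0xC6):
    print(hex(code), classify(code), classify_legacy(code))
```

A reader in the second style either rejects bin values or hands back the same untyped blob for both families, which is the situation that rules out relying on bin across every platform.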

drewcrawford commented 9 years ago

The problem is that these implementation changes will not be forthcoming. Or, to put it another way: I realized the existing implementations weren't going to implement the new spec back in 2013 and so I wrote my own. So far my hypothesis has largely been correct.

Transit didn't create this mess, but nobody else is going to clean it up. The existing MsgPack users are all pretty satisfied with how things are.

The trouble is that Transit doesn't have any users (outside of Cognitect?) and the burden it has to overcome to collect early adopters is that it has to be better than whatever duct tape approach the early adopters are using today.

My duct tape approach includes MsgPack bin support, and I've already got working implementations for the 2 languages I care about today.

A response of "Well, nobody is working on this for [language you don't use] so we can't support bin yet" will not convince me to put away my roll of duct tape and join forces with Transit. As long as the duct tape actually works better (!) than a more formal solution I'm likely to stick with it.

(Sent by email in reply to Tim Ewald's comment of Jul 24, 2014, 9:52 AM, quoted in full above.)

jrus commented 9 years ago

Is this still the case a year later? Any plans to fix it ever? Seems to me like it makes transit-msgpack a significantly less desirable format for any payload that carries lots of raw data, e.g. images, audio, geospatial data, etc.

Is there a list of the offending messagepack implementations somewhere, so we can go pester them to fix their shit, or even submit patches? Which ones are the “required platforms” from Cognitect’s perspective?

I’d love to have some kind of good generic base to build on whenever I need to make a new data serialization and exchange format, or even to transcode various existing files into a semantically equivalent but more easily parsed / more standard format, instead of writing and optimizing special-purpose parsers in every language where I need one. Transit-msgpack seemed like a decent basis for that kind of thing, but if every bit of raw data needs to be base64-encoded, that seems like a huge waste of both filesize and parsing time.

Qqwy commented 2 years ago

What about adding support for MsgPack v2 as a separate encoding format option, just as both json and json_verbose are supported?
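A rough sketch of what such an option could look like on the writing side. Everything here is hypothetical: the `Format` enum, the `MSGPACK_V2` member, and `bytes_rep` are illustrative names, not part of any Transit implementation, and the `~b` prefix is an assumption about how byte arrays are tagged in the string-based formats.

```python
import base64
from enum import Enum


class Format(Enum):
    JSON = "json"
    JSON_VERBOSE = "json_verbose"
    MSGPACK = "msgpack"
    MSGPACK_V2 = "msgpack_v2"   # hypothetical new option


def bytes_rep(payload: bytes, fmt: Format):
    """Return the value handed to the underlying encoder for a byte array."""
    if fmt is Format.MSGPACK_V2:
        # Leave the bytes untouched so a bin-aware packer can emit a bin value.
        return payload
    # Every other format falls back to a tagged base64 string.
    return "~b" + base64.b64encode(payload).decode("ascii")


print(bytes_rep(b"hi", Format.MSGPACK))
print(bytes_rep(b"hi", Format.MSGPACK_V2))
```

Keeping the existing msgpack option unchanged would let platforms whose libraries lack bin support keep working, while readers and writers that both opt into the new format avoid base64 entirely.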