apache / couchdb

Seamless multi-master syncing database with an intuitive HTTP/JSON API, designed for reliability
https://couchdb.apache.org/
Apache License 2.0

Add support for MessagePack, or other compact/binary format #2203

Open flimzy opened 4 years ago

flimzy commented 4 years ago

JSON is great for interoperability, but not so much for efficient network communication.

I would love to see a future version of CouchDB that supports a more compact data streaming format, such as MessagePack. (Ideally in both CouchDB and PouchDB, as PouchDB is where it would be most useful, but it needs CouchDB support first.)

I haven't looked deeply into it, but I expect MessagePack would be a good choice, as it can be converted fairly cleanly, 1:1, to and from JSON.

Desired Behaviour

This should not, in any way, replace JSON as the primary content type used by CouchDB. I would see it offered as an optional content type, negotiated via the Accept: and Content-Type: headers. The conversion could be handled trivially in a middleware, so it shouldn't need to affect any core logic, and that would make for an easy prototype implementation entirely outside of CouchDB.

So, for incoming requests (POSTs, PUTs, etc.) and their responses:

If the Content-Type header matches MessagePack, simply convert to JSON, then pass along to normal processing.

If the Accept: header matches MessagePack, perform the normal operation, then before streaming the response, convert from JSON to MessagePack.
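To make this concrete, here is a rough sketch of such a standalone proxy in Python. Illustrative only: the msgpack and requests packages, the application/msgpack media type, and the ports are all assumptions, not anything CouchDB ships.

```python
# Proof-of-concept middleware as a tiny standalone proxy in front of CouchDB.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

import msgpack   # assumption: the msgpack-python package
import requests  # assumption: the requests package

COUCH = "http://localhost:5984"  # upstream CouchDB (assumed default port)
MSGPACK = "application/msgpack"  # placeholder media type

class Proxy(BaseHTTPRequestHandler):
    def _forward(self, method):
        length = int(self.headers.get("Content-Length") or 0)
        body = self.rfile.read(length) if length else None

        # Incoming: if the body is MessagePack, convert to JSON, then
        # pass along to normal processing.
        if body is not None and self.headers.get("Content-Type") == MSGPACK:
            body = json.dumps(msgpack.unpackb(body)).encode()

        resp = requests.request(method, COUCH + self.path, data=body,
                                headers={"Content-Type": "application/json"})

        # Outgoing: if the client asked for MessagePack via Accept:,
        # convert the JSON response before streaming it back.
        payload, ctype = resp.content, "application/json"
        if MSGPACK in self.headers.get("Accept", ""):
            payload, ctype = msgpack.packb(resp.json()), MSGPACK

        self.send_response(resp.status_code)
        self.send_header("Content-Type", ctype)
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def do_GET(self):    self._forward("GET")
    def do_PUT(self):    self._forward("PUT")
    def do_POST(self):   self._forward("POST")
    def do_DELETE(self): self._forward("DELETE")

if __name__ == "__main__":
    HTTPServer(("localhost", 8080), Proxy).serve_forever()
```

A client would then talk to the proxy with Content-Type:/Accept: set to the MessagePack type and never see JSON on the wire, while CouchDB itself remains untouched.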


wohali commented 4 years ago

Previously rejected by the CouchDB team - though strictly speaking we only said "no BSON" explicitly. Worth reconsideration, but probably not until after 4.0, given team bandwidth.

kocolosk commented 4 years ago

Years ago someone on my team hacked up a MessagePack serialization as an experiment. The one place where it ran into problems was our chunked responses, where we start streaming without knowing the full size of the response body ahead of time. MessagePack (and BSON, and many other binary serializations) like to know the size ahead of time so they can pre-allocate appropriately-sized data structures.
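A quick way to see the problem, assuming Python's msgpack package: the very first byte of a serialized array encodes its element count.

```python
import msgpack  # assumption: the msgpack-python package

packed = msgpack.packb([1, 2, 3])
print(packed.hex())  # "93010203": 0x93 means "fixarray of exactly 3 elements"
# The element count sits in the leading byte, so the serializer must know
# the final size before it can emit anything, which is exactly what a
# chunked response cannot provide.
```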

kocolosk commented 4 years ago

That said, I'm quite happy to see discussion on this front (and the chunked thing is not an insurmountable issue). JSON is nice and easy but definitely has its weaknesses ...

flimzy commented 4 years ago

> The one place where it ran into problems was our chunked responses, where we start streaming without knowing the full size of the response body ahead of time.

Interesting. My understanding was that streaming was supposed to be one of MP's core features/strengths.

kocolosk commented 4 years ago

@flimzy I think the distinction is between streaming a set of individual objects (like our continuous _changes feed) versus streaming serialization of one really large object (_all_docs or _view or normal _changes feeds). When MessagePack talks about streaming it's referring to the former.
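To illustrate the distinction with Python's msgpack package (an illustrative sketch, not CouchDB code):

```python
import io

import msgpack  # assumption: the msgpack-python package

# Streaming a *sequence* of objects works: each row is a complete,
# self-delimiting MessagePack value, so the writer can flush rows as they
# are produced and the reader can consume them incrementally.
buf = io.BytesIO()
for row in ({"seq": 1}, {"seq": 2}, {"seq": 3}):
    buf.write(msgpack.packb(row))

buf.seek(0)
for obj in msgpack.Unpacker(buf):
    print(obj)

# Packing the same rows as *one* array is a different story: the array
# header needs the element count before the first row can be emitted.
```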

flimzy commented 4 years ago

I recently learned about CBOR, which aims for pure JSON compatibility (unlike MessagePack), so it may be a better candidate for this sort of feature.
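One concretely relevant difference: CBOR has indefinite-length arrays (RFC 8949), which map directly onto the chunked-response case described above. A hand-rolled sketch, assuming Python's cbor2 package:

```python
import io

import cbor2  # assumption: the cbor2 package

# 0x9f opens an indefinite-length array, 0xff is the "break" stop code;
# everything in between is ordinary CBOR items emitted one at a time, so
# a server can start streaming before it knows the row count.
buf = io.BytesIO()
buf.write(b"\x9f")
for row in ({"id": "a"}, {"id": "b"}):  # rows produced incrementally
    buf.write(cbor2.dumps(row))
buf.write(b"\xff")

print(cbor2.loads(buf.getvalue()))      # [{'id': 'a'}, {'id': 'b'}]
```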

fangq commented 3 years ago

I am interested in using CouchDB/NoSQL databases for storing hierarchical scientific data - for example, imaging data of different binary types and dimensions, with or without compression. I've developed specifications to enable JSON to encode common scientific data structures (http://openjdata.org/, https://github.com/fangq/jdata/blob/master/JData_specification.md#data-annotation-keywords), but to support strongly-typed binary data I need to use base64 within JSON, which increases the data file size by ~33%.
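(The ~33% is simply the base64 expansion ratio of 4 output bytes per 3 input bytes; a quick check in Python:)

```python
import base64
import os

raw = os.urandom(300_000)   # stand-in for a binary voxel array
b64 = base64.b64encode(raw)
print(len(b64) / len(raw))  # 1.333...: base64 emits 4 bytes per 3 input bytes
```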

To mitigate this, I extended the UBJSON spec (https://ubjson.org/) to support binary data types; see the Binary JData (BJData) spec:

https://github.com/fangq/bjdata/blob/master/Binary_JData_Specification.md#type_summary

BJData is similar to MessagePack/CBOR, but much simpler to encode/decode, and is also quasi-human-readable despite being a binary format. I will be very happy to contribute if there is interest in supporting BJData/UBJSON in CouchDB. I currently have MATLAB and Python parsers/writers (https://github.com/fangq/pybj, based on py-ubjson, includes both native Python code and C code).
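For a feel of the quasi-readability, a tiny sketch using the py-ubjson package that pybj builds on (illustrative only; the byte layout follows the UBJSON spec):

```python
import ubjson  # assumption: the py-ubjson package, which pybj is based on

blob = ubjson.dumpb({"name": "scan01", "dims": [256, 256, 128]})
print(blob)
# UBJSON type markers ('{', 'S', 'U', '[', ...) are plain ASCII characters,
# so the object structure and string content stay legible in the raw bytes.
```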

kocolosk commented 3 years ago

@flimzy I happened upon the CBOR spec for an entirely different reason last week and sat down to read it through. I think there's a lot to like. It handles everything I could think of from my experience working with CouchDB. I like that the authors paid special attention to round-tripping through JSON, and while it's a larger topic I particularly like the approach to extensibility that allows CBOR to handle things like timestamps and arbitrary-precision decimals. I could see a future where CouchDB allowed any JSON, but also supported a particular spec based on CBOR that would enable us to introduce a richer set of datatypes in the database.
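As a sketch of what that richer datatype story can look like, cbor2 (assumed here) maps CBOR's timestamp and decimal-fraction tags to native Python types:

```python
from datetime import datetime, timezone
from decimal import Decimal

import cbor2  # assumption: the cbor2 package

# CBOR tags carry types plain JSON cannot: cbor2 encodes datetimes as
# tag 0/1 and Decimals as tag 4 decimal fractions, and restores them on load.
doc = {"created": datetime(2021, 6, 1, tzinfo=timezone.utc),
       "balance": Decimal("19.99")}
blob = cbor2.dumps(doc)
print(cbor2.loads(blob))  # datetime and Decimal both survive the round trip
```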

@fangq as it turns out, when @rkalla was first working on UBJSON he had several discussions with @davisp and @kxepal from the CouchDB community and floated the idea of having CouchDB adopt UBJSON, first as an encoding on the wire but possibly also as an on-disk representation if the wire format took off. That was almost ten years ago, but I can imagine that a UBJSON encoding would be achievable given that history.

fangq commented 2 years ago

just a quick update - I just received a 5-year (U24) grant from the NIH (National Institutes of Health) to help migrate a vast field of neuroimaging software and data to unified JSON/UBJSON formats, so that these valuable neuroimaging datasets (results from millions of dollars of NIH investment) can be readable, reusable, and easily migrated to NoSQL databases for scaling and integration. More details will be available at http://neurojson.org

As a first step, we will convert major neuroimaging data files to JSON, which should be readily usable from a CouchDB query interface. I imagine that as parsers/libraries gradually become available, more people may adopt our Binary JData (BJData, largely derived from UBJSON) format for efficiency; so if importing BJData/UBJSON can be discussed in the CouchDB roadmap, that would be fantastic for making Couch an attractive tool for the scientific and research data community.

In the meantime, I am coordinating with the UBJSON developers on backporting BJData's new data markers (https://github.com/ubjson/universal-binary-json/issues/109), although, unfortunately, UBJSON development seems to have stalled over the past few years.

wohali commented 2 years ago

The main limiting factor is the general deprecation of attachments in CouchDB, and the reduction of max document size in CouchDB 4.x to around 10MB. We're not opposed to a richer solution, but it's orthogonal to choosing something like UBJSON/etc, and would have to be solved first.

See prior mailing list discussion:

https://lists.apache.org/thread.html/dd951bda2a8a98f204fa9dd3afbb99b2b12946bb529288829e6fe72c%40%3Cdev.couchdb.apache.org%3E

https://lists.apache.org/thread.html/4a7f29d2356349fa684fca199a6d0817d06e741fba483fd83f6ee59b%40%3Cdev.couchdb.apache.org%3E


https://lists.apache.org/thread.html/r815b25fe34996ab3c54e2fb9759de2026ddccffc8bb59966b1168063%40%3Cdev.couchdb.apache.org%3E (stopgap change for 3.x to set a finite upper bound for attachments)

fangq commented 2 years ago

thanks @wohali for the heads up - not necessarily related to supporting UBJSON/msgpack, but for the kind of data I work with, the attachment feature will be necessary - so I hope it won't be entirely retired.

you can see samples of public neuroimaging datasets, organized via the BIDS standard (https://github.com/bids-standard/), at the link below:

https://openneuro.org/public/datasets

These public datasets routinely exceed a GB in size, although they contain many individual data files/scans, and each individual file can be much smaller.

My plan is to extract as much of the metadata as possible from these data files, and place the big binary data chunks into attachments. As long as the max attachment size is adjustable, I can always dial it up when needed. But it sounds like the max document size will be a hard limit? 10MB could be pushing it, considering the scale of the datasets we will be dealing with.

wohali commented 2 years ago

Attachments aren't completely removed, but >10MB is an open design problem. It's unclear if 4.0 will launch with support for anything >10MB, due to the small FDB transaction size and the lack of consensus on how to craft a solution that works with it.