dlt-hub / dlt

data load tool (dlt) is an open source Python library that makes data loading easy 🛠️
https://dlthub.com/docs
Apache License 2.0

use msgpack for json serialization/deserialization #1539

Closed rudolfix closed 1 month ago

rudolfix commented 1 month ago

Background

We use orjson as our main json serializer/deserializer. We had problems with the library being unstable, ie.

Tasks

    • [ ] make sure it can be used at all :) does it work without defining types?
    • [ ] can we make it pass our json test suite? note that we use custom serializers and need to be able to serialize fields in a deterministic order (for generating hashes etc.), so the tests must pass
    • [ ] we should compare speed and memory usage on large, nested json files. ping rudolfix for a test case
    • [ ] if there's a choice, we prefer to work with bytes rather than strings (like orjson does)
    • [ ] pretty formatting may be ignored...
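
For illustration, the deterministic-order and bytes requirements from the task list can be sketched with the standard library alone. This is a minimal sketch, not dlt's actual serializer: `custom_default`, `dumps_deterministic` and `content_hash` are hypothetical names, and the handled types are assumptions.

```python
import hashlib
import json
from datetime import date, datetime
from decimal import Decimal
from typing import Any


def custom_default(obj: Any) -> Any:
    """Hypothetical custom serializer for types plain JSON cannot represent."""
    if isinstance(obj, (datetime, date)):
        return obj.isoformat()
    if isinstance(obj, Decimal):
        return str(obj)
    raise TypeError(f"Cannot serialize {type(obj).__name__}")


def dumps_deterministic(obj: Any) -> bytes:
    """Serialize to bytes with a stable field order."""
    # sort_keys fixes the field order; compact separators remove whitespace variance
    text = json.dumps(
        obj, sort_keys=True, separators=(",", ":"), default=custom_default
    )
    return text.encode("utf-8")


def content_hash(obj: Any) -> str:
    """Hash the serialized bytes; equal content gives an equal hash."""
    return hashlib.sha256(dumps_deterministic(obj)).hexdigest()


# Same content, different insertion order
a = {"id": 1, "amount": Decimal("10.50"), "created": date(2024, 7, 15)}
b = {"created": date(2024, 7, 15), "amount": Decimal("10.50"), "id": 1}

assert dumps_deterministic(a) == dumps_deterministic(b)
assert content_hash(a) == content_hash(b)
```

Any candidate library would need to reproduce this property (same content, same bytes) for the hash-based tests to keep passing.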

If all goes well, this ticket will have a followup where we'll add msgpack as an optional dependency and update our docs on how to switch it on, so that we are able to swap json parsers at any moment.
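
One way to picture "swap json parsers at any moment" is a small backend registry. The sketch below is an assumption for illustration only and does not describe how dlt selects its parser: `JsonBackend`, `BACKENDS` and `get_backend` are hypothetical names. The orjson backend is imported lazily so it stays an optional dependency.

```python
import json
from typing import Any, Callable, Dict, NamedTuple


class JsonBackend(NamedTuple):
    """Hypothetical pair of functions every parser backend must provide."""
    dumps: Callable[[Any], bytes]
    loads: Callable[[bytes], Any]


def _stdlib_backend() -> JsonBackend:
    return JsonBackend(
        dumps=lambda obj: json.dumps(
            obj, sort_keys=True, separators=(",", ":")
        ).encode("utf-8"),
        loads=json.loads,  # json.loads accepts bytes as well as str
    )


def _orjson_backend() -> JsonBackend:
    import orjson  # optional dependency, imported only when selected

    return JsonBackend(
        dumps=lambda obj: orjson.dumps(obj, option=orjson.OPT_SORT_KEYS),
        loads=orjson.loads,
    )


BACKENDS: Dict[str, Callable[[], JsonBackend]] = {
    "stdlib": _stdlib_backend,
    "orjson": _orjson_backend,
}


def get_backend(name: str = "stdlib") -> JsonBackend:
    """Look up a backend by name; unknown names raise ValueError."""
    if name not in BACKENDS:
        raise ValueError(f"Unknown json backend: {name!r}")
    return BACKENDS[name]()


backend = get_backend("stdlib")
payload = backend.dumps({"b": 1, "a": 2})
assert backend.loads(payload) == {"a": 2, "b": 1}
```

Because both backends work over bytes and sort keys, callers would not need to change when the backend name changes.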

donotpush commented 1 month ago

Msgpack and orjson/simplejson use different serialization formats. Msgpack serializes to a binary format that is not human-readable, and it cannot load or write JSON files: all attempts to process JSON data with it fail.

Data can be converted between msgpack and JSON by going through Python objects (msgpack handles binary to/from Python objects), but msgpack itself is designed to read and write msgpack binary data exclusively.

donotpush commented 1 month ago

I have failed to make an implementation that passes all the tests:

[Screenshot, 2024-07-15 at 15:12:44]

My code can be found in the branch exp/1539-msgspec-json

The following notebook explores custom encoding (related to the failing test test_json_named_tuple):

https://github.com/dlt-hub/dlt/blob/exp/1539-msgspec-json/notebooks/msgspec_exp.ipynb

rudolfix commented 1 month ago

@donotpush I'll take a look today. IMO there's still hope :)

rudolfix commented 1 month ago

@donotpush thanks for the research. it is clear that msgspec is not a drop-in replacement for our current framework and requires a little more work.

my take

  1. we can close this ticket
  2. we can open another one: use msgspec for json serialization/deserialization and to replace typed-json with msgpack

the idea for the ticket is to:

  1. use its json encoder/decoder for pure json. our tests are actually passing (I added key sorting and it works). it can also pretty format json
  2. use its msgpack encoder/decoder for typed_dump - which will produce a msgpack document, not our typed-json document.

pls add this followup ticket and move it to todo. we may implement it at some point