adamobeng / wddbfs

webdavfs provider which can read the contents of sqlite databases
MIT License

TypeError: Object of type bytes is not JSON serializable #1

Open simonw opened 7 months ago

simonw commented 7 months ago

I'm seeing this error in my console - everything still works, but presumably that's caused by tables with BLOB columns that can't be represented as JSON?

I use this format to solve that:

{
  "id": 1,
  "blob_column": {
    "$base64": true,
    "encoded": "iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAY..."
  }
}

See https://simonwillison.net/2020/Jul/30/fun-binary-data-and-sqlite/
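
A minimal sketch of how that envelope could be produced and reversed with the standard library, assuming the `{"$base64": true, "encoded": ...}` shape shown above. The helper names `encode_bytes` and `decode_bytes` are made up for illustration and are not part of wddbfs.

```python
import base64
import json


def encode_bytes(value):
    """Fallback for json.dumps: wrap binary values in the $base64 envelope."""
    if isinstance(value, (bytes, bytearray, memoryview)):
        encoded = base64.b64encode(bytes(value)).decode("ascii")
        return {"$base64": True, "encoded": encoded}
    # Anything else is still an error, matching json's default behaviour.
    raise TypeError(f"Object of type {type(value).__name__} is not JSON serializable")


def decode_bytes(obj):
    """object_hook for json.loads: unwrap the envelope back into bytes."""
    if obj.get("$base64") is True and "encoded" in obj:
        return base64.b64decode(obj["encoded"])
    return obj


row = {"id": 1, "blob_column": b"\x89PNG\r\n\x1a\n"}
text = json.dumps(row, default=encode_bytes)
restored = json.loads(text, object_hook=decode_bytes)
```

Because the envelope is an ordinary JSON object, readers in other languages can detect it by the `$base64` key without guessing which strings hold binary data.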

simonw commented 7 months ago

A really cheap trick for dealing with unexpected not-JSON-serializable items is to use this pattern:

json.dumps(obj, default=repr)

This will use the repr() version of anything that JSON doesn't know how to serialize by default. For binary stuff you'll end up with a "b'binary stuff here'" representation, something like this:

{
  "id": 1,
  "blob_column": "b'\\x00\\x00\\x00\\x14ftypqt  \\x00\\x00\\x00\\x00qt  \\x00\\x00+\\x00moov\\x00\\x00\\x00lmvhd\\x00\\x00\\x00\\x00\\xe1\\xf7\\x1d\\x93\\xe1\\xf7\\x1d\\x96\\x00\\x00'"
}
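
A short sketch of what that pattern does on a round trip, using a made-up row. It also shows that getting the bytes back needs a Python-specific step (`ast.literal_eval` here), since the JSON value is just a string.

```python
import ast
import json

# Hypothetical row with a binary value, standing in for a BLOB column.
row = {"id": 1, "blob_column": b"\x00\x00\x00\x14ftypqt"}

# default=repr is only called for values json cannot serialize itself.
text = json.dumps(row, default=repr)
loaded = json.loads(text)

# The blob comes back as an ordinary str that happens to look like a bytes literal.
is_plain_string = isinstance(loaded["blob_column"], str)

# Recovering the bytes means parsing that literal, which only Python can do directly.
recovered = ast.literal_eval(loaded["blob_column"])
```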

adamobeng commented 7 months ago

Just to check, this is when accessing files with a json and/or jsonl extension?

simonw commented 7 months ago

I think so - but even just running ls seems to trigger those errors, if there are tables with BLOB columns present.

adamobeng commented 7 months ago

> even just running ls seems to trigger those errors, if there are tables with BLOB columns present.

I think the reason this happens is that ls gets the file size, which at the moment requires reading the whole file contents.
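
To illustrate the mechanism, here is a hypothetical sketch (not wddbfs's actual code) of a size lookup that has to render the whole file. `render_jsonl` and `content_length` are invented names, and the in-memory tables are made up. Any serialization error then surfaces as soon as a listing asks for the size.

```python
import json
import sqlite3


def render_jsonl(conn, table):
    """Serialize every row of `table` as one JSON object per line."""
    cursor = conn.execute(f"SELECT * FROM [{table}]")
    columns = [d[0] for d in cursor.description]
    lines = [json.dumps(dict(zip(columns, r))) for r in cursor]
    return "\n".join(lines).encode("utf-8")


def content_length(conn, table):
    """Hypothetical size lookup: it renders the full file just to measure it."""
    return len(render_jsonl(conn, table))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE plain (id INTEGER, name TEXT)")
conn.execute("CREATE TABLE with_blob (id INTEGER, data BLOB)")
conn.execute("INSERT INTO plain VALUES (1, 'a')")
conn.execute("INSERT INTO with_blob VALUES (1, ?)", (b"\x00\x01",))

# A table without BLOBs can be sized without trouble.
plain_size = content_length(conn, "plain")

# Sizing the BLOB table fails, even though nobody asked to read the file.
try:
    content_length(conn, "with_blob")
    listing_error = None
except TypeError as exc:
    listing_error = str(exc)
```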

More substantively, I've made some progress on this, and I have a test case for the bug.

I've tried your suggestion using repr, but I'm not sure it's the right solution. It results in JSON objects which contain a string value of e.g. "b'abc'", and converting that back to bytes is not part of the JSON spec (pandas, for one, doesn't handle this).

Are there any tools which do support this representation by default? Alternatively, is there a serialization which languages other than Python would more unambiguously interpret as bytes, like an array of integers or a hex or base64-encoded string?
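
For comparison, a small sketch of the three candidate encodings applied to the same made-up payload, with the serialized size of each. All three survive a round trip, so the choice is mostly about size and how readily other tools recognize the format.

```python
import base64
import json

# Made-up payload covering every possible byte value.
blob = bytes(range(256))

candidates = {
    "int_array": list(blob),                           # [0, 1, 2, ...]
    "hex": blob.hex(),                                 # "000102..."
    "base64": base64.b64encode(blob).decode("ascii"),  # "AAEC..."
}

# Length of each candidate once serialized as a JSON value.
sizes = {name: len(json.dumps(value)) for name, value in candidates.items()}

# Each representation can be turned back into the original bytes.
from_ints = bytes(candidates["int_array"])
from_hex = bytes.fromhex(candidates["hex"])
from_b64 = base64.b64decode(candidates["base64"])
```

For this payload base64 is the most compact of the three and the integer array the largest, but none of them tells a reader on its own that the value is binary rather than ordinary text or numbers.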

Zegnat commented 6 months ago

Just a drive-by comment because I was looking into this a while ago. This might help if you are looking to use something that is generally accepted by the ecosystem.

I understand that Python’s bytes representation is the easier option since this is a Python project, but for the wider ecosystem I would recommend base64. It might also be worth generating a JSON Schema file that identifies the binary fields, so that editors which support JSON Schema can enforce that these fields are base64-encoded. (That could also be useful for future write support.)
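
A rough sketch of what generating such a schema might look like, assuming BLOB columns are emitted as plain base64 strings. `table_schema` is a hypothetical helper and the `files` table is made up. It uses the JSON Schema `contentEncoding` keyword, which many validators treat as an annotation rather than a hard check, so how strictly it is enforced depends on the tool.

```python
import json
import sqlite3


def table_schema(conn, table):
    """Build a JSON Schema for one row of `table`, tagging BLOB columns as base64."""
    properties = {}
    # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk) per column.
    for _cid, name, decl_type, *_rest in conn.execute(f"PRAGMA table_info([{table}])"):
        if decl_type.upper() == "BLOB":
            properties[name] = {"type": "string", "contentEncoding": "base64"}
        else:
            # Leave other columns unconstrained in this sketch.
            properties[name] = {}
    return {
        "$schema": "https://json-schema.org/draft/2020-12/schema",
        "type": "object",
        "properties": properties,
    }


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE files (id INTEGER, data BLOB)")

schema = table_schema(conn, "files")
schema_text = json.dumps(schema, indent=2)
```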