Open simonw opened 7 months ago
A really cheap trick for dealing with unexpected not-JSON-serializable items is to use this pattern:
json.dumps(obj, default=repr)
This will use the repr() version of anything that JSON doesn't know how to serialize by default. For binary stuff you'll end up with a "b'binary stuff here'" representation, something like this:
{
"id": 1,
"blob_column": "b'\\x00\\x00\\x00\\x14ftypqt \\x00\\x00\\x00\\x00qt \\x00\\x00+\\x00moov\\x00\\x00\\x00lmvhd\\x00\\x00\\x00\\x00\\xe1\\xf7\\x1d\\x93\\xe1\\xf7\\x1d\\x96\\x00\\x00'"
}
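A minimal sketch of the pattern above, to show what actually ends up in the output. The `row` dictionary is a made-up stand-in for a table row with a BLOB column, not data from this project.

```python
import json

# Hypothetical row with a BLOB value, roughly as a SQLite driver might return it.
row = {"id": 1, "blob_column": b"\x00\x00\x00\x14ftypqt"}

# Without default=, json.dumps raises TypeError for the bytes value.
try:
    json.dumps(row)
except TypeError as exc:
    print("plain dumps failed:", exc)

# With default=repr, anything json cannot serialize is replaced by its repr() string.
encoded = json.dumps(row, default=repr)
print(encoded)

# Reading it back gives a str holding the Python literal, not bytes.
decoded = json.loads(encoded)
print(type(decoded["blob_column"]).__name__)
```

The value that comes back is a string containing Python bytes-literal syntax, so the original bytes are only recoverable by something that understands that syntax.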
Just to check, this is when accessing files with a json and/or jsonl extension?
I think so - but even just running ls seems to trigger those errors, if there are tables with BLOB columns present.
even just running ls seems to trigger those errors, if there are tables with BLOB columns present.
I think the reason this happens is that ls gets the file size, which at the moment requires reading the whole file contents.
More substantively, I've made some progress on this, and I have a test case for the bug.
I've tried your suggestion using repr, but I'm not sure it's the right solution. It results in JSON objects which contain a string value of e.g. "b'abc'", and converting that back to bytes is not part of the JSON spec (pandas, for one, doesn't handle this).
Are there any tools which do support this representation by default? Alternatively, is there a serialization which languages other than Python would more unambiguously interpret as bytes, such as an array of integers or a hex or base64-encoded string?
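For comparison, here is a small sketch of the three alternatives mentioned in the question, applied to the same sample value. It only illustrates what each form looks like in JSON; which one this project should use is left open.

```python
import base64
import json

data = b"abc"

# Three language-neutral ways to carry the same bytes in JSON.
candidates = {
    "int_array": list(data),                            # one integer per byte
    "hex": data.hex(),                                  # two hex digits per byte
    "base64": base64.b64encode(data).decode("ascii"),   # about 4 characters per 3 bytes
}
print(json.dumps(candidates))

# Each form converts back to the original bytes with standard-library calls.
from_ints = bytes(candidates["int_array"])
from_hex = bytes.fromhex(candidates["hex"])
from_b64 = base64.b64decode(candidates["base64"])
```

None of these is self-describing: a reader still has to know which fields hold encoded bytes, for example from a schema or a naming convention.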
Just a drive-by comment because I was looking into this a while ago. This might help if you are looking to use something that is generally accepted by the ecosystem.
In JSON Schema, set contentEncoding to base64 and encode the contents using Base64.
In YAML, use the !!binary tag (or the full tag:yaml.org,2002:binary URI) to inform parsers that the string is encoded binary (see Binary Data Language-Independent Type). An example in the YAML spec: Example 2.23 Various Explicit Tags.
I understand that Python's binary strings are easier to do as this is a Python project, but for the wider ecosystem I would probably recommend base64. It might even be interesting to generate a JSON Schema file to identify the binary fields; this way, editors that support JSON Schema will be able to enforce these fields being base64 encoded. (Might also be interesting for future write support.)
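One way the JSON Schema idea could look, as a rough sketch. The helper `schema_for_row` and the sample row are hypothetical and not part of this project; `contentEncoding` is the JSON Schema keyword referred to in the comment.

```python
import json

def schema_for_row(row):
    """Build a minimal JSON Schema for one row (hypothetical helper)."""
    type_names = {bool: "boolean", int: "integer", float: "number", str: "string"}
    properties = {}
    for name, value in row.items():
        if isinstance(value, bytes):
            # Binary columns are declared as base64-encoded strings.
            properties[name] = {"type": "string", "contentEncoding": "base64"}
        elif type(value) in type_names:
            properties[name] = {"type": type_names[type(value)]}
        else:
            # Unknown types (including None) are left unconstrained.
            properties[name] = {}
    return {"type": "object", "properties": properties}

schema = schema_for_row({"id": 1, "name": "clip", "blob_column": b"\x00\x01"})
print(json.dumps(schema, indent=2))
```

How strictly a given editor or validator checks `contentEncoding` may vary, since tools are free to treat it as an annotation; the schema at least documents which fields carry binary data.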
I'm seeing this error in my console - everything still works, but presumably that's caused by tables with BLOB columns that can't be represented as JSON?
I use this format to solve that:
See https://simonwillison.net/2020/Jul/30/fun-binary-data-and-sqlite/
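The format itself did not survive in this thread. The sketch below assumes it is the wrapper-object convention described in the linked post, where a bytes value becomes an object with a "$base64" flag and an "encoded" string; treat that shape, and the helper names `encode_bytes` and `decode_bytes`, as assumptions for illustration.

```python
import base64
import json

def encode_bytes(value):
    """json.dumps default= hook: wrap bytes in a tagged object (assumed format)."""
    if isinstance(value, bytes):
        return {"$base64": True, "encoded": base64.b64encode(value).decode("ascii")}
    raise TypeError(f"Object of type {type(value).__name__} is not JSON serializable")

def decode_bytes(obj):
    """json.loads object_hook: turn tagged objects back into bytes."""
    if obj.get("$base64") is True and "encoded" in obj:
        return base64.b64decode(obj["encoded"])
    return obj

row = {"id": 1, "blob_column": b"\x00\x01\x02"}
text = json.dumps(row, default=encode_bytes)
print(text)

restored = json.loads(text, object_hook=decode_bytes)
```

Because the wrapper marks itself as binary, a reader does not need a separate schema to know which values to decode, at the cost of a slightly more verbose payload than a bare base64 string.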