Open talagluck opened 2 weeks ago
DuckDB has a json extension which might be worth looking at.
I've tried to write a "flatten" function in the past (I called it `unwind`) that would take a field containing JSON or a similarly structured value and turn it into a struct field. The problem, however, is that scalar functions in DataFusion have to have consistent, known output types, so you can't have a function that returns an arbitrary compound type.
The alternative in the short term is to provide `jaq` support in a function, which would let you query a JSON object (with `jq`-like syntax) to project out another value. It'd probably have to always return a string, which you could then pass to the `arrow_cast` function.
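To make the "always return a string, then cast" idea concrete, here is a minimal sketch in plain Python. It is not `jaq` or `jq`: the helper name `json_query_as_string` and its tiny path syntax (only `.key` and `[index]` steps) are hypothetical stand-ins, and the final `int(...)` call just plays the role that a cast such as `arrow_cast` would play in a query.

```python
import json
import re

# Hypothetical helper: supports only `.key` and `[index]` steps,
# which is a very small subset of jq-like path syntax.
_STEP = re.compile(r"\.([A-Za-z_][A-Za-z0-9_]*)|\[(\d+)\]")


def json_query_as_string(doc, path):
    """Project a value out of a JSON document, always as a string (or None)."""
    value = json.loads(doc)
    pos = 0
    while pos < len(path):
        step = _STEP.match(path, pos)
        if step is None:
            raise ValueError(f"unsupported path syntax at offset {pos}: {path!r}")
        key, index = step.group(1), step.group(2)
        try:
            value = value[key] if key is not None else value[int(index)]
        except (KeyError, IndexError, TypeError):
            return None  # missing path: the analogue of a SQL NULL
        pos = step.end()
    # Fixed output type: strings pass through, everything else is re-serialized.
    return value if isinstance(value, str) else json.dumps(value)


# Made-up sample document for illustration only.
doc = '{"user": {"login": "octocat", "repos": [{"stars": 42}]}}'

login = json_query_as_string(doc, ".user.login")
stars_text = json_query_as_string(doc, ".user.repos[0].stars")
stars = int(stars_text)  # the caller casts the string to the type it wants
```

Because the function's output type never depends on the data, it fits the constraint on scalar functions described above; the cost is that every caller has to cast explicitly.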
I'd previously pushed back on using `jaq` over `jq` because, without a specification, I didn't want to set false expectations given different implementations and edge cases. The problem with using `jq` directly is that it has an inlining feature that can read from the filesystem, which is a security hole that I don't think we can plug safely.
I think we should definitely check with @scsmithr that the new function stuff he's been working on will not get in the way of these kinds of casting operations.
Description
It would be really nice if we could provide more utilities for data exploration of json/non-tabular data. Something like the ability to unpack nested data in a column in a query. The fact that we enable SQL for joining tabular with non-tabular data is really cool, but I think a lot of JSON doesn't really lend itself to this without doing additional processing work beforehand.
Given a JSON with a list of GH users for instance:
Currently, each user will be a row, but the rest of the user info will all be crammed into a single column. It would be nice to do something equivalent to this Pandas code:
e.g. maybe
That's pretty hacky, and I know we're not going to cover every situation, but if we could address a couple of common formats, that could be good. And maybe we go with `normalize` instead of `flatten`.
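For reference, the kind of flattening being asked for can be sketched in plain Python. This is only an illustration under assumptions: the `flatten_record` helper, the dotted column naming, and the sample user records are all made up here, and it is not the snippet from the original issue or a proposed implementation.

```python
def flatten_record(record, parent_key="", sep="."):
    """Flatten nested dicts into a single-level dict with dotted column names."""
    flat = {}
    for key, value in record.items():
        column = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested objects, carrying the column prefix along.
            flat.update(flatten_record(value, column, sep))
        else:
            # Scalars and lists are kept as-is in this sketch.
            flat[column] = value
    return flat


# Hypothetical data: a list of users where each user has a nested "plan" object.
users = [
    {"login": "octocat", "id": 1, "plan": {"name": "pro", "space": 400}},
    {"login": "hubot", "id": 2, "plan": {"name": "free", "space": 100}},
]

# One flat row per user, with nested fields promoted to their own columns.
rows = [flatten_record(user) for user in users]
```

Each row ends up with columns such as `plan.name` and `plan.space`, so the nested data becomes queryable like any other tabular column. Lists are left untouched here, which is one of the cases a real `normalize` would have to decide how to handle.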