Closed samuelcolvin closed 2 months ago
To be clear, I don't think this is a blocker, just something to think about.
We can get around this ambiguity but providing:
json_get_path(json: str, *key: str | int)
Or maybe just making that the signature of json_get
, and thereby mostly avoiding the ambiguity (and the need to parse the JSON twice) I think.
I think I have a solution for this using unions...
I think I have a solution for this using unions...
Yes, I think this is likely the only way to go -- Snowflake uses a VARIANT
type for this as I understand: https://docs.snowflake.com/en/sql-reference/data-types-semistructured
BTW I am not sure how mature UnionArray
support is in DataFusion. But I think there are several other contributors who are interested too
BTW 2 I think @WenyXu has some other ideas here: https://github.com/apache/datafusion/issues/7845#issuecomment-2068061465
I think unions solve this provided we can find a solution to https://github.com/apache/datafusion/issues/10180.
This is solved mostly by rewriting the query.
@alamb As you'll see I've started work in #1 and pydantic/jiter#84.
But I've realised we might need some to differentiate between nested Arrays and Objects, represented as strings, and JSON strings.
Consider the following cases:
json_get('{"foo": "bar"}', 'foo') -> 'bar'
json_get('{"foo": [1, 2, 3]}', 'foo') -> '[1, 2, 3]'
The returned values represent very different things, but unless introduce some new type, would both be represented as strings.
Even worse:
json_get('{"foo": "[1, 2, 3]"}', 'foo') -> '[1, 2, 3]'
- here the return value exactly matches the case above, even though the JSON is differentThe main case where this becomes problematic is when you want to do:
Clearly the simplest solution is some kind of
JSON
marker type, but I've no idea how hard this is to define within datafusion?