datafusion-contrib / datafusion-functions-json

JSON / JSONB support for DataFusion (unofficial)
https://crates.io/crates/datafusion-functions-json
Apache License 2.0

How do we mark the difference between a string value, and an Object or Array represented as JSON #2

Closed samuelcolvin closed 2 months ago

samuelcolvin commented 2 months ago

@alamb As you'll see I've started work in #1 and pydantic/jiter#84.

But I've realised we might need some way to differentiate between nested Arrays and Objects, represented as strings, and JSON strings.

Consider the following cases:

The returned values represent very different things, but unless we introduce some new type, both would be represented as strings.

Even worse:

The main case where this becomes problematic is when you want to do:

json_get(json_get('{"foo": {"spam": "bar"}}', 'foo'), 'spam')

# or if we introduce arrow syntax
'{"foo": {"spam": "bar"}}'->>'foo'->>'spam'

Clearly the simplest solution is some kind of JSON marker type, but I've no idea how hard this is to define within datafusion?
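To make the ambiguity concrete, here is a minimal sketch in plain Python rather than DataFusion. The `json_get` below is a hypothetical stand-in that always returns a string; it is not the crate's implementation, only an illustration of why a string-only return type loses information.

```python
import json


def json_get(doc: str, key: str) -> str:
    """Hypothetical string-in, string-out lookup (not the crate's API)."""
    value = json.loads(doc)[key]
    # String values come back as their contents;
    # everything else is re-serialized as JSON text.
    return value if isinstance(value, str) else json.dumps(value)


# "foo" holds a nested object.
nested = json_get('{"foo": {"spam": "bar"}}', "foo")
# "foo" holds a string whose contents merely look like JSON.
quoted = json_get(r'{"foo": "{\"spam\": \"bar\"}"}', "foo")

# Both results are the same text, so the caller cannot tell them apart.
assert nested == quoted

# A chained lookup therefore succeeds in both cases, even though
# only the first document actually has an object under "foo".
assert json_get(nested, "spam") == json_get(quoted, "spam") == "bar"
```

With only a string type available, the second chained call silently treats a string value as if it were an object.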

samuelcolvin commented 2 months ago

To be clear, I don't think this is a blocker, just something to think about.

We can get around this ambiguity by providing:

json_get_path(json: str, *key: str | int)

Or maybe just make that the signature of json_get, which would mostly avoid the ambiguity (and the need to parse the JSON twice), I think.
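A rough sketch of how a path-based lookup sidesteps the problem, again in plain Python and only as an illustration. The function body, the choice to return `None` for a missing path, and the refusal to descend into string values are all assumptions here, not decisions taken in this thread.

```python
import json
from typing import Any, Union


def json_get_path(doc: str, *keys: Union[str, int]) -> Any:
    """Hypothetical path lookup: parse once, then walk the keys."""
    current = json.loads(doc)
    for key in keys:
        if isinstance(key, str) and isinstance(current, dict) and key in current:
            current = current[key]
        elif (
            isinstance(key, int)
            and not isinstance(key, bool)
            and isinstance(current, list)
            and 0 <= key < len(current)
        ):
            current = current[key]
        else:
            # Missing key, out-of-range index, or an attempt to index
            # into a scalar such as a string: no match (assumed behaviour).
            return None
    return current


# The nested object is traversed in a single parse.
assert json_get_path('{"foo": {"spam": "bar"}}', "foo", "spam") == "bar"

# A string that only looks like JSON is never re-parsed, so there is no match.
assert json_get_path(r'{"foo": "{\"spam\": \"bar\"}"}', "foo", "spam") is None
```

Because the intermediate value never leaves the parser as a string, the question of whether it "was" an object or a string does not come up for the chained case.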

samuelcolvin commented 2 months ago

I think I have a solution for this using unions...

alamb commented 2 months ago

> I think I have a solution for this using unions...

Yes, I think this is likely the only way to go -- Snowflake uses a VARIANT type for this as I understand: https://docs.snowflake.com/en/sql-reference/data-types-semistructured
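For illustration only, the union idea can be sketched as a tagged result type in plain Python. The class names and `json_get_union` are hypothetical and say nothing about how an Arrow `UnionArray` or Snowflake's VARIANT is actually laid out; the point is just that the tag keeps a string value distinct from a nested object or array.

```python
import json
from dataclasses import dataclass
from typing import Any, Union


@dataclass(frozen=True)
class JsonStr:
    value: str  # contents of a JSON string value


@dataclass(frozen=True)
class JsonObject:
    raw: str  # nested object, kept as JSON text


@dataclass(frozen=True)
class JsonArray:
    raw: str  # nested array, kept as JSON text


def json_get_union(doc: Any, key: Union[str, int]) -> Any:
    """Hypothetical lookup returning a tagged value instead of a bare string."""
    if isinstance(doc, (JsonObject, JsonArray)):
        raw = doc.raw
    elif isinstance(doc, str):
        raw = doc  # top-level input: plain JSON text
    else:
        return None  # JsonStr, numbers, bools and None have no members
    try:
        value = json.loads(raw)[key]
    except (KeyError, IndexError, TypeError):
        return None
    if isinstance(value, str):
        return JsonStr(value)
    if isinstance(value, dict):
        return JsonObject(json.dumps(value))
    if isinstance(value, list):
        return JsonArray(json.dumps(value))
    return value  # int, float, bool or None pass through unchanged


nested = json_get_union('{"foo": {"spam": "bar"}}', "foo")
quoted = json_get_union(r'{"foo": "{\"spam\": \"bar\"}"}', "foo")

# The tags differ even though the underlying text is identical.
assert isinstance(nested, JsonObject) and isinstance(quoted, JsonStr)

# Chaining works on the object and yields no match on the string.
assert json_get_union(nested, "spam") == JsonStr("bar")
assert json_get_union(quoted, "spam") is None
```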

alamb commented 2 months ago

BTW, I am not sure how mature UnionArray support is in DataFusion, but I think several other contributors are interested too.

alamb commented 2 months ago

BTW 2 I think @WenyXu has some other ideas here: https://github.com/apache/datafusion/issues/7845#issuecomment-2068061465

samuelcolvin commented 2 months ago

I think unions solve this provided we can find a solution to https://github.com/apache/datafusion/issues/10180.

samuelcolvin commented 2 months ago

This is solved mostly by rewriting the query.