Eventual-Inc / Daft

Distributed DataFrame for Python designed for the cloud, powered by Rust
https://getdaft.io
Apache License 2.0
1.82k stars 113 forks source link

Add a new JSON column type and jq accessor method #1995

Closed jaychia closed 3 months ago

jaychia commented 3 months ago

Is your feature request related to a problem? Please describe.

Often users may have columns of JSON-compatible strings.

df = daft.from_pydict({
    "my_json_strings": [
        '{"foo": {"bar": 1}}',
        '{"foo": {"bar": 2}}',
        '{"foo": {"bar": 3}}',
    ]
})

We should allow for JSON-typed columns:

df = df.with_column("my_json", df["my_json_strings"].json.loads())

Which would then enable jq-style accessors

df = df.with_column("extracted_bar", df["my_json"].json.get(".foo.bar"))
samster25 commented 3 months ago

Made a ticket to favor json logical types https://github.com/Eventual-Inc/Daft/issues/1993

samster25 commented 3 months ago

We shouldn't have to parse a string into a json (like what you have above). I dont think this should be like a struct type. It's very possible that every row has a completely different schema. We should keep it as a string in that case This is what db engines like postgres do as well https://www.postgresql.org/docs/current/datatype-json.html

jaychia commented 3 months ago

I dont think this should be like a struct type.

Yup I agree, the intention for the .json.loads() expression was to go from String -> JSON logical type, not Struct.

samster25 commented 3 months ago

Not too clear why we have to run .json.loads() since it would be a no-op. typical way for DBs is to run a cast CAST(text as json)

jaychia commented 3 months ago

.cast(daft.DataType.json()) is fine too, I just thought .json.loads() would be more explicit

Another possible approach could be similar to have .json.get() be similar to how we do .url.download(): we don't have a URL logical type, but we just run operations directly on the string type.

samster25 commented 3 months ago

Another possible approach could be similar to have .json.get() be similar to how we do .url.download(): we don't have a URL logical type, but we just run operations directly on the string type.

I'm cool with this (should also be less work) but we should just extend it to full jq syntax with .json.query(QUERY)

colin-ho commented 3 months ago

For the return type of .json.query(QUERY), do we want to standardize it as String? Or would it be better user experience to allow them to specify a return type? E.g if they have a json string like "[1,2,3]" and query .[0], they can specify the return type as int?

jaychia commented 3 months ago

For the return type of .json.query(QUERY), do we want to standardize it as String? Or would it be better user experience to allow them to specify a return type? E.g if they have a json string like "[1,2,3]" and query .[0], they can specify the return type as int?

I feel like string should be ok, and then users can cast/coerce the string into the appropriate type after:

Another question would be around missing index/path handling - do we return null or throw an error? Or perhaps this should also be configurable by the user.