Closed jaychia closed 3 months ago
Made a ticket to favor json logical types https://github.com/Eventual-Inc/Daft/issues/1993
We shouldn't have to parse a string into a json (like what you have above). I dont think this should be like a struct type. It's very possible that every row has a completely different schema. We should keep it as a string in that case This is what db engines like postgres do as well https://www.postgresql.org/docs/current/datatype-json.html
I dont think this should be like a struct type.
Yup I agree, the intention for the .json.loads()
expression was to go from String -> JSON logical type, not Struct
.
Not too clear why we have to run .json.loads()
since it would be a no-op. typical way for DBs is to run a cast
CAST(text as json)
.cast(daft.DataType.json())
is fine too, I just thought .json.loads()
would be more explicit
Another possible approach could be similar to have .json.get()
be similar to how we do .url.download()
: we don't have a URL logical type, but we just run operations directly on the string type.
Another possible approach could be similar to have .json.get() be similar to how we do .url.download(): we don't have a URL logical type, but we just run operations directly on the string type.
I'm cool with this (should also be less work) but we should just extend it to full jq
syntax with .json.query(QUERY)
For the return type of .json.query(QUERY)
, do we want to standardize it as String? Or would it be better user experience to allow them to specify a return type? E.g if they have a json string like "[1,2,3]"
and query .[0]
, they can specify the return type as int?
For the return type of
.json.query(QUERY)
, do we want to standardize it as String? Or would it be better user experience to allow them to specify a return type? E.g if they have a json string like"[1,2,3]"
and query.[0]
, they can specify the return type as int?
I feel like string should be ok, and then users can cast/coerce the string into the appropriate type after:
"1"
then they can cast to int32"[0, 1, 2]"
then they can cast to list[int32] (though I'm not actually use if this works today)"1/1/2022"
then they can cast to dateAnother question would be around missing index/path handling - do we return null or throw an error? Or perhaps this should also be configurable by the user.
Is your feature request related to a problem? Please describe.
Often users may have columns of JSON-compatible strings.
We should allow for JSON-typed columns:
Which would then enable jq-style accessors