StarRocks / starrocks

The world's fastest open query engine for sub-second analytics both on and off the data lakehouse. With the flexibility to support nearly any scenario, StarRocks provides best-in-class performance for multi-dimensional analytics, real-time analytics, and ad-hoc queries. A Linux Foundation project.
https://starrocks.io
Apache License 2.0
9.01k stars 1.81k forks source link

Support converting JSON to STRUCT #43101

Open murphyatwork opened 7 months ago

murphyatwork commented 7 months ago

Feature request

Is your feature request related to a problem? Please describe.

STRUCT type is more performant that JSON in terms of query, but which use the fixed schema and annoying to construct.

If it's possible to convert JSON to STRUCT directly, like to_struct(JSON), it can be much easier to use.

Describe the solution you'd like

  1. Use a function to convert JSON to STRUCT:
    • Example: select to_struct(parse_json("{"a": 1}"))
    • Difficulty: cannot inference the schema of struct before parsing the JSON
  2. Specify the STRUCT schema and convert:
    • Example: select to_struct<a int, b int>(parse_json("a": 1, "b": 2))
    • Difficulty: not easy to use

Describe alternatives you've considered

Additional context

jaogoy commented 7 months ago

Maybe, we could implement the second one first, and try to do the first one in the future (if needed), which will use the implementation for the second one.

kyle-goodale-klaviyo commented 1 month ago

+1 on this issue. We are currently evaluating StarRocks as a replacement for Trino. However, we very much rely on being able to cast JSON values to Structs at query time which a feature supported by Trino but not by StarRocks.

Using named_struct and parse_json as well as json extract functions works for simple objects decently enough. However, json objects that contain arrays of objects become very challenging to extract if you want to preserve the relations of keys in the inner objects and not flatten the data.

Dshadowzh commented 1 month ago

Thanks, @kyle-goodale-klaviyo . This issue focuses on stream load, and we have implemented the conversion of JSON to structs during data ingestion. Could you share some examples (SQL snippets) demonstrating how to use Trino to convert JSON values into structs?
CC @wangsimo0

kyle-goodale-klaviyo commented 1 month ago

@Dshadowzh Absolutely, here is a minimal example that includes nested objects, as well as arrays of objects:

SELECT 
   CAST(json_parse(
   '{"a": 5, "b_obj": {"bb": true}, "b_arr_obj": [{"foo": "bar"}, {"foo": "baz"}]}'
   ) AS ROW(
        a int,
        b_obj ROW(
            bb boolean
        ),
        b_arr_obj ARRAY(
            ROW(
                foo varchar
            )
        )
    )
   )

Trino calls structs rows, but they are the same thing. One of the important parts is that when casting an array of objects it preserves them as arrays of structs. Currently, I can somewhat get around this issue in StarRocks by using named struct and extracting the json values manually.

StarRocks example:

named_struct(
    'first_name', get_json_string(`data`, '$.first_name'),
    'last_name', get_json_string(`data`, '$.last_name')
)

This works great for objects and even nested objects. However, this breaks down when you encounter an array of objects if you wish to preserve the inner objects as structs within an array. This method of extraction requires that you flatten everything as there is no way to traverse the json lists and extract values out while maintaining there relations to one another.

Please do let me know if I can provide any more examples or clarify the issue further!

murphyatwork commented 1 month ago

@Dshadowzh Absolutely, here is a minimal example that includes nested objects, as well as arrays of objects:

SELECT 
   CAST(json_parse(
   '{"a": 5, "b_obj": {"bb": true}, "b_arr_obj": [{"foo": "bar"}, {"foo": "baz"}]}'
   ) AS ROW(
        a int,
        b_obj ROW(
            bb boolean
        ),
        b_arr_obj ARRAY(
            ROW(
                foo varchar
            )
        )
    )
   )

Trino calls structs rows, but they are the same thing. One of the important parts is that when casting an array of objects it preserves them as arrays of structs. Currently, I can somewhat get around this issue in StarRocks by using named struct and extracting the json values manually.

StarRocks example:

named_struct(
    'first_name', get_json_string(`data`, '$.first_name'),
    'last_name', get_json_string(`data`, '$.last_name')
)

This works great for objects and even nested objects. However, this breaks down when you encounter an array of objects if you wish to preserve the inner objects as structs within an array. This method of extraction requires that you flatten everything as there is no way to traverse the json lists and extract values out while maintaining there relations to one another.

Please do let me know if I can provide any more examples or clarify the issue further!

it's better to implement a native function converting JSON ARRAY to ARRAY to accomplish this, otherwise it would be pretty verbose