GlareDB / glaredb

GlareDB: An analytics DBMS for distributed data
https://glaredb.com
GNU Affero General Public License v3.0
567 stars 36 forks source link

Feature request: More support for exploring json/non-tabular data #3038

Open talagluck opened 2 weeks ago

talagluck commented 2 weeks ago

Description

It would be really nice if we could provide more utilities for data exploration of json/non-tabular data. Something like the ability to unpack nested data in a column in a query. The fact that we enable SQL for joining tabular with non-tabular data is really cool, but I think a lot of JSON doesn't really lend itself to this without doing additional processing work beforehand.

Given a JSON with a list of GH users for instance:

[
    {
            "starred_at": "2024-05-13T14:33:25Z",
            "user": {
                "login": "talagluck",
                "id": 15267635,
                "node_id": "MDQ6VXNlcjE1MjY3NjM1",
                "avatar_url": "https://avatars.githubusercontent.com/u/15267635?v=4",
                "gravatar_id": "",
                "url": "https://api.github.com/users/talagluck",
                "html_url": "https://github.com/talagluck",
                "followers_url": "https://api.github.com/users/talagluck/followers",
                "following_url": "https://api.github.com/users/talagluck/following{/other_user}",
                "gists_url": "https://api.github.com/users/talagluck/gists{/gist_id}",
                "starred_url": "https://api.github.com/users/talagluck/starred{/owner}{/repo}",
                "subscriptions_url": "https://api.github.com/users/talagluck/subscriptions",
                "organizations_url": "https://api.github.com/users/talagluck/orgs",
                "repos_url": "https://api.github.com/users/talagluck/repos",
                "events_url": "https://api.github.com/users/talagluck/events{/privacy}",
                "received_events_url": "https://api.github.com/users/talagluck/received_events",
                "type": "User",
                "site_admin": false
            }
        },
{
        "starred_at": "2023-05-26T22:24:04Z",
        "user": {
            "login": "scsmithr",
            "id": 4040560,
            "node_id": "MDQ6VXNlcjQwNDA1NjA=",
            "avatar_url": "https://avatars.githubusercontent.com/u/4040560?v=4",
            "gravatar_id": "",
            "url": "https://api.github.com/users/scsmithr",
            "html_url": "https://github.com/scsmithr",
            "followers_url": "https://api.github.com/users/scsmithr/followers",
            "following_url": "https://api.github.com/users/scsmithr/following{/other_user}",
            "gists_url": "https://api.github.com/users/scsmithr/gists{/gist_id}",
            "starred_url": "https://api.github.com/users/scsmithr/starred{/owner}{/repo}",
            "subscriptions_url": "https://api.github.com/users/scsmithr/subscriptions",
            "organizations_url": "https://api.github.com/users/scsmithr/orgs",
            "repos_url": "https://api.github.com/users/scsmithr/repos",
            "events_url": "https://api.github.com/users/scsmithr/events{/privacy}",
            "received_events_url": "https://api.github.com/users/scsmithr/received_events",
            "type": "User",
            "site_admin": false
        }
    }
]

Currently, each user will be a row, but the rest of the user info will all be crammed into a single column. It would be nice to do something equivalent to this Pandas code:

with open('stargazers.json', 'r') as f:
    data = json.load(f)
flattened_data = []
for entry in data:
    user_data = entry['user']
    user_data['starred_at'] = entry['starred_at']
    flattened_data.append(user_data)

# Create DataFrame
df = pd.DataFrame(flattened_data)

e.g. maybe

    SELECT starred_at, flatten(user) FROM 'github_users.json' 

That's pretty hacky, and I know we're not going to cover every situation, but if we could address a couple of common formats, that could be good. And maybe we go with normalize instead of flatten.

talagluck commented 2 weeks ago

DuckDB has a json extension which might be worth looking at.

tychoish commented 2 weeks ago

I've tried to write a "flatten" function in the past. (I called it unwind) that would take a field with a JSON or similar structured and turn it into a struct field. The problem, however, is that scalar functions in DataFusion have to have consistent/known output types (and you can't have a function that returns an arbitrary compound type.

The alternative in the short term is to provide jaq support in a function.. which would let you query a json object (with jq like syntax) to project out another value. It'd probably have to always return a string (which you could then use the arrow_cast function.

I'd previously pushed back on using jaq over jq because without a lack of specification, I didn't want to set false expectations given different implementations and edge cases. The problem with using JQ directly is that it has an inlining feature that can read from the filesystem, which is a security hole that I don't think we can plug safely.

I think we should definitely check with @scsmithr that the new function stuff he's been working on will not get in the way of these kinds of casting operations.