Kotlin / dataframe

Structured data processing in Kotlin
https://kotlin.github.io/dataframe/overview.html
Apache License 2.0

readJson with keyValuePaths parameter produces unexpected DataFrame #567

Open koperagen opened 8 months ago

koperagen commented 8 months ago

With this data, https://covid.ourworldindata.org/data/owid-covid-data.json, which is surprisingly a "wide" JSON, I found that keyValuePaths can be helpful. I tried this:

val df = DataFrame.readJson(
    "https://covid.ourworldindata.org/data/owid-covid-data.json",
    keyValuePaths = listOf(JsonPath())
)


Much better than the original, which cannot even be compiled in notebooks, but to get the schema I want I need to call df.explode().ungroup("value").

@Jolanrensen is my JsonPath wrong?

koperagen commented 8 months ago

After we figure this out, the documentation for readJson should be updated, because right now it hangs notebooks when called without keyValuePaths. The schema is just way too big.

Jolanrensen commented 8 months ago

Your keyValuePaths are correct :) Just one thing to note is that ColumnGroups (including DataFrames) at the given key-value path are wrapped into FrameColumns. This makes sense if it involves a keyValuePath at a lower level, however, if you give a top-level path, this means the entire dataframe will be inside a frame column too. So simply take

val df = DataFrame.readJson(
    "https://covid.ourworldindata.org/data/owid-covid-data.json",
    keyValuePaths = listOf(JsonPath())
)[0][0]

and you're good to go :)
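To make the unwrapping step concrete, here is a minimal sketch on a tiny in-memory JSON instead of the full OWID file. The helper name readTopLevelKeyValue and the sample values are made up for illustration, and the import for JsonPath may differ between DataFrame versions. Since indexing a row by column position returns Any?, the sketch assumes an explicit cast is needed to get a DataFrame back:

```kotlin
import org.jetbrains.kotlinx.dataframe.AnyFrame
import org.jetbrains.kotlinx.dataframe.DataFrame
import org.jetbrains.kotlinx.dataframe.api.JsonPath // assumption: package may vary by version
import org.jetbrains.kotlinx.dataframe.io.readJsonStr

// Hypothetical helper (not part of the library): reads a "wide" JSON object
// with a top-level key-value path and unwraps the single frame-column cell.
fun readTopLevelKeyValue(json: String): AnyFrame {
    // JsonPath() defaults to the top-level path "$"
    val wrapped = DataFrame.readJsonStr(json, keyValuePaths = listOf(JsonPath()))
    // First row, first column: the cell that holds the actual key/value dataframe
    val cell = wrapped[0][0]
    return cell as? AnyFrame
        ?: error("Expected a DataFrame in the first cell, got: $cell")
}

fun main() {
    // Made-up sample in the same "wide" shape: one top-level key per country
    val json = """
        {
          "AAA": {"continent": "X", "population": 1},
          "BBB": {"continent": "Y", "population": 2}
        }
    """.trimIndent()

    val countries = readTopLevelKeyValue(json)
    println(countries)
}
```

With this shape, each top-level key should end up as one row rather than one column, which is what keeps the schema small.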

Funny thing is, this is already done automatically in the OpenAPI schema generation: if the keyValuePaths include the top-level path ("$"), ["value"].first() is taken from the read dataframe before converting it to the expected type.

Maybe we could move this logic to readJson instead :)

koperagen commented 8 months ago

Yes, let's "unwrap" top level key-value in readJson