Kotlin / dataframe

Structured data processing in Kotlin
https://kotlin.github.io/dataframe/overview.html
Apache License 2.0

KTNB-693 Send the full dataframe schema as metadata #706

Closed cmelchior closed 1 month ago

cmelchior commented 1 month ago

This PR adds the infrastructure needed for https://youtrack.jetbrains.com/issue/KTNB-693/Enable-AI-Actions-for-DataFrames-in-Kotlin-Notebooks. We currently have no reliable way to detect column types, and that information is needed when creating prompts for the AI Assistant.

It adds a new "types" property to the top-level "metadata" as well as recursively on each row so it is possible to easily identify column types.

A "columns" property has also been added to the ColumnGroup and FrameColumn metadata. It contains the nested column names, similar to the top-level "columns" property.

Example:

val col1 by columnOf("a", "b", "c")
val col2 by columnOf(1, 2, 3)
val col3 by columnOf("Foo", "Bar", null)
val df2 = dataFrameOf(Pair("header", listOf("A", "B", "C")))
val col4 by columnOf(df2, df2, df2)
// group() returns a new DataFrame, so keep the result instead of discarding it
val df = dataFrameOf(col1, col2, col3, col4)
    .group(col1, col2).into("group")
{
   ...
             {
              "$version": "2.1.0",
              "metadata": {
                "columns": ["group", "col3", "col4"],
                "types": [{
                  "kind": "ColumnGroup"
                }, {
                  "kind": "ValueColumn",
                  "type": "kotlin.String?"
                }, {
                  "kind": "FrameColumn"
                }],
                "nrow": 3,
                "ncol": 3
              },
              "kotlin_dataframe": [{
                "group": {
                  "data": {
                    "col1": "a",
                    "col2": 1
                  },
                  "metadata": {
                    "kind": "ColumnGroup",
                    "columns": ["col1", "col2"],
                    "types": [{
                      "kind": "ValueColumn",
                      "type": "kotlin.String"
                    }, {
                      "kind": "ValueColumn",
                      "type": "kotlin.Int"
                    }]
                  }
                },
                "col3": "Foo",
                "col4": {
                  "data": [{
                    "header": "A"
                  }, {
                    "header": "B"
                  }, {
                    "header": "C"
                  }],
                  "metadata": {
                    "kind": "FrameColumn",
                    "columns": ["header"],
                    "types": [{
                      "kind": "ValueColumn",
                      "type": "kotlin.String"
                    }],
                    "ncol": 1,
                    "nrow": 3
                  }
                }
              }, {
                "group": {
                  "data": {
                    "col1": "b",
                    "col2": 2
                  },
                  "metadata": {
                    "kind": "ColumnGroup",
                    "columns": ["col1", "col2"],
                    "types": [{
                      "kind": "ValueColumn",
                      "type": "kotlin.String"
                    }, {
                      "kind": "ValueColumn",
                      "type": "kotlin.Int"
                    }]
                  }
                },
                "col3": "Bar",
                "col4": {
                  "data": [{
                    "header": "A"
                  }, {
                    "header": "B"
                  }, {
                    "header": "C"
                  }],
                  "metadata": {
                    "kind": "FrameColumn",
                    "columns": ["header"],
                    "types": [{
                      "kind": "ValueColumn",
                      "type": "kotlin.String"
                    }],
                    "ncol": 1,
                    "nrow": 3
                  }
                }
              }, {
                "group": {
                  "data": {
                    "col1": "c",
                    "col2": 3
                  },
                  "metadata": {
                    "kind": "ColumnGroup",
                    "columns": ["col1", "col2"],
                    "types": [{
                      "kind": "ValueColumn",
                      "type": "kotlin.String"
                    }, {
                      "kind": "ValueColumn",
                      "type": "kotlin.Int"
                    }]
                  }
                },
                "col3": null,
                "col4": {
                  "data": [{
                    "header": "A"
                  }, {
                    "header": "B"
                  }, {
                    "header": "C"
                  }],
                  "metadata": {
                    "kind": "FrameColumn",
                    "columns": ["header"],
                    "types": [{
                      "kind": "ValueColumn",
                      "type": "kotlin.String"
                    }],
                    "ncol": 1,
                    "nrow": 3
                  }
                }
              }]
            }
}
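
A consumer of this JSON could pair each entry in "columns" with the entry at the same index in "types". The sketch below uses plain Kotlin and hypothetical classes (`ColumnTypeInfo`, `FrameMetadata`, `columnTypes`) that are not part of the DataFrame library. It assumes the two arrays are index-aligned, as they appear to be in the example above.

```kotlin
// Hypothetical model of one entry in the "types" array; not a library class.
data class ColumnTypeInfo(val kind: String, val type: String? = null)

// Hypothetical model of the top-level "metadata" object from the example.
data class FrameMetadata(
    val columns: List<String>,
    val types: List<ColumnTypeInfo>,
    val nrow: Int,
    val ncol: Int,
)

// Pair each column name with the type entry at the same index.
// Assumption: "columns" and "types" are index-aligned.
fun columnTypes(metadata: FrameMetadata): Map<String, ColumnTypeInfo> {
    require(metadata.columns.size == metadata.types.size) {
        "columns and types must have the same length"
    }
    return metadata.columns.zip(metadata.types).toMap()
}

fun main() {
    // Values copied from the top-level metadata in the example above.
    val metadata = FrameMetadata(
        columns = listOf("group", "col3", "col4"),
        types = listOf(
            ColumnTypeInfo("ColumnGroup"),
            ColumnTypeInfo("ValueColumn", "kotlin.String?"),
            ColumnTypeInfo("FrameColumn"),
        ),
        nrow = 3,
        ncol = 3,
    )
    for ((name, info) in columnTypes(metadata)) {
        val suffix = info.type?.let { " ($it)" } ?: ""
        println("$name: ${info.kind}$suffix")
    }
    // group: ColumnGroup
    // col3: ValueColumn (kotlin.String?)
    // col4: FrameColumn
}
```

Only ValueColumn entries carry a "type" in the example, which is why `type` is nullable in this sketch.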

ermolenkodev commented 1 month ago

There is a problem when a FrameColumn contains frames with different schemas. I recommend attaching types to the metadata of each nested frame. This may lead to duplication when every nested frame has the same schema, but it will make the data easier to work with on the Kotlin Notebook plugin side. We already have a lot of duplication because we pass column names for each value in rows, so the additional overhead will be minimal.

Here is a short reproducer of the problem:

dataFrameOf("a", "b")(1, dataFrameOf("c", "d")(1, 2), 2, dataFrameOf("e", "f")(1, 2))
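
To make the issue concrete, here is a minimal sketch in plain Kotlin that does not use the DataFrame library. `NestedFrame` and `hasUniformSchema` are hypothetical names introduced only for illustration. The sketch mirrors the reproducer in this comment, where column "b" holds one frame with columns "c" and "d" and another with columns "e" and "f", so a single column-level schema cannot describe both rows.

```kotlin
// Hypothetical stand-in for a nested data frame: column names and their type names.
data class NestedFrame(val columns: List<String>, val types: List<String>)

// True only if every nested frame shares the same columns and types,
// which is what a single schema attached to the whole column would assume.
fun hasUniformSchema(frames: List<NestedFrame>): Boolean =
    frames.distinct().size <= 1

fun main() {
    // Mirrors column "b" of the reproducer: two frames with different column names.
    val columnB = listOf(
        NestedFrame(listOf("c", "d"), listOf("kotlin.Int", "kotlin.Int")),
        NestedFrame(listOf("e", "f"), listOf("kotlin.Int", "kotlin.Int")),
    )
    println(hasUniformSchema(columnB)) // false

    // Attaching the schema to each nested frame stays correct in both cases.
    columnB.forEachIndexed { row, frame ->
        println("row $row: ${frame.columns.zip(frame.types)}")
    }
}
```

Carrying the schema with each nested frame repeats information when all frames match, which is the duplication trade-off described in this comment.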

cmelchior commented 1 month ago

@ermolenkodev I see your point. I overlooked that each row of a FrameColumn can hold a data frame with a different schema. You are right, it is probably better to have the schema as part of the metadata inside the data frame content.

I'll refactor it.

cmelchior commented 1 month ago

After some discussion with @ermolenkodev, we decided to rework the metadata a little. I have updated the PR and its description, so it should be ready for a second round of review.