Kotlin / dataframe

Structured data processing in Kotlin
https://kotlin.github.io/dataframe/overview.html
Apache License 2.0
786 stars 50 forks source link

dataFrameOf doesn't recognize @DataSchema instances #177

Closed Jolanrensen closed 5 days ago

Jolanrensen commented 1 year ago

given

@DataSchema
data class Location(
    val name: String,
    val gps: Gps?,
)

@DataSchema
data class Gps(
    val latitude: Double,
    val longitude: Double,
)

val a: DataFrame<Location> = listOf(
    Location("Home", Gps(0.0, 0.0)),
    Location("Away", null),
).toDataFrame()
// ⌌----------------------------------------------------------⌍
// |  | name:String| gps:{latitude:Double?, longitude:Double?}|
// |--|------------|------------------------------------------|
// | 0|        Home|  { latitude:0.000000, longitude:0.0000...|
// | 1|        Away|                                       { }|
// ⌎----------------------------------------------------------⌏
// 
// name: String
// gps:
//     latitude: Double?
//     longitude: Double?

val b: AnyFrame = dataFrameOf("name", "gps")(
    "Home", Gps(0.0, 0.0),
    "Away", null,
)
// ⌌-------------------------------------------------⌍
// |  | name:String|                         gps:Gps?|
// |--|------------|---------------------------------|
// | 0|        Home| Gps(latitude=0.0, longitude=0.0)|
// | 1|        Away|                             null|
// ⌎-------------------------------------------------⌏
// 
// name: String
// gps: Gps?

One would expect a and b to give the same result. However, in a gps is recognized as a Group and in the second as a Value.

2x2xplz commented 1 year ago

I'm not on the JetBrains team, so take this with a grain of salt, but I don't agree with your assumption that a and b should give the same result. Every row in a is specifically defined to be a Location (and can reference its schema), b never states the intended type.

The docs for dataFrameOf state: Returns DataFrame with given column names and values. It seems consistent with those docs that b would define the second column as values (of Gps), and not a ColumnGroup. Meanwhile toDataFrame's docs mention perform deep object graph traversal and convert nested objects into ColumnGroups and FrameColumns, so not surprising to see independent nested cols for latitude and longitude.

Even if it had the capability to do so, wouldn't expect the program to search through all of your available DataSchemas to see which is the best fit... you have to specify the schema somewhere. And now that I say that, it would actually be potentially dangerous to have the program just assume that every pair of String + Gps is automatically a Location, without you specifying that.

Just my 2 cents, I'm still learning more about the dataframe API every day.

Jolanrensen commented 1 year ago

I'm not saying it should recognize it as a Location, but like a { latitude:0.0, longitude:0.0 }. But I see your point to be able to read it as "value". It's a valid point!

2x2xplz commented 1 year ago

It's interesting, a is actually represented by 3 columns -- name, gps/latitude and gps/longitude. .toDataFrame() breaks down Location into its two properties and then it breaks down the Gps type into its two properties and creates nested columns for each. That's actually pretty cool. dataFrameOf() is simpler, it just creates a single value per column, so in this case that value's type is Gps. But it doesn't perform the extra step of creating nested columns from Gps.

image

image

Jolanrensen commented 1 year ago

Exactly! That's why I thought it would (or should) behave the same. Unless you specifically want it to be a Value Column

nikitinas commented 1 year ago

Currently .toDataFrame allows to configure objects traversal, including depth, but dataFrameOf doesn't. If dataFrameOf unfolds all objects marked with @DataShema into ColumnGroups by default, there should be an option to fallback to ValueColumns, because there is no universal way to pack column groups back into objects (although it can be supported for data classes).

So, I agree with suggestion to unpack @DataSchema objects by default, but let's add some DSL for dataFrameOf configuration to make it flexible:

@DataSchema
data class A(val value: Int)

@DataSchema
data class B(val c: C)

@DataSchema
data class C(val value: Int)

val b: AnyFrame = dataFrameOf("a","b")(
    A(1), B(C(2)),
    A(3), null,
) {
  depth = 1 // default depth of object traversal: unfold only top-level objects
  unfold<C>() // force unfolding of `C`
  preserve<A>() // prevent `A` from unfolding
}

// ⌌-------------------------------------------------⌍
// |  | a:A        |         b:{c:{value: Int}}      |
// |--|------------|---------------------------------|
// | 0| A(value=1) | { c: {value:2}}                 |
// | 1| A(value=3) | { c: {value: null}}             |
// ⌎-------------------------------------------------⌏

Default depth will be null that means that all @DataSchemaobjects should be unfolded and all other objects should be preserved. depth = 0 means that no objects should be unfolded and @DataSchema annotation should be ignored.

nikitinas commented 1 year ago

.toDataFrame and unfold should have the same behaviour and configuration DSLs

pacher commented 1 year ago

Just my +1 for configuration option.

Unless you specifically want it to be a Value Column

But how can I achieve this if it just unfolds everything by default?

Use case: I have a data class representing a set of parameters for a measurement. So first I want to group results by parameters, but later after some processing I would like to unfold parameter set to distinct columns of parameters. So it make sense to me to annotate my parameter set with DataSchema.

+1 @nikitinas I can always call unfold on a column, but I can't fold it back if it is unfolded automagically for me.

Jolanrensen commented 5 days ago

Now we have DataRowSchema which allows inheriting classes to be unfolded into dataframes with dataFrameOf I think we can close this solution.

toDataFrame() can convert a list of objects to the depth the user defines, which is not possible in dataFrameOf()() so we cannot assume that's what the user wants.