Kotlin / dataframe

Structured data processing in Kotlin
https://kotlin.github.io/dataframe/overview.html
Apache License 2.0

Auto flatten FrameColumns after unfold and aggregate. #232

Open holgerbrandl opened 1 year ago

holgerbrandl commented 1 year ago

When using unfold or groupBy aggregations, it always feels a bit cumbersome to call flatten on the result to pull the columns out of the FrameColumn.

I'd argue that by default the user always expects a flattened result because that's the behavior in R and Python.

pacher commented 1 year ago

I'd argue that it is easy to call flatten, but it is much harder to put the columns back into a FrameColumn for those of us (me included) who need it. At a minimum this should be configurable, something like groupBy(flatten = false) {}

P.S. I could not find a hold or even a fold operator and am really curious about it :)

holgerbrandl commented 1 year ago

Clearly, having sub-data-frames is a nice concept that probably has its applications, so I'd also favor a configuration parameter here, which should imho default to true. This is because in typical exploratory analysis protocols, I at least use dozens of aggregate calls (or currently their equivalent summarize, since I still do them in R). When doing so, I never felt the need to keep the aggregates separate from the remaining columns.

I'd think that the setting would rather need to be added when doing the aggregation:

    irisData
        .groupBy("Species")
        .aggregate(flatten = false) {  // <- config option
            mean() into "mean"
            std() into "sd"
        }

PS unfold was added in v0.9

Jolanrensen commented 1 year ago

Well, in comparison to R and Python, DataFrame does have hierarchical data frames, so it makes more sense in the context of DataFrame to create a nested data frame here. Like @pacher said, it's more difficult to put it back after flattening than it is to just .flatten() the result. That said, a configuration option that defaults to false is not out of the question.

holgerbrandl commented 1 year ago

The concept of nested data (https://tidyr.tidyverse.org/articles/nest.html which is a great intro into the subject btw) is clearly beautiful. Though, in my experience, it's rarely used among data scientists who often prefer sticking to simple flat tables. When using nesting, it's always something I do intentionally and would never expect as a default.

pacher commented 1 year ago

We all have different backgrounds, use cases and preferred workflows. I use kdf with collections of objects in kotlin JVM and for me the hierarchical nature of kdf is really awesome and spot-on. So in my case unfolding or flattening is something I do intentionally and I would argue against doing it automatically just because there is no universal way to undo it. As you said, hierarchical frames are beautiful, but I guess there is a bit of a learning curve to fully appreciate them.

Back to the topic

  1. We can add a config option to AggregateDsl instead of making it a parameter of the aggregate function
  2. We can add a bunch of shortcuts, something like unfoldFlat() with a better name which is just unfold().flatten(). I think this is pretty common practice in kotlin standard libraries.

holgerbrandl commented 1 year ago

Sure, any API needs to be opinionated. I intentionally try to work out a data-scientist-who-is-now-but-interested-in-kotlin perspective here, that is, a less technical but more R/Python-based background. I guess without attracting these folks, kotlin will face a very hard time becoming competitive in data-science.

Another argument for default-flattening: aggregate has imho no consistent contract as the output structure depends on the number of aggregates. For 1 it's a value-column, for multiple ones it's a column-group. This type of inconsistency is confusing and imho should be avoided. It's very similar to the classical base-R confusion example:

df[, "a"] returns a vector while df[, c("a", "b")] returns a data.frame

Not sure if more utility functions improve readability: Calling it aggregateFlat lacks beauty in my taste compared to dplyr/pandas.

pacher commented 1 year ago

I understand and respect your perspective with the Python data scientist in mind. It was clear from the previous answer. But we have a very common, old and hard dilemma here. Of course the famous quote of Henry Ford immediately comes to mind

If I had asked people what they wanted, they would have said faster horses.

But I am getting off-topic again.

aggregate has imho no consistent contract as the output structure depends on the number of aggregates. For 1 it's a value-column, for multiple ones it's a column-group.

I find it very consistent that new columns don't appear without explicit add or something like flatten. Unfold and aggregate are transformations of the columns, which take a column and produce a column. Together with convert and update, they are just more specific versions of map/replace. I see ColumnGroup as a single column (it inherits BaseColumn) just like value or frame column. I would find it confusing if List.map could produce more elements than in the original list. Unlike List.flatMap. So if transformations output more than one column, it grows depth-first to keep the number of columns the same unless specifically told not to.
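The List.map analogy can be sketched with the standard library alone. The `Col`, `Value`, and `Group` types and both functions below are a hypothetical toy model, not the Kotlin DataFrame API; they only illustrate how a transformation that yields several results can grow depth-first and keep the top-level column count unchanged, while an explicit flatten is what widens the table.

```kotlin
// Toy model of columns; these types are illustrative assumptions,
// not the real Kotlin DataFrame classes.
sealed interface Col { val name: String }
data class Value(override val name: String) : Col
data class Group(override val name: String, val children: List<Col>) : Col

// A transformation in the sense described above: one column in, one column out.
// The target column is replaced by a group that holds two results.
fun aggregateToy(cols: List<Col>, target: String): List<Col> =
    cols.map { c ->
        if (c.name == target) Group(c.name, listOf(Value("mean"), Value("sd"))) else c
    }

// Explicit flattening: groups are dissolved, so the top-level count can grow.
fun flattenToy(cols: List<Col>): List<Col> =
    cols.flatMap { c ->
        when (c) {
            is Value -> listOf(c)
            is Group -> flattenToy(c.children)
        }
    }

fun main() {
    val cols = listOf(Value("Species"), Value("Sepal.Length"))
    val nested = aggregateToy(cols, "Sepal.Length")
    println(nested.size)              // same number of top-level columns as the input
    println(flattenToy(nested).size)  // one more column after flattening
}
```

In this model `aggregateToy` behaves like `map` (the size is preserved) and `flattenToy` behaves like `flatMap` (the size may grow), which is the distinction drawn in the comment above.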

Not sure if more utility functions improve readability: Calling it aggregateFlat lacks beauty in my taste compared to dplyr/pandas.

I agree. And aggregateFlat is a horrible name. I was just throwing in options. In kotlin coroutines there is flatMapMerge(f) = map(f).flattenMerge(), flatMapConcat(f) = map(f).flattenConcat() etc. I think I've seen similar shortcuts with .flatten() even in stdlib, just don't remember.
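For reference, the standard library does follow this pattern on collections: `flatMap(f)` gives the same result as `map(f).flatten()`. A minimal sketch; `mapThenFlatten` and `flatMapShortcut` are hypothetical wrapper names introduced here only to put the two forms side by side.

```kotlin
// Two-step version: map each element to a list, then flatten the list of lists.
fun <T, R> mapThenFlatten(xs: List<T>, f: (T) -> List<R>): List<R> =
    xs.map(f).flatten()

// One-step shortcut provided by the standard library.
fun <T, R> flatMapShortcut(xs: List<T>, f: (T) -> List<R>): List<R> =
    xs.flatMap { f(it) }

fun main() {
    val words = listOf("ab", "cd")
    val toChars: (String) -> List<Char> = { it.toList() }
    // Both calls produce the same list of characters.
    println(mapThenFlatten(words, toChars) == flatMapShortcut(words, toChars))
}
```

A shortcut such as unfoldFlat() would be the same kind of composition, just applied to data frame operations instead of lists.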

Disclaimer: I am not a developer of kdf, not even a data scientist. Just a passer-by speculating on the topic. I might change my opinion after doing dozens of aggregations per day like you.

Jolanrensen commented 1 year ago

@holgerbrandl Actually, while we would of course welcome data library users from other programming paradigms such as Python or R, that's not our intended target audience per se. We are more interested in reaching folks that know Kotlin already, enjoy its syntax and features, and have started to gain an interest in Data Science. These days, Data Science is becoming bigger and more common. So, if your application requires some data wrangling, it makes sense that you don't want to learn a whole new language + export/import just to modify your data a bit. That's where DataFrame comes in! Not to convert Python Data Scientists to use Kotlin, but to convert Kotlin users to Data Scientists :)

holgerbrandl commented 1 year ago

Hmm, I was hoping for more ambition than some data wrangling + import/export, but thanks for the clarification. :-/

I keep my fingers crossed that the project doesn't fall into the same trap as many other JVM libs that are technical beauties but lack the pragmatism to be as useful as more simplistic counterparts built for scripting languages. Learning and adopting design patterns (such as flat tables as the main workhorse in table processing APIs) from years of community work and consolidation in data-science stacks in R or Python could clearly be helpful here.

I hope I did not come across too negative in this thread, but I at least wanted to share my impressions & concerns.

Jolanrensen commented 1 year ago

Well of course we are also focussing on data exploration and representation, combined with explicit integration with tooling and the IDE. For data representation, the types, nullability, and also hierarchy are IMO solid and unique strengths of DataFrame. If you print out your table in the console, a flat table is preferable, but if you can click around and collapse/expand cells (which we can do), hierarchy is awesome. But of course, we want to strike a balance between this exploration, onboarding Kotlin users, and data science, so your input is definitely appreciated :)

nikitinas commented 1 year ago

As @pacher and @Jolanrensen mentioned, Kotlin DataFrame is balancing between common data science use cases and more general/functional API for hierarchical data structures, similar to Kotlin collections of objects.

I think we should keep the basic DataFrame API general and consistent and avoid excessive shortcut functions, such as aggregateFlat. Such shortcuts may be not only background-specific (e.g. R data scientist vs. Kotlin data engineer), but also domain-specific, so users can easily create them and store them in a util library to reuse across several data projects. We can provide a tutorial on how to do that.

We can also provide preconfigured sets of shortcut functions as separate libs, e.g. dataframe-r with a more familiar dataframe API for R data scientists.