Kotlin / dataframe

Structured data processing in Kotlin
https://kotlin.github.io/dataframe/overview.html
Apache License 2.0
761 stars 48 forks source link

Extend "unfold" operation and support it in the compiler plugin #742

Open koperagen opened 1 week ago

koperagen commented 1 week ago

It covers two interesting use cases:

  1. Replace column with multiple, potentially nested. Provides DSL similar to add for that
  2. More fine-grained toDataFrame. Instead of converting 20-30 properties to 2-3 level of nesting all at once user can choose to convert toDataFrame(maxDepth = 0) and unfold required properties to whatever level they need

On the compiler plugin side i will continue to support other overloads later in different PR

github-actions[bot] commented 1 week ago

Generated sources will be updated after merging this PR. Please inspect the changes in here.

koperagen commented 5 days ago

Unfold itself, by definition, already replaces a column with a new column

What definition? Originally i wanted a add replace by function with AddDsl. Then i remembered that toDataFrame DSL is pretty similar to AddDsl and we already have unfold function, it just lacks an overload with a DSL. Such an overload must be a multiplex operation, right? unfold by is interesting, but it will have exactly one operation probably forever. Sounds good, right, but its semantics is no different from replace. I'd rather have one entry point (replace) for situations when you want to, like, replace a column with a different one.

Jolanrensen commented 5 days ago

What definition? Originally i wanted a add replace by function with AddDsl. Then i remembered that toDataFrame DSL is pretty similar to AddDsl and we already have unfold function, it just lacks an overload with a DSL. Such an overload must be a multiplex operation, right? unfold by is interesting, but it will have exactly one operation probably forever. Sounds good, right, but its semantics is no different from replace. I'd rather have one entry point (replace) for situations when you want to, like, replace a column with a different one.

I meant the definition of "unfolding", you're unfolding a column with its contents, so its type is bound to change, aka, the column is replaced.

I can see where you're coming from, but it still may be hard for users to have two different ways to unfold, namely .replace {}.unfold {} and .unfold() and both work a little bit different and have different arguments.

So I'd either:

wdyt?

koperagen commented 4 days ago

I'd say df.unfold should stay, because use case for simply unfolding a column with objects stays. Worth to add df.unfold(maxDepth = , roots = ) overload too, missed it.

So I'd either: or make just df.unfold {} more powerful

Please write this API with needed overloads and a few examples of its usage then

Jolanrensen commented 7 hours ago

I'd say df.unfold should stay, because use case for simply unfolding a column with objects stays. Worth to add df.unfold(maxDepth = , roots = ) overload too, missed it.

So I'd either: or make just df.unfold {} more powerful

Please write this API with needed overloads and a few examples of its usage then

I made a little sample of unfold {}.by {} with the same options as your version. Both df.unfold {} and df.unfold {}.by {} can be used. See the commit here for more details: https://github.com/Kotlin/dataframe/commit/20cab7a12968babcb7f9d8ce8a24e3ca0917eda1

koperagen commented 4 hours ago

UnfoldingDataFrame looks good, we can use it. Supporting it on the plugin side will require some changes, but it's ok. I'm only worried with return type being different than DataFrame people will be tempted to call df.unfold { col }.by(). It's first time intermediate object in a multiplex operation is also a DataFrame. Or the opposite problem: df.unfold { col }. will print all DataFrame API, with by being somewhere in completion list no different than let's say filter But if we go this route, i'd also add df.replace {}.by(CreateDataFrameDsl) (only this, without vararg props: KProperty<*> and maxDepth: Int = 0 overloads)

Jolanrensen commented 3 hours ago

@koperagen Yes, I couldn't find another way to have a notation with 2 selectors and the second one being optional while keeping the DataFrame DSL style :/ but indeed, it's something new. A bit like .recursively().

Actually, since by is defined on UnfoldingDataFrame it should appear quite high on the list: image

We could also put it inside the class to make it even more discoverable.

Alternatively, we could change the return-type of unfold to something non-dataframe-ish and let people call df.unfold { a }.byReplacing() or something.

Imagine that XD df.replace { a }.byUnfolding() and df.unfold { a }.byReplacing() and it would do the same.

replace {}.by {} also looks interesting :)