fslaborg / Deedle

Easy to use .NET library for data and time series manipulation and for scientific programming
http://fslab.org/Deedle/
BSD 2-Clause "Simplified" License
924 stars 197 forks source link

Add Frame.concat #412

Closed sebhofer closed 2 years ago

sebhofer commented 5 years ago

Adds a concatenation function appending the columns of multiple frames. This is essentially a special case of merge, however much more efficient if the column keys of all input frames are identical.

I wrote (a variant of) this function when trying to nest/unnest large frames, which would never finish. unnest uses mergeAll, which (at least when used with a frame created by nest that has identical column keys) is much to general and possibly much too slow. Down the line I would thus suggest to use the concat function for unnest.

I will add unit tests for concat in the next couple of days.

zyzhu commented 5 years ago

@sebhofer This is a tough call. Without understanding all the details of mergeAll implementation. Here's my comments.

  1. pandas.concat https://pandas.pydata.org/pandas-docs/stable/generated/pandas.concat.html has a very similar behavior as Frame.mergeAll and Series.mergeAll. I would suggest we stick to the current terminology in Deedle to call them mergeAll. In the future we can setup a similar table as following https://pandas.pydata.org/pandas-docs/stable/comparison_with_r.html for users to get familiar with one-to-one map of functionality. Hence I would not add another new function called concat as it will confuse users.
  2. If the implementation of mergeAll is too slow, is it possible to add your implementation as a special case under mergeAll? You may check if column names are identical among all, function could opt for your implementation. Otherwise it goes with generic function. Maybe there are ways to optimize the current mergeAll function. But I need to read more about it.
  3. Your unit tests are comparing frame with frame. But comparing two identical frames will turn out to be false. You may compare column by column such as
    actual?B |> shouldEqual expected?B
sebhofer commented 5 years ago

@zyzhu Sorry for the wait, I was on vacation. I will look into implementing my concatenation as a special case of mergeAll, I think it should be possible.

sebhofer commented 5 years ago

@zyzhu I just looked at the pandas documentation, and pandas has an append function that basically does what my proposed function does, and they also point out that it is a special case of their concat function (see here). How about we also go with append? Personally I would prefer this solution because it's more explicit for the user. Also, the name append describes really well what this function does (better than merge in my opinion).

zyzhu commented 5 years ago

Sorry for late reply. I was out on vacation.

pandas append is a special case of concat. It will treat columns not in the existing one as new columns. Your implementation suggests it only works on two frames with exactly the same columns.

For instance, column C is added as a new column here.

df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
df3 = pd.DataFrame([[5, 6, 9], [7, 8, 10]], columns=list('ABC'))
df.append(df3)
   A  B     C
0  1  2   NaN
1  3  4   NaN
0  5  6   9.0
1  7  8  10.0
sebhofer commented 5 years ago

Ah, ok. Well I think it should be easy to put that in. What do you think makes more sense: Try to make this a special case of merge or make it work like pandas append (and also call it append)? I would prefer the latter.

On Wed, 5 Sep 2018, 16:29 Zhenyong Zhu notifications@github.com wrote:

Sorry for late reply. I was out on vacation.

pandas append is a special case of concat. It will treat columns not in the existing one as new columns. Your implementation suggests it only works on two frames with exactly the same columns.

For instance, column C is added as a new column here.

df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB')) df3 = pd.DataFrame([[5, 6, 9], [7, 8, 10]], columns=list('ABC')) df.append(df3) A B C0 1 2 NaN1 3 4 NaN0 5 6 9.01 7 8 10.0

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/fslaborg/Deedle/pull/412#issuecomment-418751756, or mute the thread https://github.com/notifications/unsubscribe-auth/AGwVmui1jCsdttrIl0QLQb4VNn-GvB9_ks5uX9_QgaJpZM4V40zc .

zyzhu commented 5 years ago

@sebhofer df.append works similar to the extension method in Deedle to support OOP style so that user can press dot and find functions of the frame. Inside pandas' implementation, it is still calling pd.concat function https://github.com/pandas-dev/pandas/blob/v0.23.4/pandas/core/frame.py#L6209

Deedle already has Merge in the extension that works similar to df.append https://github.com/fslaborg/Deedle/blob/master/src/Deedle/FrameExtensions.fs#L1114

If you add append function and support mismatched columns, it will generate exactly the same result as the current merge function. That might be confusing to user to have two functions with same functionalities. But your implementation will be much faster if column names are the same. I felt it serves more as a special case in merge. What do you think?

sebhofer commented 5 years ago

Well, in general it doesn't do the same... using merge you can add columns, my function can't do that. I feel that "append" describes very well what it does, better so than "merge". So I don't think it would be confusing to the user. But in the end it's obviously your decision.

zyzhu commented 5 years ago

@sebhofer I'm just a new OSS maintainer. I'll just facilitate the discussions to a point that reasonable decision can be made on Deedle.

If it's explicit in documentation that frame.append only works for frames with the same columns but will generate better performance, I think it's fair to add it.

Please rename concat function to append, fix unit tests and add Append in FrameExtension to wrap it up. Thanks!

sebhofer commented 5 years ago

Oh, okay. Sure, I'll do that.

All said, I just looked at the mergeAll code again and eventually it would certainly be the best solution if we could make this part perform better. Sadly, this is just above my head at the moment. Maybe @tpetricek has some input here?

Regarding decision making Sometimes someone has to make a final decision and I think it's only fair that this is the person putting most work in. And I think you've done a great job since you've taken over maintaining Deedle!

tpetricek commented 5 years ago

[Sorry, this is a bit of a rant caused by looking at this...]

I think this PR shows a general difficulty with adding things to Deedle - If you look at how Merge is implemented, you'll find out that it is quite sophisticated. The way it works is that it uses indexBuilder.Merge to get a command that is then applied to the individual vectors that store the data in the data frame.

This pattern is used in most Deedle operations - the idea is to separate the "indexing" from "data representation". However, it also means that adding new features to Deedle in a "proper" way is quite hard, because you should either use the existing indexing operations (and existing commands) or create new ones (if absolutely necessary).

Also, you do not necessarily get much by doing that - the abstraction lets us do a few things in "Big Deedle" (virtualized data frames), but nobody uses that. Aside from that, it leave some room for future improvements in how things are done (e.g. making the underlying commands faster will make many things faster).

So, in principle, the way this is implemented is going against some of the Deedle principles, but then, in practice, this is a very sensible thing to do.

[Back to the topic]

If we could make this a special case optimization for the existing functionality, I think that would be ideal...

sebhofer commented 5 years ago

@tpetricek I did indeed look at the merge implementation and I saw it is quite sophisticated. In fact this is one of the reasons why I was against making my concat function a special case of merge: it would basically be two different functions in some kind of branching construct. Although I still think that the functionality of merge and concat are distinct enough to justify a separation into two functions, I also think that merge should have the same performance for this special case in the end. (Especially so as the reason for me coming up with this concat was to speed up unnest, which uses merge.)

So I will try to understand the details of your implementation better and hope that I can come up with a "proper" solution. If you have any suggestions for what to look at to help me understand how to use the index and vector builder commands properly, I'd be grateful.

@zyzhu Sorry for flip-flopping on this, I know we've been discussing this for a while now!

zyzhu commented 2 years ago

Close this overdue pull request. Could revisit in the future.