Closed sebhofer closed 2 years ago
@sebhofer This is a tough call. Without understanding all the details of mergeAll implementation. Here's my comments.
actual?B |> shouldEqual expected?B
@zyzhu Sorry for the wait, I was on vacation. I will look into implementing my concatenation as a special case of mergeAll, I think it should be possible.
@zyzhu I just looked at the pandas documentation, and pandas has an append function that basically does what my proposed function does, and they also point out that it is a special case of their concat
function (see here). How about we also go with append
? Personally I would prefer this solution because it's more explicit for the user. Also, the name append describes really well what this function does (better than merge in my opinion).
Sorry for late reply. I was out on vacation.
pandas append is a special case of concat. It will treat columns not in the existing one as new columns. Your implementation suggests it only works on two frames with exactly the same columns.
For instance, column C is added as a new column here.
df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
df3 = pd.DataFrame([[5, 6, 9], [7, 8, 10]], columns=list('ABC'))
df.append(df3)
A B C
0 1 2 NaN
1 3 4 NaN
0 5 6 9.0
1 7 8 10.0
Ah, ok. Well I think it should be easy to put that in. What do you think makes more sense: Try to make this a special case of merge or make it work like pandas append (and also call it append)? I would prefer the latter.
On Wed, 5 Sep 2018, 16:29 Zhenyong Zhu notifications@github.com wrote:
Sorry for late reply. I was out on vacation.
pandas append is a special case of concat. It will treat columns not in the existing one as new columns. Your implementation suggests it only works on two frames with exactly the same columns.
For instance, column C is added as a new column here.
df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB')) df3 = pd.DataFrame([[5, 6, 9], [7, 8, 10]], columns=list('ABC')) df.append(df3) A B C0 1 2 NaN1 3 4 NaN0 5 6 9.01 7 8 10.0
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/fslaborg/Deedle/pull/412#issuecomment-418751756, or mute the thread https://github.com/notifications/unsubscribe-auth/AGwVmui1jCsdttrIl0QLQb4VNn-GvB9_ks5uX9_QgaJpZM4V40zc .
@sebhofer df.append
works similar to the extension method in Deedle to support OOP style so that user can press dot and find functions of the frame. Inside pandas' implementation, it is still calling pd.concat
function
https://github.com/pandas-dev/pandas/blob/v0.23.4/pandas/core/frame.py#L6209
Deedle already has Merge
in the extension that works similar to df.append
https://github.com/fslaborg/Deedle/blob/master/src/Deedle/FrameExtensions.fs#L1114
If you add append
function and support mismatched columns, it will generate exactly the same result as the current merge
function. That might be confusing to user to have two functions with same functionalities. But your implementation will be much faster if column names are the same. I felt it serves more as a special case in merge. What do you think?
Well, in general it doesn't do the same... using merge
you can add columns, my function can't do that. I feel that "append" describes very well what it does, better so than "merge". So I don't think it would be confusing to the user. But in the end it's obviously your decision.
@sebhofer I'm just a new OSS maintainer. I'll just facilitate the discussions to a point that reasonable decision can be made on Deedle.
If it's explicit in documentation that frame.append only works for frames with the same columns but will generate better performance, I think it's fair to add it.
Please rename concat function to append, fix unit tests and add Append in FrameExtension to wrap it up. Thanks!
Oh, okay. Sure, I'll do that.
All said, I just looked at the mergeAll
code again and eventually it would certainly be the best solution if we could make this part perform better. Sadly, this is just above my head at the moment. Maybe @tpetricek has some input here?
Regarding decision making Sometimes someone has to make a final decision and I think it's only fair that this is the person putting most work in. And I think you've done a great job since you've taken over maintaining Deedle!
[Sorry, this is a bit of a rant caused by looking at this...]
I think this PR shows a general difficulty with adding things to Deedle - If you look at how Merge is implemented, you'll find out that it is quite sophisticated. The way it works is that it uses indexBuilder.Merge
to get a command that is then applied to the individual vectors that store the data in the data frame.
This pattern is used in most Deedle operations - the idea is to separate the "indexing" from "data representation". However, it also means that adding new features to Deedle in a "proper" way is quite hard, because you should either use the existing indexing operations (and existing commands) or create new ones (if absolutely necessary).
Also, you do not necessarily get much by doing that - the abstraction lets us do a few things in "Big Deedle" (virtualized data frames), but nobody uses that. Aside from that, it leave some room for future improvements in how things are done (e.g. making the underlying commands faster will make many things faster).
So, in principle, the way this is implemented is going against some of the Deedle principles, but then, in practice, this is a very sensible thing to do.
[Back to the topic]
If we could make this a special case optimization for the existing functionality, I think that would be ideal...
@tpetricek I did indeed look at the merge implementation and I saw it is quite sophisticated. In fact this is one of the reasons why I was against making my concat function a special case of merge: it would basically be two different functions in some kind of branching construct. Although I still think that the functionality of merge and concat are distinct enough to justify a separation into two functions, I also think that merge should have the same performance for this special case in the end. (Especially so as the reason for me coming up with this concat was to speed up unnest, which uses merge.)
So I will try to understand the details of your implementation better and hope that I can come up with a "proper" solution. If you have any suggestions for what to look at to help me understand how to use the index and vector builder commands properly, I'd be grateful.
@zyzhu Sorry for flip-flopping on this, I know we've been discussing this for a while now!
Close this overdue pull request. Could revisit in the future.
Adds a concatenation function appending the columns of multiple frames. This is essentially a special case of
merge
, however much more efficient if the column keys of all input frames are identical.I wrote (a variant of) this function when trying to nest/unnest large frames, which would never finish.
unnest
usesmergeAll
, which (at least when used with a frame created bynest
that has identical column keys) is much to general and possibly much too slow. Down the line I would thus suggest to use theconcat
function forunnest
.I will add unit tests for
concat
in the next couple of days.