JuliaData / DataFrames.jl

In-memory tabular data in Julia
https://dataframes.juliadata.org/stable/
Other
1.74k stars 367 forks source link

DataFrameColumns support for sorting #3139

Open adienes opened 2 years ago

adienes commented 2 years ago

I found myself wanting to find the keys of the DataFrame, ordered by some comparator of their columns. As it stands, it seems that the most straightforward way to do this is

keys(eachcol(df))[sortperm(collect(eachcol(df)))]

I am wondering if we could define a sortperm (and I suppose all related sorting functions) directly on a DataFrameColumns object? It seems like it might be able to write more simply sortperm(eachcol(df))

Now a question arises, since DataFrameColumns is in the uncommon scenario of being accessible by both indices and keys, is should sortperm return a permutation of indices or of keys? In base Julia, it seems sorting functions are defined on AbstractVector, so it is usually assumed that this will return indices.

However. Looking at the docstring for e.g. sortperm, it says the promise is that sortperm returns "a permutation vector I that puts v[I] in sorted order."

Say keys(eachcol(df)) is [:a, :b, :c], then since eachcol(df)[:b, :c, :a] returns a DataFrameColumns object with the columns in that order of [:b, :c, :a], I think we can consider that a permutation vector I. Furthermore, since both findmax(eachcol(df)) and argmax(eachcol(df)), the latter of which is basically an alias for the former, return a key---the underlying implementation will sort over pairs(eachcol(df))---I think that is another strong reason that sortperm should also return a vector of keys, to retain the parity that argmax(eachcol(df)) == first(sortperm(eachcol(df)))

That all is to say, I would love to see added functionality as small as the following:

sortperm(dfc::DataFrameColumns) = keys(dfc)[sortperm(collect(dfc))]

Syntactic sugar for the existing pattern I am using.

bkamins commented 2 years ago

Furthermore, since both findmax(eachcol(df)) and argmax(eachcol(df)), return a key

DataFrames.jl does not define findmax and argmax. What you get is a consequence of a standard definition of these functions in Base Julia

I am wondering if we could define a sortperm

Yes, I think we could. It requires consideration of what you mention. How is "permutation vector" defined is a part of the issue. The other is what functions we need to define for consistency (e.g. sort, copy; how would it play with invperm and similar functions) - I would need to think.

CC @nalimilan for opinion

adienes commented 2 years ago

Interestingly, it seems that OrderedDict in DataStructures.jl supports sort but not sortperm. Maybe this functionality can be added there as well, hopefully in a consistent way with whatever is decided here.

bkamins commented 2 years ago

This is the point - adding methods from Base Julia should not be taken lightly and requires careful design to make sure all is consistent.

This is exactly the reason why DataFrameColumns is not AbstractVector. We could not make such subtyping because it would break some methods in Base Julia.

adienes commented 2 years ago

Upon thinking about it more, I also noticed that

p = pairs([1, 5, 3, 4, 2])
sort(p) #error
sortperm(p) #error

If there were sorting support in Base for Iterators.Pairs objects then I think sorting would just work automatically on DataFrameColumns without special support within DataFrames.jl. The only thing to be careful of is that now "permuation vector I" would in fact be an iterator, and the expected behavior of obj[I] would really be obtained by something like collect(c => obj[c] for c in I)

This could definitely be nice in other contexts in Base as well, like trying to return a sorted iterator over key,value pairs for a standard Dict. In fact, sort already works but sortperm does not, but in this case it is just special-cased to return an OrderedDict. I am a little surprised by this as well, as I feel like it would be more general to allow sorting over any Iterator.Pairs.

If you think this is a good idea, I could remove this issue and create a new one over in base to discuss adding support for sorting pairs more generally speaking.

nalimilan commented 2 years ago

I suspect a sort or sortperm method for non-vector iterators would probably be rejected, as it cannot be implemented without first collecting values into a vector, making it identical to sort!(collect(itr)). Also the sort method for Dict comes from DataStructures.jl and is deprecated (see https://github.com/JuliaCollections/OrderedCollections.jl/issues/25). The problem is that there's no ordered dict in Base so it's impossible to return a sorted structure. Returning an iterator that is not actually lazy does not make much sense, and unfortunately there's no way to sort something without storing sorted indices somewhere.

So we will likely have to define our own sorting methods for DataFrameColumns if we want this to work. The tricky part is that while the findmax and argmax fallbacks return column names, findall, findnext and findprev (which are defined in DataFrames) return integer indices instead. I would tend to favor returning indices, as a permutation vector consisting of symbols sounds a bit unusual. Also it's easier and more efficient to convert indices into names using keys(dfc)[inds] than doing the reverse operation.

That said, it's clear we have an inconsistency in the API. Apart from findmax and argmax, we have the more general inconsistency that findall doesn't behave as the docstring in Base says it should:

Indices or keys are of the same type as those returned by keys(A) and pairs(A).

Maybe not a big deal in practice though. We have a similar issue as NamedArrays.jl/AxisArrays.jl/etc., DataFrameColumns has both fast integer indices, and somewhat slower symbol/string keys. There's no generic API for this yet at AbstractArray doesn't have the two concepts.

adienes commented 2 years ago

I don't think it would be quite identical to sort!(collect(itr)) since it is my impression that collect is not guaranteed to return the same order upon subsequent calls for all iterators? (e.g. a Dict). One would have to make a temporary vector holding the keys, then sort those by the associated values. However, point taken that it is not fundamentally new functionality and may not be appropriate in Base.

I will say I would find it much more ergonomic to be able to get the top few keys of a DataFrame without having to index into them with a vector of indices; if you think that sort would be more consistent returning indices, maybe there can be two functions, with one of them being sortkeys ? I do not mean to introduce bloat, but coming from Pandas where this functionality is present as .argsort() I am finding this to be one of the few rough spots in my workflow, and it really seems like there should be a nice way to do this.

bkamins commented 2 years ago

I am finding this to be one of the few rough spots in my workflow, and it really seems like there should be a nice way to do this.

To derail a bit from the main issue. No one previously requested this functionality. What are the use-cases/examples where this is useful in practice? Thank you!

adienes commented 2 years ago

No one previously requested this functionality. What are the use-cases/examples where this is useful in practice?

Point taken :)

My use case is I am implementing some voting algorithms; a column header would be the name of a political candidate or party, and each row is one ballot. To give you an example, you could check out my (probably very rough) Julia implementation here https://electowiki.org/wiki/Threshold_Equal_Approval.

Notice my uses of findmin and findmax there. For this specific case that works just fine, but there are variations to this algorithm where I would instead take the lowest (or highest, respectively) X columns and then apply some tiebreaking algorithm among those X.

bkamins commented 2 years ago

Point taken :)

It is not a problem. We want to support various use-cases. It is just that this one is so rare that it was not requested before. We will think about it in 1.5 release.

Also note, that in general you can e.g. do

df[:, sortperm(collect(eachcol(df)))]

and you have a data frame ordered the way you want (if I understand your use-case correctly).