Open adienes opened 2 years ago
Furthermore, since both findmax(eachcol(df)) and argmax(eachcol(df)), return a key
DataFrames.jl does not define findmax
and argmax
. What you get is a consequence of a standard definition of these functions in Base Julia
I am wondering if we could define a
sortperm
Yes, I think we could. It requires consideration of what you mention. How is "permutation vector" defined is a part of the issue. The other is what functions we need to define for consistency (e.g. sort
, copy
; how would it play with invperm
and similar functions) - I would need to think.
CC @nalimilan for opinion
Interestingly, it seems that OrderedDict
in DataStructures.jl
supports sort
but not sortperm.
Maybe this functionality can be added there as well, hopefully in a consistent way with whatever is decided here.
This is the point - adding methods from Base Julia should not be taken lightly and requires careful design to make sure all is consistent.
This is exactly the reason why DataFrameColumns
is not AbstractVector
. We could not make such subtyping because it would break some methods in Base Julia.
Upon thinking about it more, I also noticed that
p = pairs([1, 5, 3, 4, 2])
sort(p) #error
sortperm(p) #error
If there were sorting support in Base for Iterators.Pairs
objects then I think sorting would just work automatically on DataFrameColumns
without special support within DataFrames.jl
. The only thing to be careful of is that now "permuation vector I
" would in fact be an iterator, and the expected behavior of obj[I]
would really be obtained by something like collect(c => obj[c] for c in I)
This could definitely be nice in other contexts in Base as well, like trying to return a sorted iterator over key,value pairs for a standard Dict
. In fact, sort
already works but sortperm
does not, but in this case it is just special-cased to return an OrderedDict
. I am a little surprised by this as well, as I feel like it would be more general to allow sorting over any Iterator.Pairs
.
If you think this is a good idea, I could remove this issue and create a new one over in base to discuss adding support for sorting pairs more generally speaking.
I suspect a sort
or sortperm
method for non-vector iterators would probably be rejected, as it cannot be implemented without first collecting values into a vector, making it identical to sort!(collect(itr))
. Also the sort
method for Dict
comes from DataStructures.jl and is deprecated (see https://github.com/JuliaCollections/OrderedCollections.jl/issues/25). The problem is that there's no ordered dict in Base so it's impossible to return a sorted structure. Returning an iterator that is not actually lazy does not make much sense, and unfortunately there's no way to sort something without storing sorted indices somewhere.
So we will likely have to define our own sorting methods for DataFrameColumns
if we want this to work. The tricky part is that while the findmax
and argmax
fallbacks return column names, findall
, findnext
and findprev
(which are defined in DataFrames) return integer indices instead. I would tend to favor returning indices, as a permutation vector consisting of symbols sounds a bit unusual. Also it's easier and more efficient to convert indices into names using keys(dfc)[inds]
than doing the reverse operation.
That said, it's clear we have an inconsistency in the API. Apart from findmax
and argmax
, we have the more general inconsistency that findall
doesn't behave as the docstring in Base says it should:
Indices or keys are of the same type as those returned by keys(A) and pairs(A).
Maybe not a big deal in practice though. We have a similar issue as NamedArrays.jl/AxisArrays.jl/etc., DataFrameColumns
has both fast integer indices, and somewhat slower symbol/string keys. There's no generic API for this yet at AbstractArray
doesn't have the two concepts.
I don't think it would be quite identical to sort!(collect(itr))
since it is my impression that collect
is not guaranteed to return the same order upon subsequent calls for all iterators? (e.g. a Dict). One would have to make a temporary vector holding the keys, then sort those by the associated values. However, point taken that it is not fundamentally new functionality and may not be appropriate in Base.
I will say I would find it much more ergonomic to be able to get the top few keys of a DataFrame without having to index into them with a vector of indices; if you think that sort
would be more consistent returning indices, maybe there can be two functions, with one of them being sortkeys
? I do not mean to introduce bloat, but coming from Pandas where this functionality is present as .argsort()
I am finding this to be one of the few rough spots in my workflow, and it really seems like there should be a nice way to do this.
I am finding this to be one of the few rough spots in my workflow, and it really seems like there should be a nice way to do this.
To derail a bit from the main issue. No one previously requested this functionality. What are the use-cases/examples where this is useful in practice? Thank you!
No one previously requested this functionality. What are the use-cases/examples where this is useful in practice?
Point taken :)
My use case is I am implementing some voting algorithms; a column header would be the name of a political candidate or party, and each row is one ballot. To give you an example, you could check out my (probably very rough) Julia implementation here https://electowiki.org/wiki/Threshold_Equal_Approval.
Notice my uses of findmin
and findmax
there. For this specific case that works just fine, but there are variations to this algorithm where I would instead take the lowest (or highest, respectively) X columns and then apply some tiebreaking algorithm among those X.
Point taken :)
It is not a problem. We want to support various use-cases. It is just that this one is so rare that it was not requested before. We will think about it in 1.5 release.
Also note, that in general you can e.g. do
df[:, sortperm(collect(eachcol(df)))]
and you have a data frame ordered the way you want (if I understand your use-case correctly).
I found myself wanting to find the keys of the DataFrame, ordered by some comparator of their columns. As it stands, it seems that the most straightforward way to do this is
I am wondering if we could define a
sortperm
(and I suppose all related sorting functions) directly on aDataFrameColumns
object? It seems like it might be able to write more simplysortperm(eachcol(df))
Now a question arises, since
DataFrameColumns
is in the uncommon scenario of being accessible by both indices and keys, is shouldsortperm
return a permutation of indices or of keys? In base Julia, it seems sorting functions are defined onAbstractVector
, so it is usually assumed that this will return indices.However. Looking at the docstring for e.g.
sortperm
, it says the promise is thatsortperm
returns "a permutation vectorI
that putsv[I]
in sorted order."Say
keys(eachcol(df))
is[:a, :b, :c]
, then sinceeachcol(df)[:b, :c, :a]
returns a DataFrameColumns object with the columns in that order of[:b, :c, :a]
, I think we can consider that a permutation vectorI
. Furthermore, since bothfindmax(eachcol(df))
andargmax(eachcol(df))
, the latter of which is basically an alias for the former, return a key---the underlying implementation will sort overpairs(eachcol(df))
---I think that is another strong reason thatsortperm
should also return a vector of keys, to retain the parity thatargmax(eachcol(df)) == first(sortperm(eachcol(df)))
That all is to say, I would love to see added functionality as small as the following:
sortperm(dfc::DataFrameColumns) = keys(dfc)[sortperm(collect(dfc))]
Syntactic sugar for the existing pattern I am using.