Open nalimilan opened 8 years ago
Implementing these methods should allow for
sort(obj::AbstractCategoricalVector)
and dispatch on whether isordered
or not? If it isn't ordered would it use the non-categorical value representation for Forward
?
Whether a categorical array is ordered doesn't make any difference for sort, since it uses isless
rather than <
.
how do you actually sort a 1D CategoricalArray?
a = CategoricalArray(rand(1:10,100))
sort(a)
gives error
ERROR: ArgumentError: CategoricalValue objects with different pools cannot be tested for order
Stacktrace:
[1] isless(::CategoricalArrays.CategoricalValue{Int64,UInt32}, ::CategoricalArrays.CategoricalValue{Int64,UInt32}) at C:\Users\dzj\.julia\v0.6\CategoricalArrays\src\value.jl:127
[2] sort!(::CategoricalArrays.CategoricalArray{Int64,1,UInt32,Int64,CategoricalArrays.CategoricalValue{Int64,UInt32},Union{}}, ::Int64, ::Int64, ::Base.Sort.MergeSortAlg, ::Base.Order.ForwardOrdering, ::CategoricalArrays.CategoricalArray{Int64,1,UInt32,Int64,CategoricalArrays.CategoricalValue{Int64,UInt32},Union{}}) at .\sort.jl:359
[3] sort!(::CategoricalArrays.CategoricalArray{Int64,1,UInt32,Int64,CategoricalArrays.CategoricalValue{Int64,UInt32},Union{}}, ::Int64, ::Int64, ::Base.Sort.MergeSortAlg, ::Base.Order.ForwardOrdering, ::CategoricalArrays.CategoricalArray{Int64,1,UInt32,Int64,CategoricalArrays.CategoricalValue{Int64,UInt32},Union{}}) at .\sort.jl:347 (repeats 2 times)
[4] sort!(::CategoricalArrays.CategoricalArray{Int64,1,UInt32,Int64,CategoricalArrays.CategoricalValue{Int64,UInt32},Union{}}, ::Base.Sort.MergeSortAlg, ::Base.Order.ForwardOrdering) at .\sort.jl:436
[5] #sort!#7(::Base.Sort.MergeSortAlg, ::Function, ::Function, ::Bool, ::Base.Order.ForwardOrdering, ::Function, ::CategoricalArrays.CategoricalArray{Int64,1,UInt32,Int64,CategoricalArrays.CategoricalValue{Int64,UInt32},Union{}}) at .\sort.jl:497
[6] #sort#8(::Array{Any,1}, ::Function, ::CategoricalArrays.CategoricalArray{Int64,1,UInt32,Int64,CategoricalArrays.CategoricalValue{Int64,UInt32},Union{}}) at .\sort.jl:546
[7] sort(::CategoricalArrays.CategoricalArray{Int64,1,UInt32,Int64,CategoricalArrays.CategoricalValue{Int64,UInt32},Union{}}) at .\sort.jl:546
[8] eval(::Module, ::Any) at .\boot.jl:235
Woops, there's a problem with MergeSort
since it works with an array allocated via similar
, which has a different pool. Other algorithms work fine. I think we should implement specialized sorting methods which work directly on the integer codes, which will be much faster.
yeah implemented a really fast one in SortingLab.jl
Base already contains a fast path for integer vectors with a small number of levels. If you have code which is faster than the existing one for that, it should probably be added to Base.
But the Base doesn't have a signature for sort(::CategoricalArrays,...)
.
So it's not as easy to use. Because it's sorting so it should go to
SortingAlgorothms.jl but that creates a dependency for
SortingAlgorithms.jl. If it goes to CategoricalArrays then it creates
dependencies as well. That's why I thought a CategoeicalArraysTools.jl with
whatever depdendency might be a good idea.
Also isn't DataFrame sort in DataFrames.jl? I am happy with either way
current, the latest version of SortingLab.jl has a fsort
method for
CategoricalArrays that is quite fast
On 29 Jan. 2018 7:56 am, "Milan Bouchet-Valat" notifications@github.com wrote:
Base already contains a fast path for integer vectors with a small number of levels. If you have code which is faster than the existing one for that, it should probably be added to Base.
— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/JuliaData/CategoricalArrays.jl/issues/12#issuecomment-361095217, or mute the thread https://github.com/notifications/unsubscribe-auth/AESfJVJNgvFjzcisK_DZozDmg-ObmrOzks5tPN7wgaJpZM4JvkwJ .
What I meant is that we can implement sort(x::CategoricalArrays,...)
by calling sort(x.refs)
, and that the latter can use the method from Base which has a fast path for array with a small number of levels. But now I see that your fsort
implementation should be very efficient, since it can assume we are able to allocate a vector with one entry per level. Would you make a pull request to add it to CategoricalArrays (with a few tests in test/13_arraycommon.jl
)?
a bit more work than I thought. you actually need to sort by x.pool.order
implemented in #122
https://github.com/JuliaData/CategoricalArrays.jl/pull/224 implements sort
using counting sort. We still need to implement a fast sortperm
(which is probably even more useful), probably taking inspiration from the existing PooledArray
method.
We should review DataArrays's methods and decide which of them should be implemented: https://github.com/JuliaStats/DataArrays.jl/blob/f17e4e30aa0713c794409802741f43cf8ff7f05e/src/pooleddataarray.jl#L576
These should also be used in DataFrames: https://github.com/JuliaStats/DataFrames.jl/blob/8334bf1cbeafba98f038e65a67a732e4125dae89/src/abstractdataframe/sort.jl#L312