JuliaData / CategoricalArrays.jl

Arrays for working with categorical data (both nominal and ordinal)
Other
125 stars 34 forks source link

Implement optimized sort functions #12

Open nalimilan opened 8 years ago

nalimilan commented 8 years ago

We should review DataArrays's methods and decide which of them should be implemented: https://github.com/JuliaStats/DataArrays.jl/blob/f17e4e30aa0713c794409802741f43cf8ff7f05e/src/pooleddataarray.jl#L576

These should also be used in DataFrames: https://github.com/JuliaStats/DataFrames.jl/blob/8334bf1cbeafba98f038e65a67a732e4125dae89/src/abstractdataframe/sort.jl#L312

Nosferican commented 6 years ago

Implementing these methods should allow for

sort(obj::AbstractCategoricalVector)

and dispatch on whether isordered or not? If it isn't ordered would it use the non-categorical value representation for Forward?

nalimilan commented 6 years ago

Whether a categorical array is ordered doesn't make any difference for sort, since it uses isless rather than <.

xiaodaigh commented 6 years ago

how do you actually sort a 1D CategoricalArray?

a = CategoricalArray(rand(1:10,100))
sort(a)

gives error

ERROR: ArgumentError: CategoricalValue objects with different pools cannot be tested for order
Stacktrace:
 [1] isless(::CategoricalArrays.CategoricalValue{Int64,UInt32}, ::CategoricalArrays.CategoricalValue{Int64,UInt32}) at C:\Users\dzj\.julia\v0.6\CategoricalArrays\src\value.jl:127
 [2] sort!(::CategoricalArrays.CategoricalArray{Int64,1,UInt32,Int64,CategoricalArrays.CategoricalValue{Int64,UInt32},Union{}}, ::Int64, ::Int64, ::Base.Sort.MergeSortAlg, ::Base.Order.ForwardOrdering, ::CategoricalArrays.CategoricalArray{Int64,1,UInt32,Int64,CategoricalArrays.CategoricalValue{Int64,UInt32},Union{}}) at .\sort.jl:359
 [3] sort!(::CategoricalArrays.CategoricalArray{Int64,1,UInt32,Int64,CategoricalArrays.CategoricalValue{Int64,UInt32},Union{}}, ::Int64, ::Int64, ::Base.Sort.MergeSortAlg, ::Base.Order.ForwardOrdering, ::CategoricalArrays.CategoricalArray{Int64,1,UInt32,Int64,CategoricalArrays.CategoricalValue{Int64,UInt32},Union{}}) at .\sort.jl:347 (repeats 2 times)
 [4] sort!(::CategoricalArrays.CategoricalArray{Int64,1,UInt32,Int64,CategoricalArrays.CategoricalValue{Int64,UInt32},Union{}}, ::Base.Sort.MergeSortAlg, ::Base.Order.ForwardOrdering) at .\sort.jl:436
 [5] #sort!#7(::Base.Sort.MergeSortAlg, ::Function, ::Function, ::Bool, ::Base.Order.ForwardOrdering, ::Function, ::CategoricalArrays.CategoricalArray{Int64,1,UInt32,Int64,CategoricalArrays.CategoricalValue{Int64,UInt32},Union{}}) at .\sort.jl:497
 [6] #sort#8(::Array{Any,1}, ::Function, ::CategoricalArrays.CategoricalArray{Int64,1,UInt32,Int64,CategoricalArrays.CategoricalValue{Int64,UInt32},Union{}}) at .\sort.jl:546
 [7] sort(::CategoricalArrays.CategoricalArray{Int64,1,UInt32,Int64,CategoricalArrays.CategoricalValue{Int64,UInt32},Union{}}) at .\sort.jl:546
 [8] eval(::Module, ::Any) at .\boot.jl:235
nalimilan commented 6 years ago

Woops, there's a problem with MergeSort since it works with an array allocated via similar, which has a different pool. Other algorithms work fine. I think we should implement specialized sorting methods which work directly on the integer codes, which will be much faster.

xiaodaigh commented 6 years ago

yeah implemented a really fast one in SortingLab.jl

nalimilan commented 6 years ago

Base already contains a fast path for integer vectors with a small number of levels. If you have code which is faster than the existing one for that, it should probably be added to Base.

xiaodaigh commented 6 years ago

But the Base doesn't have a signature for sort(::CategoricalArrays,...). So it's not as easy to use. Because it's sorting so it should go to SortingAlgorothms.jl but that creates a dependency for SortingAlgorithms.jl. If it goes to CategoricalArrays then it creates dependencies as well. That's why I thought a CategoeicalArraysTools.jl with whatever depdendency might be a good idea.

Also isn't DataFrame sort in DataFrames.jl? I am happy with either way current, the latest version of SortingLab.jl has a fsortmethod for CategoricalArrays that is quite fast

On 29 Jan. 2018 7:56 am, "Milan Bouchet-Valat" notifications@github.com wrote:

Base already contains a fast path for integer vectors with a small number of levels. If you have code which is faster than the existing one for that, it should probably be added to Base.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/JuliaData/CategoricalArrays.jl/issues/12#issuecomment-361095217, or mute the thread https://github.com/notifications/unsubscribe-auth/AESfJVJNgvFjzcisK_DZozDmg-ObmrOzks5tPN7wgaJpZM4JvkwJ .

nalimilan commented 6 years ago

What I meant is that we can implement sort(x::CategoricalArrays,...) by calling sort(x.refs), and that the latter can use the method from Base which has a fast path for array with a small number of levels. But now I see that your fsort implementation should be very efficient, since it can assume we are able to allocate a vector with one entry per level. Would you make a pull request to add it to CategoricalArrays (with a few tests in test/13_arraycommon.jl)?

xiaodaigh commented 6 years ago

a bit more work than I thought. you actually need to sort by x.pool.order

xiaodaigh commented 6 years ago

implemented in #122

nalimilan commented 4 years ago

https://github.com/JuliaData/CategoricalArrays.jl/pull/224 implements sort using counting sort. We still need to implement a fast sortperm (which is probably even more useful), probably taking inspiration from the existing PooledArray method.