JuliaStats / Loess.jl

Local regression, so smooooth!

Loess 0.6 fails (never completes) while Loess 0.5.4 completes in a few seconds #73

Open palday opened 1 year ago

palday commented 1 year ago

Data for MWE (compressed because GitHub doesn't like .arrow as attachments)

loess.arrow.zip

using Arrow
using Loess

tbl = Arrow.Table("loess.arrow")
# runs quickly on 0.5.4; never completes on 0.6.1
@time loess(tbl.x, tbl.y)
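
For reference, once loess returns, the fit would normally be evaluated with predict; on 0.6.1 the call above never gets that far:

model = loess(tbl.x, tbl.y)
yhat = predict(model, tbl.x)   # evaluate the smoothed curve at the observed x values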

cc @dmbates

andreasnoack commented 1 year ago

I've taken a closer look at this. There seem to be two things going on here. The first is that the old version used a different rule for deciding on splits when building the KD tree. The rule was different from the one in the Loess papers: the old approach just used the median even when it wasn't unique, whereas the papers search for the index where a change happens. It took some work to figure out exactly how they did it, so I wrote a comment. However, this is a linear search for where the x value changes, and there are a lot of ties in your data. Specifically, it searches forward and backwards for x == 8, and that value is heavily tied:

julia> countmap(tbl.x)[8]
431154
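
To make the difference between the two split rules concrete, here is a simplified sketch of the idea (not the actual Loess.jl code; the function names are made up). The first rule just takes the median index; the second scans outward from the median until the value changes, which costs on the order of the number of ties:

# Sketch only: both functions assume `xs` is sorted, as it is inside each tree node.

# Rule used up to 0.5.4 (roughly): split at the median index, ties or not.
median_split(xs) = div(length(xs) + 1, 2)

# Rule from the papers (roughly): scan outward from the median until the
# value changes, so the whole tied block ends up on one side of the split.
function tie_aware_split(xs)
    mid = div(length(xs) + 1, 2)
    v = xs[mid]
    for offset in 1:length(xs)                          # linear scan through the tied run
        r = mid + offset
        r <= length(xs) && xs[r] != v && return r - 1   # change found to the right; split just before it
        l = mid - offset
        l >= 1 && xs[l] != v && return l                # change found to the left; split at it
    end
    return mid                                          # every value is tied; fall back to the median
end

With your data, that scan can have to walk through a large fraction of the 431154 tied values before it finds a change.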

On top of that, it has to do a bit of sorting for each iteration. The second issue is that the partial sort seems to be slower than it should be, and if I add a function barrier it's much faster, so it seems that the closure is causing some overhead. But even after that change, building the tree for your dataset is prohibitively slow. However, I tried loess in R, which is based on the original Fortran version, and it also seems really slow for this dataset, so I think it's a consequence of the algorithm. Maybe it is possible to use a bisection strategy instead of the linear search (a sketch of the idea is below). I guess the assumption in the original paper was that the number of ties would be small, such that the linear search would be fine.
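
For what it's worth, a rough sketch of the bisection idea: since the values are already sorted within a node, the edges of the tied block around the median can be found with searchsortedfirst/searchsortedlast instead of the element-by-element scan. This is just an illustration, not a drop-in replacement for the actual tree code:

# Sketch only: `xs` is assumed sorted, as in the sketch above.
function bisect_split(xs)
    mid = div(length(xs) + 1, 2)
    v = xs[mid]
    first_eq = searchsortedfirst(xs, v)    # O(log n): first index of the tied block
    last_eq  = searchsortedlast(xs, v)     # O(log n): last index of the tied block
    if first_eq == 1 && last_eq == length(xs)
        return mid                         # every value is tied; no useful split exists
    elseif first_eq == 1
        return last_eq                     # block starts at index 1; split just after it
    elseif last_eq == length(xs)
        return first_eq - 1                # block ends at the last index; split just before it
    else
        # otherwise split at whichever edge of the tied block is closer to the median
        return last_eq - mid <= mid - first_eq ? last_eq : first_eq - 1
    end
end

On something like [1, 8, 8, 8, 9] this returns the same split index (4) as the linear scan above, but the cost is two binary searches rather than walking the tied run.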