Open palday opened 1 year ago
I've taken a closer look at this. There seems to be two things going on here. The first thing is that the old version used a different rule for deciding on splits when building the KD tree. The rule was different from the one in the Loess papers. The old approach just used the median even when it wasn't unique. The papers search for the index where a change happens. It took some work to figure out exactly how they did it so I wrote a comment. However, this is a linear search for where the x value changes and there are a lot of ties in your data. Specifically, it searches forward and backwards for x == 8
and
julia> countmap(tbl.x)[8]
431154
and it has to do a bit of sorting for each iteration. The second issue is that the partial sort seems to be slower than it should and if I add a function barrier then it's much faster so it seem that the closure is causing some overhead, but even after that change, building the tree for your dataset is prohibitively slow. However, I tried the loess
in R which is based on the original Fortran version and it also seems to be really slow for this dataset so I think it's a consequence of the algorithm. Maybe it is possible use a bisection strategy instead of the linear search. I guess the assumption in the original paper was that the number of ties would be small such that the linear search would be fine.
Data for MWE (compressed because GitHub doesn't like
.arrow
as attachments)loess.arrow.zip
cc @dmbates