JuliaStats / Loess.jl


KDtree from NearestNeighbors.jl #71

Closed ayushpatnaikgit closed 1 year ago

ayushpatnaikgit commented 1 year ago

NearestNeighbors.jl implements a KDTree type. In loess, we primarily need to find the points within the span of a given point. This can be done via NearestNeighbors.inrange.

Using the KD-trees from NearestNeighbors.jl would simplify the Loess.jl source.
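For instance, a minimal sketch of the query (KDTree and inrange are the actual NearestNeighbors.jl names; the fixed radius is just a stand-in, since loess's span is really a fraction of the points rather than a distance):

```julia
using NearestNeighbors

xs = rand(1, 1000)        # 1 × n matrix: each column is a (1-D) data point
tree = KDTree(xs)         # build the tree once

query = [0.5]             # the point whose neighborhood we want
radius = 0.05             # stand-in only: a real port would likely use knn
                          # with k = ceil(Int, span * n), since the span is
                          # a fraction of points, not a distance
idxs = inrange(tree, query, radius)   # indices of all points within radius
```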

andreasnoack commented 1 year ago

The KD-trees in loess are used differently. To cite https://link.springer.com/article/10.1007/BF01890836:

> The k-d tree ... was originally designed for answering nearest-neighbor queries but we use it in a different way. (Ironically, the tree is not of assistance in identifying the neighborhoods. The trouble is that α is commonly large enough that each neighborhood intersects most of the cells in the k-d tree.)

Maybe it's possible to use the KD-tree implementation from NearestNeighbors.jl, but it might use different splitting and stopping rules (I haven't checked), and the version here is already written.

ayushpatnaikgit commented 1 year ago

> The KD-trees in loess are used differently. To cite https://link.springer.com/article/10.1007/BF01890836:
>
> > The k-d tree ... was originally designed for answering nearest-neighbor queries but we use it in a different way. (Ironically, the tree is not of assistance in identifying the neighborhoods. The trouble is that α is commonly large enough that each neighborhood intersects most of the cells in the k-d tree.)
>
> Maybe it's possible to use the KD-tree implementation from NearestNeighbors.jl, but it might use different splitting and stopping rules (I haven't checked), and the version here is already written.

Good point. Instead of doing least-squares at all n points, loess approximates the curve by fitting only at the vertices of the KD-tree cells. I was aware of this approximation, but I didn't know it was done by exploiting KD-trees. Pretty clever.

> We hope the number of vertices will be much smaller than n. This is at least true asymptotically, because the number of cells needed to fit the data properly depends on the smoothness of g(x), not n. In Fig. 4 there are 66 vertices, so we solve just 66 least-squares problems instead of the thousands necessary to draw a contour plot.
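To make the approximation concrete, here is a rough sketch of fitting only at the vertices (illustrative code, not what Loess.jl actually does; the function name, tricube weights, and degree-1 fit are my assumptions):

```julia
using LinearAlgebra   # for Diagonal

# Fit a weighted least-squares line at each vertex only; predictions elsewhere
# would then come from interpolating between these vertex fits.
function fit_at_vertices(xs, ys, vertices, span)
    n = length(xs)
    q = max(2, ceil(Int, span * n))      # neighborhood size implied by the span
    map(vertices) do v
        idx = partialsortperm(abs.(xs .- v), 1:q)   # the q points nearest to v
        x, y = xs[idx], ys[idx]
        d = abs.(x .- v)
        dmax = maximum(d)
        w = dmax > 0 ? (1 .- (d ./ dmax) .^ 3) .^ 3 : ones(q)  # tricube weights
        X = hcat(ones(q), x .- v)        # local linear design, centered at v
        β = (X' * Diagonal(w) * X) \ (X' * (w .* y))
        β[1]                             # intercept = fitted value at v itself
    end
end
```

With, say, 66 vertices this solves 66 small least-squares problems regardless of n.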

You may not get identical results, but conceptually, some differences in the splitting rule won't matter.

I think the same holds for the stopping rule, though I am not entirely sure. I suspect the stopping rule they chose was picked to keep the FORTRAN implementation simple.

> An additional stopping criterion is needed in a FORTRAN implementation. Only a fixed-size array is available for storing vertices, so there may be no room to continue refining the tree. For this reason, we refine in breadth-first order so that stopping at any time spreads the error uniformly. Ordinarily, k-d tree algorithms have been implemented in depth-first order, but breadth-first is no harder.

We don't have this constraint in Julia, but refining depth-first would conceptually change things a little, perhaps for the better.
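To illustrate the difference, a toy sketch of the two refinement orders (Cell, split_cell, and the stopping test are stand-ins, not Loess.jl internals):

```julia
# A toy 1-D "cell" to refine; real k-d tree cells are boxes holding data.
struct Cell
    lo::Float64
    hi::Float64
end

should_split(c::Cell, tol) = (c.hi - c.lo) > tol
split_cell(c::Cell) = (m = (c.lo + c.hi) / 2; (Cell(c.lo, m), Cell(m, c.hi)))

# Breadth-first: a FIFO queue refines level by level, so stopping early
# (e.g. when a fixed-size vertex array fills up) spreads the error uniformly.
function refine_bfs(root::Cell, tol, maxcells)
    leaves, queue = Cell[], [root]
    while !isempty(queue) && length(leaves) + length(queue) < maxcells
        c = popfirst!(queue)
        should_split(c, tol) ? append!(queue, split_cell(c)) : push!(leaves, c)
    end
    append!(leaves, queue)   # whatever remains when we run out of room
end

# Depth-first: recurse into one child completely before the other; no cap
# is needed, which is the freedom we have in Julia.
function refine_dfs(c::Cell, tol)
    should_split(c, tol) || return [c]
    a, b = split_cell(c)
    vcat(refine_dfs(a, tol), refine_dfs(b, tol))
end
```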

In summary, I believe using NearestNeighbors.jl may not reproduce the exact approximation already implemented in the package, but it's the same idea. It's not critical to move to NearestNeighbors.jl, as the existing implementation works, and it can't lead to a big speedup, since most of the time is presumably spent on the least-squares fits. However, moving to NearestNeighbors.jl would simplify the source and let the package pick up upstream improvements.

Perhaps it should be done in a new loess package rather than in this one.

andreasnoack commented 1 year ago

> You may not get identical results, but conceptually, some differences in the splitting rule won't matter.

Indeed, but keeping identical results makes it easier to test, since we can then just compare with R's loess.
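For example, a test along these lines (assuming RCall.jl is available; the tolerance is illustrative and only makes sense when both implementations build the same tree):

```julia
using Loess, RCall, Test

x = collect(0.0:0.1:10.0)
y = sin.(x) .+ 0.1 .* randn(length(x))

model = loess(x, y; span = 0.75, degree = 2)
jl_pred = predict(model, x)

# Fit the same model with R's loess and compare fitted values at the data.
r_pred = rcopy(R"fitted(loess(y ~ x, data = data.frame(x = $x, y = $y), span = 0.75, degree = 2))")
@test jl_pred ≈ r_pred atol = 1e-4
```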