Closed jrfiedler closed 1 year ago
I think this might mostly be an issue with throwing errors or raise warnings. For reference, In R the first example gives
> df
x y
1 1 1
2 2 2
3 3 3
4 4 4
> ft = loess(y~x, df, span=0.25)
There were 15 warnings (use warnings() to see them)
> predict(ft, df)
Error in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x else if (is.data.frame(newdata)) as.matrix(model.frame(delete.response(terms(object)), :
NA/NaN/Inf in foreign function call (arg 5)
You are asking for the function to compute quadratic approximations to single data points which is not going to work. I guess we should fail if the local problems end up being underdetermined.
I've looked a bit more into this and it does look like there are a couple of other issues
The predict
function assumes that the vertices returned in https://github.com/JuliaStats/Loess.jl/blob/master/src/Loess.jl#L147 are distinct but that is often not the case for the first value. E.g.
julia> Loess.traverse(Loess.KDTree([0.0 1.0 2.0]'), [0.0])
2-element Vector{Tuple{Float64}}:
(0.0,)
(0.0,)
A possible solution is to avoid the interpolation in https://github.com/JuliaStats/Loess.jl/blob/cb38f5e224b30b372b386c9fbc5d4cc662daf5fa/src/Loess.jl#L152-L153 when the input value is one of the vertices. I'll prepare a PR with a fix.
I'm don't really understand why the KDTree isn't using the input data as the vertices instead of using medians. I believe using the input data would avoid (at least most of) the rounding that causes
4.000000000000001
4.000000000000001
More generally, it might be a good idea to just use https://github.com/KristofferC/NearestNeighbors.jl instead of an internal structure for this.
I've had a similar issue. I am curious about how the package handles NaN. I have missing data points, but since the loess function doesn't accept missing, I converted them to NaN.
I am pasting the real data, so I apologize for the legibility. You can simply copy and paste these in REPL if you want to try.
y = [14.73, 17.08, 17.63, NaN, 14.51, 16.75, 18.46, 17.24, 18.8, 18.73, 14.64, 20.61, 18.87, 19.76, NaN, 5.99, 13.12, 16.82, 23.63, 20.77, 13.93, 17.75, 20.16, 19.8, 17.43, 17.39, 16.59, 13.41, 9.4, 15.2, 16.06, 10.95, 16.37, 15.99, 19.03, 15.5, 23.77, 14.75, 7.7, 11.9, 10.37, 10.74, 19.1, 19.95, 17.84, 16.29, 21.02, 18.93, 18.4, 18.31, 13.27, 9.09, 8.18, 10.94, 22.18, 16.37, 22.97, 17.64, 15.44, 14.81, 15.49, 15.22, 9.82, 16.15, 18.13, 28.12, 15.36, 15.17, 25.17, 16.05, 19.82, 15.22, 18.77, 14.05, 14.13, 4.77, NaN, 18.39, 17.51, 21.99, 23.86, 22.13, 23.98, 19.52, 22.06, 19.27, 21.7, 19.01, 21.15, NaN, 26.61, 22.15, 23.61, 19.16, 23.07]
x = [11.0, 15.0, 2.0, 0.0, 3.0, 2.0, 9.0, 14.0, 13.0, 14.0, 16.0, 15.0, 16.0, 13.0, 0.0, 1.0, 3.0, 4.0, 6.0, 13.0, 16.0, 15.0, 13.0, 17.0, 14.0, 13.0, 9.0, 2.0, 2.0, 8.0, 10.0, 13.0, 13.0, 10.0, 15.0, 12.0, 14.0, 15.0, 2.0, 5.0, 2.0, 3.0, 9.0, 12.0, 14.0, 11.0, 11.0, 14.0, 14.0, 9.0, 5.0, 1.0, 4.0, 10.0, 12.0, 17.0, 16.0, 18.0, 13.0, 18.0, 16.0, 14.0, 5.0, 4.0, 4.0, 4.0, 11.0, 16.0, 13.0, 15.0, 16.0, 11.0, 13.0, 12.0, 3.0, 1.0, 0.0, 10.0, 9.0, 14.0, 15.0, 18.0, 13.0, 18.0, 13.0, 13.0, 3.0, 2.0, 1.0, 0.0, 9.0, 11.0, 11.0, 19.0, 18.0]
There are 4 NaN in y.
Now I build the model and predict.
model = loess(y, x)
v = predict(model,x)
It turns out, there are 29 NaN in v.
A few examples showing that
NaN
is frequently returned at the boundary, and a few zeros inside the range of the input. For all of these, the predictions are being made on the samex
values that were put intoloess
.I'm using Julia 1.0.1, with
Loess
installed today.Example 1
Part 1
The boundary predictions are
NaN
. The middle two predictions are zero, even though all of the inputy
values are positive.Part 2
Same
x
andy
, changingspan
to 0.33.The boundary predictions aren't
NaN
, but the middle two values are again zero.Example 2
Part 1
The boundary predictions are
NaN
, and one of the middle predictions is zero.Part 2
Changing
span
to 0.4 and 0.5 produces an error.Part 3
Changing
span
to 0.6Predictions on the left boundary are
NaN
.