JuliaStats / GLM.jl

Generalized linear models in Julia

PosDefException with simple linear regression when x/y ranges are small #375

Open mortenpi opened 4 years ago

mortenpi commented 4 years ago

I'm getting a PosDefException: matrix is not positive definite; Cholesky factorization failed when doing a simple lm(@formula(y ~ x), df) where the range of the x and y values is tiny compared to their absolute value. It should be possible to replicate with this dataset:

using DataFrames, GLM

# small sawtooth perturbation taking values in [-1, 0], used to add scatter around the trend
δ(n; q=3) = mod(n, q)/(q-1) - 1

function dataset(N)
    # large offsets (x0, y0) with increments and scatter that are tiny relative to them
    x0, Δx, δx = 13.5, 1e-8, 1e-8
    y0, Δy, δy = 200, 1e-7, 2e-7
    DataFrame(
        x = x0 .+ Δx .* (-N:N) .+ δx .* δ.(1:2N+1; q=3),
        y = y0 .+ Δy .* (-N:N) .+ δy .* δ.(3:2N+3; q=5),
    )
end
df = dataset(10)
lm(@formula(y ~ x), df)
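
For context, the failure can be reproduced outside of lm by forming the normal equations by hand. A minimal sketch (the design matrix X below is built manually, purely to illustrate the conditioning, and is not GLM's internal code path):

using LinearAlgebra

X = [ones(nrow(df)) df.x]      # design matrix for y ~ 1 + x
cond(X' * X)                   # huge condition number: x barely varies around 13.5
cholesky(Symmetric(X' * X))    # expected to throw the same PosDefException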

I have a full MWE here as a gist.

I can see that this is numerically an edge case, but I feel it should ideally just work. For my purposes I can work around the issue fairly easily by shifting and scaling the x and y values, fitting, and then scaling the coefficients back:

function linreg(df::DataFrame, x::Symbol, y::Symbol)
    xs, ys = df[:, x], df[:, y]
    xmin, xmax = extrema(xs)
    ymin, ymax = extrema(ys)
    # rescale both variables to [0, 1] so that the Cholesky-based solve is well-conditioned
    _df = DataFrame(x=(xs .- xmin) ./ (xmax - xmin), y=(ys .- ymin) ./ (ymax - ymin))
    m = lm(@formula(y ~ x), _df)
    _b, _k = coef(m)
    # map the fitted intercept and slope back to the original scale
    k = _k * (ymax - ymin) / (xmax - xmin)
    b = _b * (ymax - ymin) + ymin - xmin*k
    return b, k
end
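
For example (hypothetical usage on the dataset above; the returned pair is (intercept, slope) on the original scale):

b, k = linreg(df, :x, :y)
# fitted values on the original scale are then b .+ k .* df.x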

Seems to give a reasonable result:

[figure: fit]

andreasnoack commented 4 years ago

I think this is one of the relatively rare cases where you'd need the extra precision that the QR factorization provides relative to the Cholesky. The functionality for using QR is there, but we don't expose it in a nice way, so let's keep this one open to track the progress of that.
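
For reference, a rough sketch of the QR route using plain LinearAlgebra with a hand-built design matrix (not GLM's internal code path): solving the least-squares problem from a QR factorization of X avoids forming X'X and therefore avoids squaring the condition number.

using LinearAlgebra

X = [ones(nrow(df)) df.x]   # hand-built design matrix for y ~ 1 + x
β = qr(X) \ df.y            # least-squares solve via QR; X'X is never formed
# equivalently X \ df.y, which uses a pivoted QR for rectangular X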

mortenpi commented 4 years ago

Would it make sense for the logic to be "try Cholesky; if that errors, try QR"? I.e., so that the user wouldn't have to do anything?

andreasnoack commented 4 years ago

So far we have aimed for type stability, and that wouldn't be possible with the approach you are suggesting. I'd prefer that we just provide an easy way to switch to QR, and we could maybe also consider suggesting the QR-based version in the error message thrown when the Cholesky fails.
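
(For anyone landing here later: subsequent GLM.jl releases did add such a switch. A hedged sketch, assuming an installed version that exposes the method keyword on lm:)

# assumes a GLM.jl release with the `method` keyword; check your installed version
lm(@formula(y ~ x), df; method = :qr)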

casasgomezuribarri commented 3 years ago

Hi, sorry for reviving this. I am having a similar issue with LinearBinaryClassifier, and I am really lost as to why. The error I get when I try to fit the machine (I am using MLJ) is the same as described here:

ERROR: LoadError: TaskFailedException:
PosDefException: matrix is not positive definite; Cholesky factorization failed.

My dataset has 100 rows and 3500 columns, and all features are of similar magnitude. The minimum value across the entire X array is -0.016700 and the maximum value is 0.354000, so I wouldn't say the range is small compared to their absolute value.

Any idea why this could be happening?

EDIT: I think I know what the reason might be now. Since my X dataset contains the infrared absorbance at several (3500) wavenumbers, some pairs of wavenumbers have exactly the same absorbance for all samples (making them essentially the same variable). This has been solved in a few other issues by setting the third argument of lm() to true.
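
(For reference, in plain GLM that call looks roughly like the sketch below; newer GLM releases spell the same option as a dropcollinear keyword. X and y here are placeholder names for a design matrix and response.)

lm(X, y, true)                   # third positional argument: tolerate collinear columns
# or, on newer GLM releases:
lm(X, y; dropcollinear = true)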

However, since my model is embedded inside a pipeline, my implementation is to wrap the pipeline in a machine and use fit!().

Is there a way of allowing collinearity in this context?

@load LinearBinaryClassifier pkg=GLM
LinearBinaryClassifierPipe = @pipeline(Standardizer(),
                                       OneHotEncoder(drop_last = true),
                                       LinearBinaryClassifier())
LogisticModel = machine(LinearBinaryClassifierPipe, X, yc)
fit!(LogisticModel)
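
One possible workaround, under the assumption that the offending features are exact duplicates as described in the edit above, is to drop the duplicated columns before building the machine. A hedged sketch, assuming X is a DataFrame:

using DataFrames

# keep only the first column of each group of exactly identical columns
kept = unique(i -> X[:, i], 1:ncol(X))
X_dedup = X[:, kept]

LogisticModel = machine(LinearBinaryClassifierPipe, X_dedup, yc)
fit!(LogisticModel)
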
nalimilan commented 3 years ago

@casasgomezuribarri You have many more variables than observations, so it doesn't make sense to fit a GLM. You probably want to apply some kind of dimensionality reduction, like partial least squares.