JuliaStats / MixedModels.jl

A Julia package for fitting (statistical) mixed-effects models
http://juliastats.org/MixedModels.jl/stable
MIT License

Termination at saddle point #705

Open andreasnoack opened 11 months ago

andreasnoack commented 11 months ago

I'm facing an issue where a linear mixed model terminates at a saddle point for some datasets. I searched for prior discussion of this, and of BOBYQA and saddle points in general, but didn't find anything. The short version is that BOBYQA terminates at a saddle point where Optim's BFGS does not, so I'm wondering if it might be worthwhile to switch to a derivative-based optimization algorithm.

Now to the details. I'm working on a bioequivalence testing problem, so I don't have the luxury of choosing the model. The model is stated in SAS form in Appendix E of https://www.fda.gov/media/70958/download, and for most datasets I'm matching the SAS results exactly with the MixedModels formulation below. However, for some datasets (it looks to be associated with missing values), the MixedModels fits end up at saddle points. The following dataset is used for the reproducer:

saddlepointdata.csv

In the code below, I make tiny perturbations of the observations. Even though the perturbations are tiny, they are sufficient to make the difference between reaching an extremum and a saddle point (as I'll show further down):

using MixedModels, CSV, DataFrames, StatsBase, Random
using SparseArrays: nzrange

saddlepointdata = CSV.read("saddlepointdata.csv", DataFrame)
transform!(saddlepointdata, "AUCT" => eachindex => "row")

N = 500
_rng = Random.seed!(Random.default_rng(), 124)
fts = [
    fit(
        LinearMixedModel,
        @formula(log(AUCT) ~
            trt +
            seq +
            per +
            (trt + 0 | sub) +
            zerocorr(trt + 0 | row)
        ),
        transform(
            saddlepointdata,
            "AUCT" => ByRow(t -> t + randn(_rng)*1e-12) => "AUCT"
        );
        contrasts = Dict(
            [
                :trt,
                :per,
                :sub,
                :row,
            ] .=> Ref(DummyCoding())
        ),
        REML = true
    ) for _ in 1:N
];

countmap(round.(exp.(getindex.(coef.(fts), 2)), digits=4))

I'm focusing on the second coefficient because it's the one of interest in the equivalence testing, and restricting attention to a single coefficient keeps the output brief. The result of the code above is

Dict{Float64, Int64} with 2 entries:
  0.9329 => 483
  0.9301 => 17

The first of these solutions is an extremum and the second is a saddle point. To see this, we can run

using UnicodePlots

saddlepointfit = fts[findfirst(t -> exp(coef(t)[2]) < 0.931, fts)];

plotx = range(-12, stop=12, length=101)
_i = 5 # index of the θ component to perturb (the last of the five)
_θ = copy(saddlepointfit.θ)
ploty = [
    MixedModels.deviance(
        MixedModels.updateL!(
            MixedModels.setθ!(
                deepcopy(saddlepointfit), # to avoid mutating the original version
                _θ + [zeros(_i - 1); _x; zeros(5 - _i)]
            )
        )
    )
    for _x in plotx
];
lineplot(plotx, ploty, xlim = extrema(plotx), width = 80)

which gives

[Screenshot 2023-08-10 at 09:18:13: lineplot of the deviance along the fifth component of θ]

so we are clearly not at a minimum.
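Since the plot only survives as a screenshot, the same conclusion can be stated numerically with the plotx and ploty values from above (a small sketch added for completeness):

```julia
# The fitted θ corresponds to the plotx entry closest to zero; somewhere along
# this slice the deviance is lower, so the fit is not at a local minimum.
minimum(ploty) < ploty[argmin(abs.(plotx))]  # expected: true
```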

To dig a little deeper, I'm using a version of the objective function that supports ForwardDiff. The code is in the folded section below.

Forward-compatible MixedModels code:

```julia
using Optim, LinearAlgebra

mybound = x -> [exp.(x[1:3]); x[4]; exp.(x[5:6])]

function my_fit(
    ft::MixedModels.LinearMixedModel;
    show_trace = false,
    extended_trace = false
)
    opt_res = optimize(
        [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
        BFGS(initial_stepnorm = 1.0),
        Optim.Options(; show_trace, extended_trace),
        autodiff = :forward
    ) do θ
        my_deviance(ft, mybound(θ))
    end
    new_ft = deepcopy(ft)
    MixedModels.setθ!(new_ft, mybound(opt_res.minimizer)[1:end - 1])
    MixedModels.updateL!(new_ft)
    return new_ft
end

function my_deviance(model::LinearMixedModel, θ::AbstractVector{T}) where {T}
    σ² = θ[end]^2
    _θ = θ[1:(end - 1)]
    dof = MixedModels.ssqdenom(model)
    # Extract and promote
    A, L, reterms = model.A, model.L, model.reterms
    AA = [LinearAlgebra.copy_oftype(Ai, T) for Ai in A]
    LL = [LinearAlgebra.copy_oftype(Li, T) for Li in L]
    RR = [LinearAlgebra.copy_oftype(Ri, T) for Ri in reterms]
    # Update state with new θ
    _setθ!(RR, model.parmap, _θ)
    myupdateL!(AA, LL, RR)
    r² = _pwrss(LL)
    ld = _logdet(LL, RR, model.optsum.REML)
    return dof * log(2 * π * σ²) + ld + r² / σ²
end

function _setθ!(
    reterms::Vector{<:MixedModels.AbstractReMat},
    parmap::Vector{<:NTuple},
    θ::AbstractVector,
)
    length(θ) == length(parmap) || throw(DimensionMismatch())
    reind = 1
    λ = first(reterms).λ
    for (tv, tr) in zip(θ, parmap)
        tr1 = first(tr)
        if reind ≠ tr1
            reind = tr1
            λ = reterms[tr1].λ
        end
        λ[tr[2], tr[3]] = tv
    end
    return reterms
end

function myupdateL!(A::Vector, L::Vector, reterms::Vector)
    k = length(reterms)
    copyto!(last(L), last(A)) # ensure the fixed-effects:response block is copied
    for j in eachindex(reterms) # pre- and post-multiply by Λ, add I to diagonal
        cj = reterms[j]
        diagind = MixedModels.kp1choose2(j)
        MixedModels.copyscaleinflate!(L[diagind], A[diagind], cj)
        for i = (j + 1):(k + 1) # postmultiply column by Λ
            bij = MixedModels.block(i, j)
            MixedModels.rmulΛ!(copyto!(L[bij], A[bij]), cj)
        end
        for jj = 1:(j - 1) # premultiply row by Λ'
            MixedModels.lmulΛ!(cj', L[MixedModels.block(j, jj)])
        end
    end
    for j = 1:(k + 1) # blocked Cholesky
        Ljj = L[MixedModels.kp1choose2(j)]
        for jj = 1:(j - 1)
            _rankUpdate!(
                Hermitian(Ljj, :L),
                L[MixedModels.block(j, jj)],
                -one(eltype(Ljj)),
                one(eltype(Ljj)),
            )
        end
        _cholUnblocked!(Ljj, Val{:L})
        LjjT = isa(Ljj, Diagonal) ? Ljj : LowerTriangular(Ljj)
        for i = (j + 1):(k + 1)
            Lij = L[MixedModels.block(i, j)]
            for jj = 1:(j - 1)
                mul!(
                    Lij,
                    L[MixedModels.block(i, jj)],
                    L[MixedModels.block(j, jj)]',
                    -one(eltype(Lij)),
                    one(eltype(Lij)),
                )
            end
            rdiv!(Lij, LjjT')
        end
    end
    return nothing
end

_pwrss(L::Vector) = abs2(last(last(L)))

function _logdet(L::Vector, reterms::Vector{<:MixedModels.AbstractReMat}, REML::Bool)
    @inbounds s = sum(j -> MixedModels.LD(L[MixedModels.kp1choose2(j)]), axes(reterms, 1))
    if REML
        lastL = last(L)
        s += MixedModels.LD(lastL) # this includes the log of sqrtpwrss
        s -= log(last(lastL))      # so we need to subtract it from the sum
    end
    return s + s # multiply by 2 b/c the desired det is of the symmetric mat, not the factor
end

LinearAlgebra.copy_oftype(A::MixedModels.UniformBlockDiagonal, ::Type{T}) where {T} =
    MixedModels.UniformBlockDiagonal(LinearAlgebra.copy_oftype(A.data, T))

function LinearAlgebra.copy_oftype(
    A::MixedModels.BlockedSparse{<:Any,S,P},
    ::Type{T},
) where {T,S,P}
    cscmat = LinearAlgebra.copy_oftype(A.cscmat, T)
    nzsasmat = LinearAlgebra.copy_oftype(A.nzsasmat, T)
    MixedModels.BlockedSparse{T,S,P}(cscmat, nzsasmat, A.colblkptr)
end

function LinearAlgebra.copy_oftype(A::MixedModels.ReMat{<:Any,S}, ::Type{T}) where {T,S}
    MixedModels.ReMat{T,S}(
        A.trm,
        A.refs,
        A.levels,
        A.cnames,
        LinearAlgebra.copy_oftype(A.z, T),
        LinearAlgebra.copy_oftype(A.wtz, T),
        LinearAlgebra.copy_oftype(A.λ, T),
        A.inds,
        LinearAlgebra.copy_oftype(A.adjA, T),
        LinearAlgebra.copy_oftype(A.scratch, T),
    )
end

function _cholUnblocked!(A::StridedMatrix, ::Type{Val{:L}})
    cholesky!(Hermitian(A, :L))
    return A
end

function _cholUnblocked!(D::MixedModels.UniformBlockDiagonal, ::Type{T}) where {T}
    Ddat = D.data
    for k in axes(Ddat, 3)
        _cholUnblocked!(view(Ddat, :, :, k), T)
    end
    return D
end

function _cholUnblocked!(A::Diagonal, ::Type{T}) where {T}
    A.diag .= sqrt.(A.diag)
    return A
end

_cholUnblocked!(A::AbstractMatrix, ::Type{T}) where {T} = MixedModels.cholUnblocked!(A, T)

function _rankUpdate!(
    C::LinearAlgebra.HermOrSym{T,UniformBlockDiagonal{T}},
    A::StridedMatrix{T},
    α,
    β,
) where {T}
    Cdat = C.data.data
    LinearAlgebra.require_one_based_indexing(Cdat, A)
    isone(β) || rmul!(Cdat, β)
    blksize = size(Cdat, 1)
    for k in axes(Cdat, 3)
        ioffset = (k - 1) * blksize
        joffset = (k - 1) * blksize
        for i in axes(Cdat, 1), j = 1:i
            iind = ioffset + i
            jind = joffset + j
            AtAij = 0
            for idx in axes(A, 2)
                # because the second multiplicant is from A', swap index order
                AtAij += A[iind, idx] * A[jind, idx]
            end
            Cdat[i, j, k] += α * AtAij
        end
    end
    return C
end

function _rankUpdate!(
    C::LinearAlgebra.HermOrSym{T,UniformBlockDiagonal{T}},
    A::MixedModels.BlockedSparse{T,S},
    α,
    β,
) where {T,S}
    Ac = A.cscmat
    cp = Ac.colptr
    all(==(S), diff(cp)) ||
        throw(ArgumentError("Columns of A must have exactly $S nonzeros"))
    Cdat = C.data.data
    LinearAlgebra.require_one_based_indexing(Ac, Cdat)
    j, k, l = size(Cdat)
    S == j == k && div(Ac.m, S) == l ||
        throw(DimensionMismatch("div(A.cscmat.m, S) ≠ size(C.data.data, 3)"))
    nz = Ac.nzval
    rv = Ac.rowval
    @inbounds for j in axes(Ac, 2)
        nzr = nzrange(Ac, j)
        # BLAS.syr!('L', α, view(nz, nzr), view(Cdat, :, :, div(rv[last(nzr)], S)))
        _x = view(nz, nzr)
        view(Cdat, :, :, div(rv[last(nzr)], S)) .+= α .* _x .* _x'
    end
    return C
end

function _rankUpdate!(
    C::LinearAlgebra.HermOrSym{T,S},
    A::StridedMatrix{T},
    α,
    β,
) where {T,S}
    # BLAS.syrk!(C.uplo, 'N', T(α), A, T(β), C.data)
    C.data .*= β
    C.data .+= α .* A * A'
    return C
end

_rankUpdate!(C::AbstractMatrix, A::AbstractMatrix, α, β) = MixedModels.rankUpdate!(C, A, α, β)
```

Notice that this version of the objective function hasn't concentrated out σ. With this code, I can compute the Hessian, which clearly shows the indefiniteness:

julia> using ForwardDiff

julia> (objective(saddlepointfit), my_deviance(saddlepointfit, [_θ; saddlepointfit.sigma]))
(-284.3311306944413, -284.33113069939395)
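The two values agree because objective reports the profiled criterion: for fixed θ, the unprofiled my_deviance is minimized over σ at the closed form σ̂² = r²/dof, which is exactly the value the profiled fit plugs in. A quick sketch of that identity via a one-dimensional search over σ (the bracket is an assumption chosen to contain σ̂):

```julia
using Optim  # already loaded by the folded code above

# Brent's method on σ ↦ my_deviance(θ̂, σ) should recover the profiled objective
prof = optimize(σ -> my_deviance(saddlepointfit, [_θ; σ]), 1e-3, 10.0)
Optim.minimum(prof) ≈ objective(saddlepointfit)  # expected: true
```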


julia> ForwardDiff.hessian(θσ -> my_deviance(saddlepointfit, θσ), [_θ; saddlepointfit.sigma])
6×6 Matrix{Float64}:
    1.04012         9.41825e-6      0.00227411   0.000198533   1.81058e-8   1469.2
    9.41825e-6      0.58946        -1.33622e-5  -0.0003711     0.000425082  1660.82
    0.00227411     -1.33622e-5      0.860294     0.0357195    -2.86937e-7   1510.92
    0.000198533    -0.0003711       0.0357195    0.288273     -2.35253e-6     -1.50027
    1.81058e-8      0.000425082    -2.86937e-7  -2.35253e-6   -0.051874        1.04004
 1469.2          1660.82         1510.92        -1.50027       1.04004         9.40305e6
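To make the indefiniteness explicit, one can inspect the eigenvalues of this Hessian (a small check added here; the negative diagonal entry above already guarantees at least one negative eigenvalue):

```julia
using LinearAlgebra, ForwardDiff

H = ForwardDiff.hessian(θσ -> my_deviance(saddlepointfit, θσ), [_θ; saddlepointfit.sigma])
eigvals(Symmetric(H, :L))  # mixed signs ⇒ saddle point, not a minimum
```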

Finally, we can use the ForwardDiff-compatible code to fit the model with BFGS, using ForwardDiff-based derivatives:

julia> my_fts = [my_fit(ft) for ft in fts];

julia> countmap(round.(exp.(getindex.(coef.(my_fts), 2)), digits=4))
Dict{Float64, Int64} with 1 entry:
  0.9329 => 500

so with BFGS, we never end up at the saddle point (for this dataset). Of course, we don't know whether that holds for other datasets, but I wanted to share this observation to start the conversation. Maybe a gradient-based optimization should be considered. It will of course require some work, since the version I threw together here allocates heavily and can't be used as-is, but it might still be useful as a starting point.

palday commented 11 months ago

@andreasnoack Can you show us versioninfo()?

@dmbates has done a fair number of more involved experiments using the gradient, and it never really paid off. That said, we've recently discussed revisiting this, and evaluating the Hessian efficiently has several uses.

In my own experience, BOBYQA rarely fails on well-posed problems, but we do support other derivative-free algorithms implemented in NLopt. For example, if we switch to COBYLA, things work:

using ProgressMeter: @showprogress  # provides the @showprogress macro used below

N = 500
_rng = Random.seed!(Random.default_rng(), 124)
formula = @formula(log(AUCT) ~ trt + seq + per +
                   (trt + 0 | sub) +
                   zerocorr(trt + 0 | row))
contrasts = Dict(:trt => DummyCoding(), :per => DummyCoding(),
                 :sub => Grouping(), :row => Grouping())
fts = @showprogress map(1:N) do _
    data = transform(saddlepointdata,
                     "AUCT" => ByRow(t -> t + randn(_rng)*1e-12) => "AUCT")
    model = LinearMixedModel(formula, data; contrasts)
    model.optsum.optimizer = :LN_COBYLA
    return fit!(model; REML=true)
end

countmap(round.(exp.(getindex.(coef.(fts), 2)), digits=3))

yields

Dict{Float64, Int64} with 1 entry:
  0.933 => 500
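Other derivative-free NLopt algorithms can be selected the same way. For instance (a sketch; algorithm names as NLopt spells them):

```julia
model = LinearMixedModel(formula, saddlepointdata; contrasts)
model.optsum.optimizer = :LN_NELDERMEAD  # or :LN_SBPLX, :LN_COBYLA, ...
fit!(model; REML=true)
```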

andreasnoack commented 10 months ago

We have five test datasets where we could reproduce the saddle point. When running the test script above on those datasets, COBYLA always avoids the saddle-point solutions that BOBYQA sometimes terminates at.

It would be interesting to know if this is an issue with the implementation of BOBYQA or with the algorithm itself. cc @stevengj.

> @dmbates has done a fair number of more involved experiments using the gradient, and it never really paid off.

I'd be curious to read more about this. My expectation is that using the gradient would be beneficial.

> That said, we've recently discussed revisiting this, and evaluating the Hessian efficiently has several uses.

The reason I had the ForwardDiff-compatible deviance code handy is that I'm already using it for the Satterthwaite approximation of the degrees of freedom. I'd be happy to contribute that here if there is interest.
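Roughly, the computation I have in mind looks like the sketch below. The helper varofcontrast is hypothetical (it would return Var(c'β̂) for a contrast c as a function of the variance parameters η = [θ; σ]); none of this is existing MixedModels API:

```julia
using ForwardDiff, LinearAlgebra

# Satterthwaite: ν = 2 v(η̂)² / (g' Σ_η g), where v(η) = Var(c'β̂), g = ∇v(η̂),
# and Σ_η ≈ 2 * inv(H) with H the Hessian of the (RE)ML deviance at η̂.
function satterthwaite_dof(varofcontrast, deviance_at, η̂)
    v  = varofcontrast(η̂)
    g  = ForwardDiff.gradient(varofcontrast, η̂)
    Ση = 2 * inv(ForwardDiff.hessian(deviance_at, η̂))
    return 2 * v^2 / dot(g, Ση * g)
end
```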

palday commented 10 months ago

> The reason I had the ForwardDiff-compatible deviance code handy is that I'm already using it for the Satterthwaite approximation of the degrees of freedom. I'd be happy to contribute that here if there is interest.

@andreasnoack Definitely! That's been on my secret todo list for a very long time, but other things were always higher priority.

palday commented 10 months ago

@andreasnoack it's stale relative to main, but I think this was the most recent effort: https://github.com/JuliaStats/MixedModels.jl/tree/forwarddiff

palday commented 10 months ago

@andreasnoack see also https://github.com/JuliaStats/MixedModels.jl/issues/312

dmbates commented 5 months ago

As I mention in #742, I have been looking at switching MixedModels.jl to the PRIMA.jl version of bobyqa for the constrained, derivative-free optimization. @zaikunzhang, the developer of libprima, spent considerable time translating Powell's original Fortran 77 code into modularized modern Fortran, and along the way he discovered and fixed a number of bugs.

I will let you know if I am able to apply PRIMA.bobyqa to this problem.
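For concreteness, such an experiment might look like the sketch below. It assumes PRIMA.jl exposes bobyqa(f, x0; xl=...) (check PRIMA.jl's documentation for the exact interface) and reuses formula, contrasts, and the data from earlier in the thread:

```julia
using PRIMA, MixedModels

model = LinearMixedModel(formula, saddlepointdata; contrasts)
model.optsum.REML = true
# profiled REML objective as a function of θ
obj(θ) = objective(MixedModels.updateL!(MixedModels.setθ!(model, θ)))
θ_opt, info = PRIMA.bobyqa(obj, copy(model.optsum.initial); xl=model.optsum.lowerbd)
```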