SciML / DiffEqParamEstim.jl

Easy scientific machine learning (SciML) parameter estimation with pre-built loss functions
https://docs.sciml.ai/DiffEqParamEstim/stable/

Problem with the setup of LogLikeLoss in build_loss_objective #104

Closed · finmod closed this 5 years ago

finmod commented 5 years ago

The logpdf function in the build_loss_objective is not set up properly.

What should take place? For a simple demo model like Lotka-Volterra or Lorenz and ONE DATASET, fit_mle should fit a MvNormal because it is a dataset of 2-3 variables with a full rank covariance matrix. For instance, Lorenz yields:

using Distributions
ss = suffstats(MvNormal, onedataset)
disfit = fit_mle(MvNormal, ss)

FullNormal(
dim: 3
μ: [-4.81219, -4.85575, 24.8367]
Σ: [44.017 44.0753 -10.1482; 44.0753 56.2979 -8.55543; -10.1482 -8.55543 52.913]
)

i.e. a vector of three means and a positive definite covariance matrix. Distributions.jl is good at this.

Then build_loss_objective should setup the LogLikeLoss of the (data-model) adjusted by the covariance matrix as in this example of Kalman filter (model) from QuantEcon.jl:

function log_likelihood(k::Kalman, y::AbstractVector)
    eta = y - k.G*k.cur_x_hat              # forecast error
    P = k.G*k.cur_sigma*k.G' + k.R         # covariance matrix of forecast error
    logL = -(length(y)*log(2pi) + logdet(P) .+ eta'/P*eta)[1]/2
    return logL
end

For multiple samples, the MvNormal and covariance matrix fitting takes place over and over for each sample. In short, when the ODEProblem or SDEProblem has more than one equation, it should always be MvNormal with a covariance matrix.

ChrisRackauckas commented 5 years ago

We do allow for using multivariate distributions, see https://github.com/JuliaDiffEq/DiffEqParamEstim.jl/blob/master/test/likelihood.jl#L46 . That should probably be added to the documentation example. There is a need for both since not every distribution trivially goes multivariable, though indeed if using normal distributions one should probably always use MVNormal.
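
For reference, a minimal sketch of that multivariate pattern (not the exact test code; the Lotka-Volterra setup and all numbers here are illustrative): fit an MvNormal at each time point from multiple noisy reads, then pass the vector of distributions to LogLikeLoss.

using OrdinaryDiffEq, DiffEqParamEstim, Distributions

function lotka(du, u, p, t)
    du[1] = p[1]*u[1] - p[2]*u[1]*u[2]
    du[2] = -p[3]*u[2] + p[4]*u[1]*u[2]
end
prob = ODEProblem(lotka, [1.0, 1.0], (0.0, 10.0), [1.5, 1.0, 3.0, 1.0])
sol = solve(prob, Tsit5())

t = collect(range(0, stop = 10, length = 200))
reads = [sol(ti) + 0.01randn(2) for ti in t, _ in 1:100]   # 100 noisy reads per time point
# one fitted MvNormal per time point (columns of each slice are samples)
dists = [fit_mle(MvNormal, reduce(hcat, reads[i, :])) for i in eachindex(t)]
obj = build_loss_objective(prob, Tsit5(), LogLikeLoss(t, dists),
                           maxiters = 10000, verbose = false)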

@Vaibhavdixit02 mind editing the docs?

finmod commented 5 years ago

This is how I discovered the bug. I tried this MvNormal for one sample only instead of 200 and it fails to give me a vector of 2 means and a 2x2 covariance matrix for Lotka-Volterra.

It should always be MvNormal because Distributions.jl checks if it is full rank or ill-conditioned.

ChrisRackauckas commented 5 years ago

What did you run? How did you fit a distribution to one sample?

finmod commented 5 years ago

As described above: the 3-equation Lorenz system producing one trajectory/dataset, called onedataset, of 300 observations:

using Distributions
ss = suffstats(MvNormal, onedataset)
disfit = fit_mle(MvNormal, ss)

FullNormal(
dim: 3
μ: [-4.81219, -4.85575, 24.8367]
Σ: [44.017 44.0753 -10.1482; 44.0753 56.2979 -8.55543; -10.1482 -8.55543 52.913]
)

i.e. a vector of three means and a positive definite covariance matrix. Distributions.jl is good at this.

Instead the current setup produces just one mean and one sigma as you have shown in the docs

ChrisRackauckas commented 5 years ago

That code doesn't seem to run. I assume you generated the dataset rather than getting it from Distributions?

using Distributions
ss = suffstats(MvNormal, onedataset)

UndefVarError: onedataset not defined
top-level scope at none:0

Instead the current setup produces just one mean and one sigma as you have shown in the docs

Yes, don't use the one in the docs. I pointed to a code which produces 2 means and a 2x2 covariance matrix on Lotka-Volterra: https://github.com/JuliaDiffEq/DiffEqParamEstim.jl/blob/master/test/likelihood.jl#L46-L53

If you have a multivariate distribution that Distributions.jl recognizes (which is not many, it cannot do arbitrary joint distributions), then this is the setup you're looking for. It is not in the documentation example, but it is documented.

finmod commented 5 years ago

The usual stuff:

function lorenz(du,u,p,t)
    du[1] = p[1]*(u[2] - u[1])
    du[2] = u[1]*(p[2] - u[3]) - u[2]
    du[3] = u[1]*u[2] - p[3]*u[3]
end

u0 = [1.0; 0.0; 0.0]
tspan = (0.0, 30.0)
p = [10.0,28.0,2.66]
prob_lorenz = ODEProblem(lorenz,u0,tspan,p)
trajectory = solve(prob_lorenz,Vern9())

then:

t = collect(range(0,stop=30,length=300))
snr = 0.1
randomized = VectorOfArray([(trajectory(t[i]) + snr*randn(3)) for i in 1:length(t)])
onedataset = convert(Array,randomized)

and then:

using Distributions
ss = suffstats(MvNormal, onedataset)
disfit = fit_mle(MvNormal, ss)

Let me try this L46-L53 example for one sample and see if I get a vector of means and a covariance matrix.

finmod commented 5 years ago

There is still something amiss here if you try to run this example for just one sample. diff_distributions tells you that you need at least 2. Then with 2 samples, the objective function says there is a problem with the first array.

BoundsError: attempt to access 1-element Array{MvNormal{Float64,PDMats.PDMat{Float64,Array{Float64,2}},Array{Float64,1}},1} at index [2]

Can we have one sample with a vector of means and its covariance?

ChrisRackauckas commented 5 years ago

There is still something amiss here if you try to run this example for just one sample. diff_distributions tells you you need at least 2. Then with 2 samples, the objective function says there is a problem with the first array.

To use diff_distributions you need at least two sample points, that is indeed correct.

BoundsError: attempt to access 1-element Array{MvNormal{Float64,PDMats.PDMat{Float64,Array{Float64,2}},Array{Float64,1}},1} at index [2]

You only have a single distribution. This cost function API is for a time series of distributions, so I'm not entirely sure what you're looking to do here.

finmod commented 5 years ago

The standard MLE problem for parameter inference is the one described above in QuantEcon.jl

function log_likelihood(k::Kalman, y::AbstractVector)
    eta = y - k.G*k.cur_x_hat              # forecast error
    P = k.G*k.cur_sigma*k.G' + k.R         # covariance matrix of forecast error
    logL = -(length(y)*log(2pi) + logdet(P) .+ eta'/P*eta)[1]/2
    return logL
end

where eta is your loss measure of (data - model) and P is the error covariance matrix. The log-likelihood contains the standard logdet(P) term, which checks that the covariance is full rank and well-conditioned. It may well be that the loglikelihood function of Distributions.jl computes this quantity directly, but I cannot check that from the docs.
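
For reference, the multivariate normal log-density that Distributions.jl evaluates is the standard

\log p(x \mid \mu, \Sigma) = -\tfrac{1}{2}\left[k\log(2\pi) + \log\det\Sigma + (x-\mu)^\top \Sigma^{-1} (x-\mu)\right]

so the logdet(Σ) term is indeed included.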

Please remember all our earlier exchanges on this subject in 2017.

ChrisRackauckas commented 5 years ago

Okay, so this is a completely different cost function for a different kind of dataset. LogLikeLoss is designed for fairly dense data where there are multiple readings at each of the points along a time series (like biological time series). At each point in the time series, the multiple reads are collapsed into a distribution, and the fit to each of those distributions is assessed. This naturally takes into account many features like heteroskedasticity, but it requires multiple time series to pull data from, with data at overlapping points.

What you're asking for is a single distribution which is centered at each of the data points and has a constant covariance matrix. We should definitely do this, but under a different name. Have a suggestion for a name? Is there a paper/resource on this? I don't know where you'd get the values for sigma.

finmod commented 5 years ago

The MLE page of wikipedia has it all: https://en.wikipedia.org/wiki/Maximum_likelihood_estimation .

We covered all this ground in DiffEqParamEstim/issues/29 in June 2017.

In my view, the correct application to biological models is in the parameter estimation section of this paper: https://docs.google.com/viewer?a=v&pid=sites&srcid=ZGVmYXVsdGRvbWFpbnxhbWlnbzJ0b29sYm94fGd4OjFkYzM5Mzg5NDJjZTdiZDc . It is all about model parameter estimation and not about distribution parameter estimation.

Many standard definitions are provided in Distributions.jl here: https://github.com/JuliaStats/Distributions.jl/blob/master/src/multivariate/mvnormal.jl. Perhaps we should get them involved in formulating our solution. The starting covariance matrix is the one coming from the dataset as shown above using fit_mle. This is known as the unrestricted loglikelihood, i.e. the quantity of information contained in the dataset. The objective function measures the loglikelihood of the model restricted by the parameters.

ChrisRackauckas commented 5 years ago

There is no single "correct application" of parameter estimation. It is dependent on the kind of data you have. We just implemented one that was different from your use case (but fit the kind of problem that I have). For example, if your data at a given time point is not a single value but multiple reads, the methods you're pointing to don't even apply (but our distributional one does).

In my view, the correct application to biological models is in the parameter estimation section of this paper: https://docs.google.com/viewer?a=v&pid=sites&srcid=ZGVmYXVsdGRvbWFpbnxhbWlnbzJ0b29sYm94fGd4OjFkYzM5Mzg5NDJjZTdiZDc . It is all about model parameter estimation and not about distribution parameter estimation.

The MLE here is equivalent to a weighted L2 in the heteroscedastic version, and just an L2 in the homoscedastic version. You make the weights be 1/sigma_i and then add a constant to the result. We can put a wrapper over L2Dist that does this, but it doesn't change how it optimizes at all. The interesting part of your question, handling covariances, is not part of that document's method.
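
A minimal sketch of that weighted L2 via L2Loss's data_weight keyword (t, data, and prob are assumed from a setup like the ones in this thread; the σ values are illustrative):

using OrdinaryDiffEq, DiffEqParamEstim

σ = fill(0.1, size(data))                               # assumed per-point standard deviations
weighted_loss = L2Loss(t, data, data_weight = 1 ./ σ)   # weights 1/σ_i as described above
obj = build_loss_objective(prob, Tsit5(), weighted_loss)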

We covered all this ground in DiffEqParamEstim/issues/29 in June 2017.

Yes exactly. This part of AMIGO was covered by the implementation of a weighted L2. Usage of multivariate normals is now a new thing.

ChrisRackauckas commented 5 years ago

While this is interesting, note that the way you estimated the distribution is not sensible. Look at your code

function lorenz(du,u,p,t)
    du[1] = p[1]*(u[2] - u[1])
    du[2] = u[1]*(p[2] - u[3]) - u[2]
    du[3] = u[1]*u[2] - p[3]*u[3]
end

u0 = [1.0; 0.0; 0.0]
tspan = (0.0, 30.0)
p = [10.0,28.0,2.66]
prob_lorenz = ODEProblem(lorenz,u0,tspan,p)
trajectory = solve(prob_lorenz,Vern9())

then:

t = collect(range(0,stop=30,length=300))
snr = 0.1
randomized = VectorOfArray([(trajectory(t[i]) + snr*randn(3)) for i in 1:length(t)])
onedataset = convert(Array,randomized)

using Distributions
ss = suffstats(MvNormal, onedataset)
disfit = fit_mle(MvNormal, ss)

What you get from this is an MvNormal where the means are the means of the time series and the covariance matrix is for the full time series against the time series mean. This isn't a reasonable distribution to perform a fit against, because you don't expect every time point of the solution to be centered at the time series mean.

The more informative multivariate log-likelihood would be the generalization of the one from the AMIGO paper. That would give the one from QuantEcon.jl. Essentially the problem is: given a user's input covariance matrix Omega, estimate the fit of the model to the data using N(u(p,t) - y(t), Omega).
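
In symbols, with data y_i at times t_i, state dimension k, and model solution u(p,t), that cost is the negative log-likelihood

-\log L(p) = \frac{1}{2}\sum_{i=1}^{N}\left[k\log(2\pi) + \log\det\Omega + (u(p,t_i)-y_i)^\top \Omega^{-1} (u(p,t_i)-y_i)\right]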

Note that if you then allow Omega(t) and generalize beyond normal distributions, you get our LogLikeLoss. So our LogLikeLoss is a generalization of this formulation, but probably overkill. Here is how it looks using LogLikeLoss:

using OrdinaryDiffEq, DiffEqParamEstim
function lorenz(du,u,p,t)
    du[1] = p[1]*(u[2] - u[1])
    du[2] = u[1]*(p[2] - u[3]) - u[2]
    du[3] = u[1]*u[2] - p[3]*u[3]
end

u0 = [1.0; 0.0; 0.0]
tspan = (0.0, 30.0)
p = [10.0,28.0,2.66]
prob_lorenz = ODEProblem(lorenz,u0,tspan,p)
trajectory = solve(prob_lorenz,Vern9())

t = collect(range(0,stop=30,length=300))
snr = 0.1
randomized = [trajectory(ti) + snr*randn(3) for ti in t]  # one noisy read per time point
Ω = [1 0.2 0.5
     0.2 1 0.6
     0.5 0.6 1]  # user-chosen covariance matrix

using Distributions
dists = [MvNormal(μ,Ω) for μ in randomized]  # a distribution centered at each data point
loss = LogLikeLoss(t,dists)

We may want a convenience function where the user gives a covariance matrix and it generates this loss.
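
A minimal sketch of such a convenience function (the name covariance_loglike_loss is hypothetical; nothing like it exists in the package yet):

using Distributions, DiffEqParamEstim

# Hypothetical helper: a fixed covariance matrix Ω and a vector of data
# points become a time series of MvNormal distributions for LogLikeLoss.
function covariance_loglike_loss(t, data, Ω)
    dists = [MvNormal(y, Ω) for y in data]
    return LogLikeLoss(t, dists)
end

loss = covariance_loglike_loss(t, randomized, Ω)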

finmod commented 5 years ago

Yes, this is the general direction of this issue. The restated problem with

Ω = [1 0.2 0.5
     0.2 1 0.6
     0.5 0.6 1]

is similar to your stochastic example in the docs, but it should work for one dataset only, along the lines of:

using Distributions
dists = [MvNormal(μ,Ω) for μ in randomized]
loss = LogLikeLoss(t,dists)

Currently, MvNormal(μ,Ω) with Ω, a matrix, is not operational.

ChrisRackauckas commented 5 years ago

Currently, MvNormal(μ,Ω) with Ω, a matrix, is not operational.

No, using multivariate distributions like in the code I put above works.

finmod commented 5 years ago

And then, which "build_loss_objective" can be used to feed this loss and onedataset (randomized)?

finmod commented 5 years ago

The "build_loss_objective" function for this type of parameter estimation problem should be of this form : https://nbviewer.jupyter.org/github/JuliaNLSolvers/Optim.jl/blob/gh-pages/latest/examples/generated/maxlikenlm.ipynb with the TwiceDifferentiable command that allows for the calculation of standard errors and asymtotic t-values after the optimization.

ChrisRackauckas commented 5 years ago

obj = build_loss_objective(prob,alg,loss) and throw that to the optimizer. Nothing special.

This loss function is pretty much equivalent to what they have written as vars -> Log_Likelihood(x, y, vars[1:nvar], vars[nvar + 1]) (but loss handles multivariate distributions with the covariances taken into account of course!).

None of our loss functions are set up with TwiceDifferentiable, though we should do that. The API for doing that is a little wonky if you're not just targeting Optim, though.
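
A minimal sketch of that wiring, following the maxlikenlm notebook's recipe (obj is the objective from build_loss_objective; p0 and the standard-error computation are illustrative, not package features):

using Optim, NLSolversBase, LinearAlgebra

p0 = [9.0, 25.0, 2.5]                       # assumed initial guess
td = TwiceDifferentiable(obj, p0; autodiff = :forward)
res = optimize(td, p0, Newton())
H = hessian!(td, Optim.minimizer(res))      # observed information at the optimum
se = sqrt.(diag(inv(H)))                    # asymptotic standard errors
tvals = Optim.minimizer(res) ./ se          # asymptotic t-values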

finmod commented 5 years ago

With the proposed solution, the objective function is flat and no optimization takes place

obj = build_loss_objective(prob_lorenz,Tsit5(),loss, maxiters=10000,verbose=false)

(::DiffEqObjective{getfield(DiffEqParamEstim, Symbol("##14#18")){Nothing,Bool,Int64,typeof(DiffEqParamEstim.STANDARD_PROB_GENERATOR),Base.Iterators.Pairs{Symbol,Integer,Tuple{Symbol,Symbol},NamedTuple{(:maxiters, :verbose),Tuple{Int64,Bool}}},ODEProblem{Array{Float64,1},Tuple{Float64,Float64},true,Array{Float64,1},ODEFunction{true,typeof(lorenz),LinearAlgebra.UniformScaling{Bool},Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing},Nothing,DiffEqBase.StandardODEProblem},Tsit5,LogLikeLoss{Array{Float64,1},Array{MvNormal{Float64,PDMats.PDMat{Float64,Array{Float64,2}},Array{Float64,1}},1}},Nothing},getfield(DiffEqParamEstim, Symbol("##17#21"))}) (generic function with 2 methods)

then

bounds = Tuple{Float64, Float64}[(0, 15), (20, 30), (2, 3)]
res = bboptimize(obj;SearchRange = bounds, MaxSteps = 11e3)

produces:

Starting optimization with optimizer DiffEvoOpt{FitPopulation{Float64},RadiusLimitedSelector,BlackBoxOptim.AdaptiveDiffEvoRandBin{3},RandomBound{RangePerDimSearchSpace}}
0.00 secs, 0 evals, 0 steps
0.50 secs, 1040 evals, 990 steps, improv/step: 1.000 (last = 1.0000), fitness=718.457108428
1.00 secs, 2330 evals, 2281 steps, improv/step: 1.000 (last = 0.9992), fitness=718.457108428
1.51 secs, 3478 evals, 3429 steps, improv/step: 1.000 (last = 1.0000), fitness=718.457108428
2.01 secs, 4669 evals, 4620 steps, improv/step: 1.000 (last = 1.0000), fitness=718.457108428
2.52 secs, 5903 evals, 5854 steps, improv/step: 1.000 (last = 1.0000), fitness=718.457108428
3.02 secs, 7119 evals, 7070 steps, improv/step: 1.000 (last = 1.0000), fitness=718.457108428
3.52 secs, 8337 evals, 8288 steps, improv/step: 1.000 (last = 1.0000), fitness=718.457108428
4.03 secs, 9414 evals, 9365 steps, improv/step: 1.000 (last = 1.0000), fitness=718.457108428
4.53 secs, 10623 evals, 10574 steps, improv/step: 1.000 (last = 1.0000), fitness=718.457108428

Optimization stopped after 11001 steps and 4.732000112533569 seconds
Termination reason: Max number of steps (11000) reached
Steps per second = 2324.8097502918135
Function evals per second = 2335.164779631355
Improvements/step = 1.0
Total function evaluations = 11050

Best candidate found: [13.4164, 22.6071, 2.59954]

ChrisRackauckas commented 5 years ago

Yeah, I'm not surprised. Single shooting single loglikelihood doesn't seem to work well on long chaotic problems. But that's the likelihood you asked for. Could you post your whole code to double check it's the right likelihood construction?

finmod commented 5 years ago

It is the one you proposed above. You may want to replicate this with Lotka-Volterra on all four parameters to separate the model issue from the estimation issue.

function lorenz(du,u,p,t)
    du[1] = p[1]*(u[2] - u[1])
    du[2] = u[1]*(p[2] - u[3]) - u[2]
    du[3] = u[1]*u[2] - p[3]*u[3]
end

u0 = [1.0; 0.0; 0.0]
tspan = (0.0, 30.0)
p = [10.0,28.0,2.66]
prob_lorenz = ODEProblem(lorenz,u0,tspan,p)
trajectory = solve(prob_lorenz,Vern9())

t = collect(range(0,stop=30,length=300))
snr = 0.1
randomized = [trajectory(ti) + snr*randn(3) for ti in t]
Ω = [1 0.2 0.5
     0.2 1 0.6
     0.5 0.6 1]

then

using Distributions
dists = [MvNormal(μ,Ω) for μ in randomized]
loss = LogLikeLoss(t,dists)

then

obj = build_loss_objective(prob_lorenz,Tsit5(),loss, maxiters=10000,verbose=false)
bounds = Tuple{Float64, Float64}[(0, 15), (20, 30), (2, 3)]
res = bboptimize(obj;SearchRange = bounds, MaxSteps = 11e3)

ChrisRackauckas commented 5 years ago

Yeah that looks correct. The issue in terms of applications is probably just that you used

Ω = [1 0.2 0.5
0.2 1 0.6
0.5 0.6 1]

as the choice of covariance matrix, which is very large in comparison to the SNR (the noise is diagonal with 0.1s) and flattens out the likelihood. If you have an oracle with which to double-check that the likelihood calculation is correct, please do so, since that would be a good test: if you wrote down the QuantEcon thing above in a loop over the solution points, it should be the same up to floating point error. Internally, we are just calling the log-likelihood function from Distributions.jl on each distribution in the array and summing it up.
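
A minimal sketch of that oracle check (it assumes trajectory, t, randomized, and Ω from above, and that LogLikeLoss, being a cost to minimize, equals the negative of this sum):

using Distributions, LinearAlgebra

# Hand-rolled sum of per-point multivariate normal log-densities,
# i.e. the QuantEcon formula in a loop over the solution points.
function manual_loglik(sol, t, data, Ω)
    ll = 0.0
    for (ti, yi) in zip(t, data)
        eta = yi - sol(ti)    # (data - model) residual
        ll -= (length(yi)*log(2pi) + logdet(Ω) + dot(eta, Ω \ eta)) / 2
    end
    return ll
end

# compare, up to floating point error:
# loss(trajectory) ≈ -manual_loglik(trajectory, t, randomized, Ω)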

finmod commented 5 years ago

@tpapp Can you be drawn into this problem of estimating the parameters of the Lorenz system by MLE from one dataset only, using BlackBoxOptim as the optimizer? In short, a dataset produced by a DiffEq system with a covariance.

IndirectLikelihood.jl and MaximumLikelihood.jl propose MLE and estimate_parameters functions, and the proper loglikelihood function for this problem, to be applied here. These packages are not maintained, presumably because they have been deprecated by more recent packages elsewhere.

ChrisRackauckas commented 5 years ago

I'm pretty sure every package here will give you the same exact scalar (calculated the same exact way if they are using Distributions.jl; it's just a sum over likelihood calls), so I don't know what you expect. Why not use a different loss function that is more appropriate for this problem? There are many things that should be done instead if you want to solve this well.

Using simple MLE on a chaotic problem just doesn't have a big basin for the minima, and it's a nice optimizer stress test but I wouldn't recommend trying to fit by (only) using single shooting MLE on any difficult problem.

tpapp commented 5 years ago

@finmod: sorry, I don't see how I could help, it has nothing to do with any of my packages. That said, I must agree with @ChrisRackauckas that MLE may not be the right choice here.

ChrisRackauckas commented 5 years ago

I think I'll just post this and close:

http://www.birs.ca/events/2013/5-day-workshops/13w5151/videos/watch/201311140905-Campbell.html

LogLikeLoss is a nonlinear regression. It's a slightly fancier nonlinear regression than L2, one which weights by covariance matrices, but it's still just a nonlinear regression. I don't think any source would suggest that a nonlinear regression is going to work on the Lorenz equation when starting from scratch, so that's just a no-go. Instead, we have the two-stage method, so give that a try (a sketch follows below). Give multiple shooting a try, which is a mix between collocation and nonlinear regression. This summer we are working on Gaussian process methods, model relaxation methods, surrogate models, etc., so those would be worth trying as well. But this is not the method to try in this case.
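
A minimal sketch of the two-stage method mentioned above (it assumes prob_lorenz, t, and the data matrix onedataset from earlier; the signature follows the package docs of the time and may have changed):

using DiffEqParamEstim, Optim

# Stage 1 smooths the data by collocation and estimates derivatives;
# stage 2 fits the parameters without ever solving the ODE (no shooting).
two_stage_obj = two_stage_method(prob_lorenz, t, onedataset)
res = optimize(two_stage_obj, [9.0, 25.0, 2.5], BFGS())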

finmod commented 5 years ago

@ChrisRackauckas

DiffEqParamEstim and the corresponding page in DiffEqDocs are misnomers for parameter estimation of models and should be relabelled parameter simulation, within the more general class of Data Assimilation (DA). Under DA there are plenty of algorithms for all types of DiffEqs, including chaotic systems: the Kalman filter family, 4DVAR, and others are covered in Julia's DataAssim.jl and in DAPPER in Python. These are also likely to be much more efficient than the current direction of HMC and NN, where anything startling has yet to appear.

Parameter estimation is a statistical technique dealing with the identification and significance (standard errors) of models, and not only with data distributions. The roadmap to this approach can be found in Chapter 2 of Seber and Wild, Nonlinear Regression, 2003, as well as Chapter 14 of Kenneth Lange, Numerical Analysis for Statisticians, 2nd edition, 2010. A shortcut to the latter is in the slides here: https://github.com/Hua-Zhou/Hua-Zhou.github.io/blob/master/teaching/biostatm280-2017spring/slides/18-newton/newton.ipynb

build_loss_objective is too generic and should be renamed "mle_nonlinear_regression" or "LS_nonlinear_regression" when the optimization is left out, or "mle_nl_fit" when the optimization is inside the problem. The optimization problem should unfold all the elementary quantities required for Newton's method, Fisher scoring, quasi-Newton: gradient of log-likelihood, differential of L, Hessian of L, information matrix and expected information matrix and approximations to them (the boxed expressions in the newton notebook). Also see survey of optimization methods https://hua-zhou.github.io/media/pdf/LangeChiZhou14OptmSurvey.pdf .

All of these quantities are matrices that should be checked to be positive definite and invertible. Some of the elements of these matrices can be derived analytically from the distribution postulated for the model parameters: univariate normal, multivariate normal and others. The current build_loss_objective does not deal with any of these properties as it should. Many of the notions used by Lange and Hua Zhou are already programmed in Julia here: https://github.com/OpenMendel/VarianceComponentModels.jl/blob/master/docs/mle_reml.ipynb . The optimization problem in this notebook extends the methods beyond Newton and quasi-Newton to Majorize-Minorize (MM).

The benchmark problems should be challenging and aligned with the rest of the community. Lotka-Volterra with two parameters is trivial, as I can work it out on the back of an envelope. Just throw a decent challenge at the optimization problem!

This issue is closed. You may want to develop in new issues the points about the parameter estimation of dynamical systems that JuliaDiffEq intends to deal with, because many of us are interested in the structural parameters of a model. This is not just about blackbox simulation and forecasting. Structural parameters have real meaning (Rayleigh numbers, wave height, proteins, molecules, etc.) that hyper-parameters do not have.

ChrisRackauckas commented 5 years ago

build_loss_objective is too generic and should be renamed "mle_nonlinear_regression" or "LS_nonlinear_regression" when the optimization is left out, or "mle_nl_fit" when the optimization is inside the problem.

Yes I agree. That would need to wait for a breaking update though of course.

The optimization problem should unfold all the elementary quantities required for Newton's method, Fisher scoring, quasi-Newton: gradient of log-likelihood, differential of L, Hessian of L, information matrix and expected information matrix and approximations to them (the boxed expressions in the newton notebook). Also see survey of optimization methods https://hua-zhou.github.io/media/pdf/LangeChiZhou14OptmSurvey.pdf .

I don't get what you're asking for here. You have access to all of these already except EM, which would need SAEM for the nonlinear version. The elementary quantities for Newton's method are all there via automatic differentiation, and the library is also set up to use adjoints if you ask for it. Anything else you need?

Some of the elements of these matrices can be derived analytically from the distribution postulated for the model parameters: univariate normal, multivariate normal and others.

No, they can't be derived analytically for general nonlinear models. There's still the nonlinear term in there from the simulation.

The benchmark problems should be challenging and aligned with the rest of the community. Lotka-Volterra with two parameters is trivial as I can work it out on the back of an envelop. Just throw a decent challenge to the optimization problem!

Yes, that's not a benchmark problem, it's a test problem.

I'm sorry, but I do not see any actual suggestions in there. Do you have any concrete suggestions other than a renaming of build_loss_objective, which I agree should just reference that it (usually) is nonlinear regression?

ChrisRackauckas commented 5 years ago

BTW, using L2Loss with all of the terms (standard, first differencing, regularization, and the new derivative term https://github.com/JuliaDiffEq/DiffEqParamEstim.jl/pull/105 ) is equivalent to the 4DVAR here https://d-nb.info/1116709740/34 (when the regularization is set correctly).
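
A minimal sketch of that combined L2Loss (the derivative term from #105 is not shown; the weights are illustrative, and t and data are assumed from earlier in the thread):

using DiffEqParamEstim

# standard data-fitting term plus first-differencing term; a regularization
# term would be supplied separately to complete the 4DVAR-style cost.
loss_4dvar_style = L2Loss(t, data, data_weight = 0.7, differ_weight = 0.3)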