sigP gradient error - Githubissues

FYI, I found a possible issue with the sigP parameter gradient while testing the gradient and Hessian for the 24layer model with P, C, and O models on.

The Problem: The explicitly computed gradient w.r.t. sigP does not match the complex step method. But in the Hessian, when only the 2nd derivative is w.r.t. sigP, the Hessian is consistent with the complex step method. This suggests that the problem is probably in the gradient definition for sigP.

I tested this using my version of the code in the Code_withCellModel folder, but the issue occurred even when the cell model is set to off and rO2C is a constant parameter (in which case, the code should operate exactly the same as Code_24layer)

Steps Taken So far, I have tried to diagnose where the error is introduced by testing if the problem is in eqPcycle, eqOcycle, or eqCcyle

[ x ] Test1: run GHtest, with only opt_sigP turned on, and the Cmodel and Omodel off, to see if the problem is in eqPcycle.
- Result: explicit gradient and hessian were both consistent with complex step method, suggesting that the problem is not in eqPcycle
[ x ] Test2: run GHtest with only opt_sigP and opt_sigC turned on (Cmodel = on; Omodel = off)
- Result: sigP gradient did not match complex step method

My conclusion so far is that there is a bug in eqCcycle_v2.m in the gradient with respect to sigP. I have not identified where exactly the issue is yet, but thought I would post it here, in case anyone else has encountered or fixed this issue in other versions of the BGC model.

It looks like the error is in the DIC equation gradient w.r.t sigP in eqCcycle_v2.

test_sigP_gradient_Cx_vs_complex_step

So far, I can't find a problem in eqCcyle_v2.m The DIC gradient equation for sigP is exactly the same as the equation for all the other P model parameters.

RHS(1:nwet,pindx.lsigP) = -d0((1-sigC-gamma)*RR*Gx(:,pindx.lsigP))*C2P;

Are there any terms in this equation that would be functions of sigP (but not of the other P model parameters)?

C2P only depends on the observed phosphate, not the model DIP, so this should not be a function of any P model parameters
sigC and gamma are constants (sigC is a optimizable parameter; gamma = 0)
RR shouldn't depend on sigP either

That leaves Gx as the only term that depends on sigP

Is it possible that Gx(:,pindx.lsigP) has an error?

Gx(:,pindx.lsigP) appears in the gradient for 6 of the 7 carbon tracers, but only the DIC gradient is very far from the complex step solution
If Gx(:,pindx.lsigP) is wrong, it's strange that it wouldn't cause problems in any of the other tracer equations.
- how could this happen?
  1. Maybe a tiny error exists in all the equations, but gets amplified in the DIC equation only
  2. The error might be present in all equations but gets multiplied by zero or cancels out in the other tracer equations, so it doesn't affect their gradients
- Neither of these explanations seem to be the case here

I'm still looking into it, but I'm getting pretty stumped. @Hojongseo , do you have any other ideas of where the problem might be?

Right, I agree with you. As we talked after the BGC seminar, the equation is exactly the same for all P parameters, so I also don't know why the problem only occurs with the gradient of sigP.

I checked the code for Gx if there was a error, but I think there is nothing wrong with Gx. Gx = GpDIPx + d0(DIP)Gpx In this equation, the second term of RHS is zero, except Gpx(:,ialpha) and Gpx(:,ibeta), so it does not seem to be related to sigP. In addition, since there is also no errors on calculating DIPx (results of the complex step method for eqPcycle was fine), it seems the calculation of Gx is correct. For the double check, I also see the DIPx, which means Px(1:nwet,:) in eqPcycle, and it looks fine to me.

Although I cannot find where is the problem, I'll continue to debug it and let you know if I find anything. Thank you for your efforts on finding an issue on sigP!

After looking into it further, I don't think the problem is in the sigP equations after all. I ran a few GHtests of different parameter pairs (results below) and found the discrepancy between the explicit gradient and the complex step method occurred for whichever P model parameter was listed first in pindx. So the error I was seeing in the sigP gradient seems to occur due to its position in the vector of optimizable parameters, rather than an error in the gradient equations for sigP.

Initially, I thought this might be caused by the way we pack the gradients in Cx, but I couldn't find any issues with the ordering of the columns in Cx.

I now suspect the issue is due to how we set the initial guess for the Cmodel solution in the global variable GC.

GHtest results for different parameter pairs

Q10P & sigC on

Tunable parameters 
-----------------------------------------------
current Q10P    is   2.28e+00 
current sigC    is   9.00e-02

gradient relative error (1): -1.791e-03  
-1.130e-02   4.472e-04  

gradient relative error (2): -1.112e-14  
 2.871e-13   6.831e-15

kdP & alpha & sigC

Tunable parameters 
-----------------------------------------------
current kdP     is   1.86e-08 
current alpha   is   2.45e-08 
current sigC    is   9.00e-02

gradient relative error (1): -1.474e-03  
-3.571e-03  -2.646e-02   4.932e-05  

gradient relative error (2):  2.760e-12  
 4.937e-12   6.135e-12   2.159e-12

gradient relative error (3): -6.179e-16  
 2.775e-14   1.367e-13   9.657e-15

alpha & sigC on

Tunable parameters 
-----------------------------------------------
current alpha   is   2.45e-08 
current sigC    is   9.00e-02

gradient relative error (1): -1.576e-03  
-1.146e-02   2.049e-04  

gradient relative error (2): -8.856e-15  
 4.220e-13   7.066e-15

sigP & bC

Tunable parameters 
-----------------------------------------------
current sigP    is   3.50e-01 
current bC      is   6.50e-01

gradient relative error (1): -2.185e-03  
 2.228e-03  -1.742e-01  

-- OUT_OF_MEMORY -- 
Restarting GHtest...

Tunable parameters 
-----------------------------------------------
current sigP    is   3.50e-01 
current bC      is   6.50e-01

gradient relative error (1): -1.746e-12  
 1.164e-13   1.694e-10

This slurm job ran out of memory. Upon restarting, the gradient relative error issue disappeared.
What was different?
- When restarting, the output file par.fname already existed, so this output was used to update the initial guess stored in the global variable GC.

Possible Explanation

The error between the explicit gradient and the complex step method seems to only occur when the Newton-Armijo solver in eqCcycle does many iterations (e.g. stops at iter=18 as opposed to iter=1), suggesting that the problem results from the initial guess being far from the actual solution.

I think the problem may be related to resetting the global variable GC after solving Ceqn in eqCcycle. Resetting GC to the steady-state solution could explain why the error only occurs in the first parameter in x. For each of these tests, the first iteration reads in the initial fields from the initCO file from DATA/BGC_2023Nature. Because each test has a unique name, no output file (fname) exists yet, so the initial guesses do not get updated using par.fname.

When running neglogpost for the first parameter in the GHtest loop, GC is not very close to the Cmodel steady-state solution, so the Newton-Armijo solver in eqCcycle took 18 iterations to find a solution. Then the global variable GC is reset to this new solution. In every subsequent iteration through neglogpost, the global variable is very close to the actual solution. Thus, for the rest of the parameters tested with GHtest, the initial value of GC is already at the steady-state solution, and the Newton-Armijo solver in eqCcycle finds a solution after just one iteration.

I'm not entirely sure why having the initial guess for GC far from the solution would cause the explicit gradient to differ from the complex step method. Maybe the explicit gradients depend on the DIC values in the initial global variable GC, while the complex step method uses the updated GC to compute the gradient. I can look into this further later.

Conclusions So Far

The gradient equation for sigP seems correct, but the gradients only match the complex step method when the initial guess is near the steady-state solution. We should be able to avoid this issue by solving the steady state model once before running the GHtest. I don't know if this impacts the fminunc function's ability to minimize the objective function using the computed gradient and Hessian. Let me know if you have any ideas.

fprimeau / OCIM-BGC

sigP gradient error #15