fprimeau / OCIM-BGC

Coupled offline biogeochemistry model for the cycling of P, C, Si, and O2
4 stars 3 forks source link

sigP gradient error #15

Open MeganRSullivan opened 1 month ago

MeganRSullivan commented 1 month ago

FYI, I found a possible issue with the sigP parameter gradient while testing the gradient and Hessian for the 24layer model with P, C, and O models on.

The Problem: The explicitly computed gradient w.r.t. sigP does not match the complex step method. But in the Hessian, when only the 2nd derivative is w.r.t. sigP, the Hessian is consistent with the complex step method. This suggests that the problem is probably in the gradient definition for sigP.

I tested this using my version of the code in the Code_withCellModel folder, but the issue occurred even when the cell model is set to off and rO2C is a constant parameter (in which case, the code should operate exactly the same as Code_24layer)

Steps Taken So far, I have tried to diagnose where the error is introduced by testing if the problem is in eqPcycle, eqOcycle, or eqCcyle

My conclusion so far is that there is a bug in eqCcycle_v2.m in the gradient with respect to sigP. I have not identified where exactly the issue is yet, but thought I would post it here, in case anyone else has encountered or fixed this issue in other versions of the BGC model.

MeganRSullivan commented 1 month ago

It looks like the error is in the DIC equation gradient w.r.t sigP in eqCcycle_v2.

test_sigP_gradient_Cx_vs_complex_step

MeganRSullivan commented 1 month ago

So far, I can't find a problem in eqCcyle_v2.m The DIC gradient equation for sigP is exactly the same as the equation for all the other P model parameters.

RHS(1:nwet,pindx.lsigP) = -d0((1-sigC-gamma)*RR*Gx(:,pindx.lsigP))*C2P;

Are there any terms in this equation that would be functions of sigP (but not of the other P model parameters)?

That leaves Gx as the only term that depends on sigP

Is it possible that Gx(:,pindx.lsigP) has an error?

I'm still looking into it, but I'm getting pretty stumped. @Hojongseo , do you have any other ideas of where the problem might be?

Hojongseo commented 1 month ago

Right, I agree with you. As we talked after the BGC seminar, the equation is exactly the same for all P parameters, so I also don't know why the problem only occurs with the gradient of sigP.

I checked the code for Gx if there was a error, but I think there is nothing wrong with Gx. Gx = GpDIPx + d0(DIP)Gpx In this equation, the second term of RHS is zero, except Gpx(:,ialpha) and Gpx(:,ibeta), so it does not seem to be related to sigP. In addition, since there is also no errors on calculating DIPx (results of the complex step method for eqPcycle was fine), it seems the calculation of Gx is correct. For the double check, I also see the DIPx, which means Px(1:nwet,:) in eqPcycle, and it looks fine to me.

Although I cannot find where is the problem, I'll continue to debug it and let you know if I find anything. Thank you for your efforts on finding an issue on sigP!

MeganRSullivan commented 1 month ago

After looking into it further, I don't think the problem is in the sigP equations after all. I ran a few GHtests of different parameter pairs (results below) and found the discrepancy between the explicit gradient and the complex step method occurred for whichever P model parameter was listed first in pindx. So the error I was seeing in the sigP gradient seems to occur due to its position in the vector of optimizable parameters, rather than an error in the gradient equations for sigP.

Initially, I thought this might be caused by the way we pack the gradients in Cx, but I couldn't find any issues with the ordering of the columns in Cx.

I now suspect the issue is due to how we set the initial guess for the Cmodel solution in the global variable GC.

GHtest results for different parameter pairs

Q10P & sigC on

Tunable parameters 
-----------------------------------------------
current Q10P    is   2.28e+00 
current sigC    is   9.00e-02

gradient relative error (1): -1.791e-03  
-1.130e-02   4.472e-04  

gradient relative error (2): -1.112e-14  
 2.871e-13   6.831e-15

kdP & alpha & sigC

Tunable parameters 
-----------------------------------------------
current kdP     is   1.86e-08 
current alpha   is   2.45e-08 
current sigC    is   9.00e-02

gradient relative error (1): -1.474e-03  
-3.571e-03  -2.646e-02   4.932e-05  

gradient relative error (2):  2.760e-12  
 4.937e-12   6.135e-12   2.159e-12

gradient relative error (3): -6.179e-16  
 2.775e-14   1.367e-13   9.657e-15

alpha & sigC on

Tunable parameters 
-----------------------------------------------
current alpha   is   2.45e-08 
current sigC    is   9.00e-02

gradient relative error (1): -1.576e-03  
-1.146e-02   2.049e-04  

gradient relative error (2): -8.856e-15  
 4.220e-13   7.066e-15

sigP & bC

Tunable parameters 
-----------------------------------------------
current sigP    is   3.50e-01 
current bC      is   6.50e-01

gradient relative error (1): -2.185e-03  
 2.228e-03  -1.742e-01  

-- OUT_OF_MEMORY -- 
Restarting GHtest...

Tunable parameters 
-----------------------------------------------
current sigP    is   3.50e-01 
current bC      is   6.50e-01

gradient relative error (1): -1.746e-12  
 1.164e-13   1.694e-10

Possible Explanation

The error between the explicit gradient and the complex step method seems to only occur when the Newton-Armijo solver in eqCcycle does many iterations (e.g. stops at iter=18 as opposed to iter=1), suggesting that the problem results from the initial guess being far from the actual solution.

I think the problem may be related to resetting the global variable GC after solving Ceqn in eqCcycle. Resetting GC to the steady-state solution could explain why the error only occurs in the first parameter in x. For each of these tests, the first iteration reads in the initial fields from the initCO file from DATA/BGC_2023Nature. Because each test has a unique name, no output file (fname) exists yet, so the initial guesses do not get updated using par.fname.

When running neglogpost for the first parameter in the GHtest loop, GC is not very close to the Cmodel steady-state solution, so the Newton-Armijo solver in eqCcycle took 18 iterations to find a solution. Then the global variable GC is reset to this new solution. In every subsequent iteration through neglogpost, the global variable is very close to the actual solution. Thus, for the rest of the parameters tested with GHtest, the initial value of GC is already at the steady-state solution, and the Newton-Armijo solver in eqCcycle finds a solution after just one iteration.

I'm not entirely sure why having the initial guess for GC far from the solution would cause the explicit gradient to differ from the complex step method. Maybe the explicit gradients depend on the DIC values in the initial global variable GC, while the complex step method uses the updated GC to compute the gradient. I can look into this further later.

Conclusions So Far

The gradient equation for sigP seems correct, but the gradients only match the complex step method when the initial guess is near the steady-state solution. We should be able to avoid this issue by solving the steady state model once before running the GHtest. I don't know if this impacts the fminunc function's ability to minimize the objective function using the computed gradient and Hessian. Let me know if you have any ideas.