GEOS-ESM / MAPL

MAPL is a foundation layer of the GEOS architecture, whose original purpose is to supplement the Earth System Modeling Framework (ESMF)
https://geos-esm.github.io/MAPL/
Apache License 2.0
27 stars 18 forks source link

Multi-day GEOS Runs are non-zero-diff with latest MAPL develop #961

Closed mathomp4 closed 3 years ago

mathomp4 commented 3 years ago

So, this is interesting. Last night, my tests showed that SCM with MAPL develop (1726321964680cdfdc43eed418f502937076b649) is now non-zero-diff with MAPL v2.8.0. Odd. I did some more tests and found something interesting. The model itself is non-zero-diff as well, but you have to run more than one day. Indeed, it seems you have to run a while. I set up some 10-day runs (I was going to do 1 week but my makeoneday script has a 10day option) and:

$ cmphistory stock-gcm-2021Aug13-1week-c24/ mapldev-gcm-2021Aug13-1week-c24/
Comparing geosgcm_buda.20000414_2230z...
Success!
Comparing geosgcm_buda.20000415_0130z...
Success!
Comparing geosgcm_buda.20000415_0430z...
Success!
Comparing geosgcm_buda.20000415_0730z...
Success!
...
Comparing geosgcm_buda.20000419_2230z...
Success!
Comparing geosgcm_buda.20000420_0130z...
Failure!
Comparing geosgcm_buda.20000420_0430z...
Failure!
...

or with our friend prog:

Comparing geosgcm_prog.20000415_0000z...
Success!
Comparing geosgcm_prog.20000415_0600z...
Success!
Comparing geosgcm_prog.20000415_1200z...
Success!
Comparing geosgcm_prog.20000415_1800z...
Success!
...
Comparing geosgcm_prog.20000419_1800z...
Success!
Comparing geosgcm_prog.20000420_0000z...
Success!
Comparing geosgcm_prog.20000420_0600z...
Failure!
Comparing geosgcm_prog.20000420_1200z...
Failure!
Comparing geosgcm_prog.20000420_1800z...
Failure!
...

Now, to check, I also did some experiments with the previous MAPL develop (6bf423d3ff4c9dee2a3e3e2cb908ccf5b82b01e5) and it is zero-diff.

So something in https://github.com/GEOS-ESM/MAPL/compare/6bf423d3ff4c9dee2a3e3e2cb908ccf5b82b01e5...1726321964680cdfdc43eed418f502937076b649 seems to cause the model to be non-zero-diff after ... 5, 6 days?? And the only change in there that should affect GEOS is Constants refactor.

I'm assigning pretty much the whole team on this (@tclune , @bena-nasa , @atrayano) as well as @Gvilla1000-nasa as I have no idea what process in GEOS only happens after this long. ExtData??

bena-nasa commented 3 years ago

Any idea exactly what the cutoff time is to go non-zero diff?

tclune commented 3 years ago

Before staring too hard at code, hopefully @atrayano or @bena-nasa can conjecture what processes perhaps do not run until so far into the model? We should then be able to see what constants are used there.

mathomp4 commented 3 years ago

Any idea exactly what the cutoff time is to go non-zero diff?

Well, my restarts are at 04-14 21z and budi was non-zero-diff at 04-20 0130z. If I look at the log, the only interesting thing for 00z on the 20th is:

...
 AGCM Date: 2000/04/20  Time: 00:00:00  Throughput(days/day)[Avg Tot Run]:   1642.9    487.1   1645.9  TimeRemaining(Est) 000:04:16   24.0% :  10.6% Mem Comm:Used
        EXTDATA: INFO: Updating R bracket for BC_BIOMASS
        EXTDATA: INFO:  ... file processed: ExtData/PIESA/sfc/QFED/v2.4r6/Y2000/M04/qfed2.emis_bc.005.20000421.nc4
        EXTDATA: INFO: Updating R bracket for CO_BIOMASS
        EXTDATA: INFO:  ... file processed: ExtData/PIESA/sfc/QFED/v2.4r6/Y2000/M04/qfed2.emis_co.005.20000421.nc4
        EXTDATA: INFO: Updating R bracket for EMI_NH3_BB
        EXTDATA: INFO:  ... file processed: ExtData/PIESA/sfc/QFED/v2.4r6/Y2000/M04/qfed2.emis_nh3.005.20000421.nc4
        EXTDATA: INFO: Updating R bracket for OC_BIOMASS
        EXTDATA: INFO:  ... file processed: ExtData/PIESA/sfc/QFED/v2.4r6/Y2000/M04/qfed2.emis_oc.005.20000421.nc4
        EXTDATA: INFO: Updating R bracket for SU_BIOMASS
        EXTDATA: INFO:  ... file processed: ExtData/PIESA/sfc/QFED/v2.4r6/Y2000/M04/qfed2.emis_so2.005.20000421.nc4

 PCHEM::Run: Indices selected from climatology are          256         257

 Updated CO2 in solar for year/day 2000/111 is  0.36857E-03

 AGCM Date: 2000/04/20  Time: 00:15:00  Throughput(days/day)[Avg Tot Run]:   1642.8   1589.8   2300.7  TimeRemaining(Est) 000:04:15   24.0% :  10.6% Mem Comm:Used
 AGCM Date: 2000/04/20  Time: 00:30:00  Throughput(days/day)[Avg Tot Run]:   1643.0   1799.1   1829.1  TimeRemaining(Est) 000:04:15   24.0% :  10.6% Mem Comm:Used
...

But QFED is updated every day as are the solar constants.

mathomp4 commented 3 years ago

Note this seems to be true GOCART or GOCART.data.

atrayano commented 3 years ago

Since single column run are fast, I am thinking we should write intermediate checkpoints often to bracket the first non-zero diff.

bena-nasa commented 3 years ago

Before staring too hard at code, hopefully @atrayano or @bena-nasa can conjecture what processes perhaps do not run until so far into the model? We should then be able to see what constants are used there.

I really can't think of something that wouldn't happen until 5 days or so into the run. Certainly the inputs can change (ExtData inputs, ReadForcing inputs, the pchem species file, volcanoes) but I'm having a hard time seeing how those could function one way, then be processed in a differed way that causes non-zero diff.

atrayano commented 3 years ago

I am thinking if the solar constant slowly evolves with time, this potentially could explain it. But this is a very long shot

tclune commented 3 years ago

On a related thought. Maybe there is always roundoff somewhere due to a small change in a constant. But that roundoff usually ends up back at 0-diff after more calculations, but not always. E.g., some larger number is accumulating a delta where the delta differs in the last bit. Usually that will not affect the sum because the larger number is, er, larger.

Not that this is particularly helpful tracking it down.

@Gvilla1000-nasa could you think about your changes and make an educated guess as to which variables are the most likely to have had an undetected roundoff change? On that thought, maybe write a short program that writes out all the constants into a file. And then another that does the same with the new constants. Has to either be binary, or you have to be very careful to capture all bits in formatted output.

Gvilla1000-nasa commented 3 years ago

Only variable I can really think of would be MAPL_EARTH_ECCENTRICITY, MAPL_OMEGA, and those places where I changed the hardcoded definition of PI to use MAPL_PI or MAPL_PI_R8. I'll write up a program to do as suggested and report back.

mathomp4 commented 3 years ago

MAPL_OMEGA does seem to be used in FV3:


$ rg omega @GEOSgcm_GridComp/GEOSagcm_GridComp/GEOSsuperdyn_GridComp/@FVdycoreCubed_GridComp/FV_StateMod.F90 -M0
77:  public fv_getOmega
208:    real(REAL8)                             :: omega    ! angular velocity of earth's rotation
259: subroutine FV_RESET_CONSTANTS(FV_PI, FV_OMEGA, FV_CP, FV_RADIUS, FV_RGAS, &
262:   real (REAL8), optional, intent(IN) :: FV_OMEGA
275:   if (present(FV_OMEGA)) then
276:      omega = FV_OMEGA
440:    omega  = MAPL_OMEGA    ! angular velocity of earth's rotation
870:     fv_atm(1)%gridstruct%fC(:,:) = 2.*MAPL_OMEGA*sin(FV_Atm(1)%flagstruct%deglat/180.*MAPL_PI_R8)
871:     fv_atm(1)%gridstruct%f0(:,:) = 2.*MAPL_OMEGA*sin(FV_Atm(1)%flagstruct%deglat/180.*MAPL_PI_R8)
879:           fv_atm(1)%gridstruct%fC(i,j) = 2.*MAPL_OMEGA*( -COS(FV_Atm(1)%gridstruct%grid(i,j,1))*COS(FV_Atm(1)%gridstruct%grid(i,j,2))*SIN(GRID%f_coriolis_angle) + &
885:           fv_atm(1)%gridstruct%f0(i,j) = 2.*MAPL_OMEGA*( -COS(FV_Atm(1)%gridstruct%agrid(i,j,1))*COS(FV_Atm(1)%gridstruct%agrid(i,j,2))*SIN(GRID%f_coriolis_angle) + &
3418:subroutine fv_getOmega(omga)
3428:end subroutine fv_getOmega
3701:                                         2.*MAPL_OMEGA*COS(FV_Atm(1)%gridstruct%agrid(i,j,2))
4229:!   logical :: use_old_omega = .true.
4265:   call WRITE_PARALLEL ( FV_Atm(1)%flagstruct%nf_omega ,format='("FV3 nf_omega: ",(I7))' )```
mathomp4 commented 3 years ago

One note: All I can say is that the before-constant-refactor code is zero-diff after 10 days. If we have some month-long thing...

I suppose I should now look at longer runs in my nightly tests. If we can pin down what caused this, maybe a...1-week run of GEOS at c24 comparing stock to MAPL develop is a good thing. We could afford it on Discover.

mathomp4 commented 3 years ago

Since single column run are fast, I am thinking we should write intermediate checkpoints often to bracket the first non-zero diff.

That could be doable. I can probably look at doing it next week. Would need to brush up on my AGCM.rc RECORD knowledge :)

bena-nasa commented 3 years ago

Once we have an approximate time it goes non-zero diff hopefully then we can stop a step or two before and from that the executables still go non-zero diff from that time.

Gvilla1000-nasa commented 3 years ago

Based on the tests I performed, it started going non-zero diff after 5days and 5 hours. I ran the model for 5 days and 3hrs and used the output of that time as the new restart time and ran it for 1hr and it was zero-diff. When I ran it for 2hrs it started being non-zero diff matching the 5days 5hr initial run.

atrayano commented 3 years ago

Gian, thanks for bracketing the issue. So it looks like the last known zero-diff is at 5days and 4 hours. I would suggest we continue the bracketing at much finer scale, i.e. at the heartbeat of the clock. If you have all the checkpoints, you should start at 5days 4 hours (otherwise use your 5days 3 hours restarts). You could either run each iteration for only one time step. Or you you run for an hour, but dump intermediate checkpoints every time step. The idea is to identify the last time when the restarts were still zero diff. It still might be after 5days 5 hours, but there is no guarantee it does not happen between the 4th and 5th hour. Let me know if you need instructions how to dump intermediate checkpoints.

Gvilla1000-nasa commented 3 years ago

Instructions on dumping intermediate checkpoints would be helpful. I saw in the gcm_run.j about them being dumped into the restarts directory but that was empty when I checked.

bena-nasa commented 3 years ago

Gian, Uncomment these lines in your AGCM.rc:

#RECORD_FINAL:  >>>RECFINL<<<
#RECORD_FREQUENCY: 000000       000000
#RECORD_REF_DATE: >>>REFDATE<<< >>>FCSDATE<<<
#RECORD_REF_TIME: >>>REFTIME<<< >>>FCSTIME<<<

these allow you specify a list of reference date and times you to output checkpoints during the run. So if you wanted them say once a day from the start time, lets say you started the model at 04012005 210000 you would set this:

RECORD_FINAL:  YES
RECORD_FREQUENCY: 240000
RECORD_REF_DATE: 04012005
RECORD_REF_TIME: 210000

if you want them at other frequencies you can add to the list. Note, I also think you can omit the REF_DATE/REF_TIME and it would use the start time of the model, but I'm not 100% sure about this.

For what you want, the RECORD_FREQUENCY would be the model heartbeat that is set in the CAP.rc

Gvilla1000-nasa commented 3 years ago

Thanks! This situation is getting more complicated by the minute haha.

I started at day 5 hour 4 and started outputting intermediate checkpoints every 10mins (so I increased the heartbeat from default 900 to 600). This resulted in non-zero diffs beginning to show up after 1hr. This matches the day 5 hour 5 shown before (yay).

However, when I increased the heartbeat to 300 so I could see if the problem started perhaps between 50min to the hour mark, it resulted in non-zero diffs starting from 30mins into the run.

Gvilla1000-nasa commented 3 years ago

I should also mention, that I also performed a new test after doing @tclune suggestion of writing the small program to see the differences in constants. Doing that resulted in the following variables being different (MAPL_EARTH_ECCENTRICITY, MAPL_DEG_PER_KM, MAPL_KM_PER_DEG, MAPL_OMEGA_R8). Most if not all of them are isolated to the NominalOrbits module. Reverting these variables to their previous state and re-running a test case kept resulting in differences popping up at the same time of 5 days and 5hrs. Thus, it doesn't look like the physical constants are the culprit... My next thought is perhaps one of the UNDEFINED internal constants.

bena-nasa commented 3 years ago

NominalOrbits is not used so can't be the culprit as you showed.

mathomp4 commented 3 years ago

5d 5h is so weird. I mean, I suppose 125 (hours) is...5 cubed? So baffling. What is it about that point in time? Such a mystery.

mathomp4 commented 3 years ago

Solved and closing!