Closed mathomp4 closed 3 years ago
Any idea exactly what the cutoff time is to go non-zero diff?
Before staring too hard at code, hopefully @atrayano or @bena-nasa can conjecture what processes perhaps do not run until so far into the model? We should then be able to see what constants are used there.
Any idea exactly what the cutoff time is to go non-zero diff?
Well, my restarts are at 04-14 21z and budi was non-zero-diff at 04-20 0130z. If I look at the log, the only interesting thing for 00z on the 20th is:
...
AGCM Date: 2000/04/20 Time: 00:00:00 Throughput(days/day)[Avg Tot Run]: 1642.9 487.1 1645.9 TimeRemaining(Est) 000:04:16 24.0% : 10.6% Mem Comm:Used
EXTDATA: INFO: Updating R bracket for BC_BIOMASS
EXTDATA: INFO: ... file processed: ExtData/PIESA/sfc/QFED/v2.4r6/Y2000/M04/qfed2.emis_bc.005.20000421.nc4
EXTDATA: INFO: Updating R bracket for CO_BIOMASS
EXTDATA: INFO: ... file processed: ExtData/PIESA/sfc/QFED/v2.4r6/Y2000/M04/qfed2.emis_co.005.20000421.nc4
EXTDATA: INFO: Updating R bracket for EMI_NH3_BB
EXTDATA: INFO: ... file processed: ExtData/PIESA/sfc/QFED/v2.4r6/Y2000/M04/qfed2.emis_nh3.005.20000421.nc4
EXTDATA: INFO: Updating R bracket for OC_BIOMASS
EXTDATA: INFO: ... file processed: ExtData/PIESA/sfc/QFED/v2.4r6/Y2000/M04/qfed2.emis_oc.005.20000421.nc4
EXTDATA: INFO: Updating R bracket for SU_BIOMASS
EXTDATA: INFO: ... file processed: ExtData/PIESA/sfc/QFED/v2.4r6/Y2000/M04/qfed2.emis_so2.005.20000421.nc4
PCHEM::Run: Indices selected from climatology are 256 257
Updated CO2 in solar for year/day 2000/111 is 0.36857E-03
AGCM Date: 2000/04/20 Time: 00:15:00 Throughput(days/day)[Avg Tot Run]: 1642.8 1589.8 2300.7 TimeRemaining(Est) 000:04:15 24.0% : 10.6% Mem Comm:Used
AGCM Date: 2000/04/20 Time: 00:30:00 Throughput(days/day)[Avg Tot Run]: 1643.0 1799.1 1829.1 TimeRemaining(Est) 000:04:15 24.0% : 10.6% Mem Comm:Used
...
But QFED is updated every day as are the solar constants.
Note this seems to be true GOCART or GOCART.data.
Since single column run are fast, I am thinking we should write intermediate checkpoints often to bracket the first non-zero diff.
Before staring too hard at code, hopefully @atrayano or @bena-nasa can conjecture what processes perhaps do not run until so far into the model? We should then be able to see what constants are used there.
I really can't think of something that wouldn't happen until 5 days or so into the run. Certainly the inputs can change (ExtData inputs, ReadForcing inputs, the pchem species file, volcanoes) but I'm having a hard time seeing how those could function one way, then be processed in a differed way that causes non-zero diff.
I am thinking if the solar constant slowly evolves with time, this potentially could explain it. But this is a very long shot
On a related thought. Maybe there is always roundoff somewhere due to a small change in a constant. But that roundoff usually ends up back at 0-diff after more calculations, but not always. E.g., some larger number is accumulating a delta where the delta differs in the last bit. Usually that will not affect the sum because the larger number is, er, larger.
Not that this is particularly helpful tracking it down.
@Gvilla1000-nasa could you think about your changes and make an educated guess as to which variables are the most likely to have had an undetected roundoff change? On that thought, maybe write a short program that writes out all the constants into a file. And then another that does the same with the new constants. Has to either be binary, or you have to be very careful to capture all bits in formatted output.
Only variable I can really think of would be MAPL_EARTH_ECCENTRICITY
, MAPL_OMEGA
, and those places where I changed the hardcoded definition of PI to use MAPL_PI
or MAPL_PI_R8
. I'll write up a program to do as suggested and report back.
MAPL_OMEGA
does seem to be used in FV3:
$ rg omega @GEOSgcm_GridComp/GEOSagcm_GridComp/GEOSsuperdyn_GridComp/@FVdycoreCubed_GridComp/FV_StateMod.F90 -M0
77: public fv_getOmega
208: real(REAL8) :: omega ! angular velocity of earth's rotation
259: subroutine FV_RESET_CONSTANTS(FV_PI, FV_OMEGA, FV_CP, FV_RADIUS, FV_RGAS, &
262: real (REAL8), optional, intent(IN) :: FV_OMEGA
275: if (present(FV_OMEGA)) then
276: omega = FV_OMEGA
440: omega = MAPL_OMEGA ! angular velocity of earth's rotation
870: fv_atm(1)%gridstruct%fC(:,:) = 2.*MAPL_OMEGA*sin(FV_Atm(1)%flagstruct%deglat/180.*MAPL_PI_R8)
871: fv_atm(1)%gridstruct%f0(:,:) = 2.*MAPL_OMEGA*sin(FV_Atm(1)%flagstruct%deglat/180.*MAPL_PI_R8)
879: fv_atm(1)%gridstruct%fC(i,j) = 2.*MAPL_OMEGA*( -COS(FV_Atm(1)%gridstruct%grid(i,j,1))*COS(FV_Atm(1)%gridstruct%grid(i,j,2))*SIN(GRID%f_coriolis_angle) + &
885: fv_atm(1)%gridstruct%f0(i,j) = 2.*MAPL_OMEGA*( -COS(FV_Atm(1)%gridstruct%agrid(i,j,1))*COS(FV_Atm(1)%gridstruct%agrid(i,j,2))*SIN(GRID%f_coriolis_angle) + &
3418:subroutine fv_getOmega(omga)
3428:end subroutine fv_getOmega
3701: 2.*MAPL_OMEGA*COS(FV_Atm(1)%gridstruct%agrid(i,j,2))
4229:! logical :: use_old_omega = .true.
4265: call WRITE_PARALLEL ( FV_Atm(1)%flagstruct%nf_omega ,format='("FV3 nf_omega: ",(I7))' )```
One note: All I can say is that the before-constant-refactor code is zero-diff after 10 days. If we have some month-long thing...
I suppose I should now look at longer runs in my nightly tests. If we can pin down what caused this, maybe a...1-week run of GEOS at c24 comparing stock to MAPL develop
is a good thing. We could afford it on Discover.
Since single column run are fast, I am thinking we should write intermediate checkpoints often to bracket the first non-zero diff.
That could be doable. I can probably look at doing it next week. Would need to brush up on my AGCM.rc RECORD knowledge :)
Once we have an approximate time it goes non-zero diff hopefully then we can stop a step or two before and from that the executables still go non-zero diff from that time.
Based on the tests I performed, it started going non-zero diff after 5days and 5 hours. I ran the model for 5 days and 3hrs and used the output of that time as the new restart time and ran it for 1hr and it was zero-diff. When I ran it for 2hrs it started being non-zero diff matching the 5days 5hr initial run.
Gian, thanks for bracketing the issue. So it looks like the last known zero-diff is at 5days and 4 hours. I would suggest we continue the bracketing at much finer scale, i.e. at the heartbeat of the clock. If you have all the checkpoints, you should start at 5days 4 hours (otherwise use your 5days 3 hours restarts). You could either run each iteration for only one time step. Or you you run for an hour, but dump intermediate checkpoints every time step. The idea is to identify the last time when the restarts were still zero diff. It still might be after 5days 5 hours, but there is no guarantee it does not happen between the 4th and 5th hour. Let me know if you need instructions how to dump intermediate checkpoints.
Instructions on dumping intermediate checkpoints would be helpful. I saw in the gcm_run.j about them being dumped into the restarts directory but that was empty when I checked.
Gian, Uncomment these lines in your AGCM.rc:
#RECORD_FINAL: >>>RECFINL<<<
#RECORD_FREQUENCY: 000000 000000
#RECORD_REF_DATE: >>>REFDATE<<< >>>FCSDATE<<<
#RECORD_REF_TIME: >>>REFTIME<<< >>>FCSTIME<<<
these allow you specify a list of reference date and times you to output checkpoints during the run. So if you wanted them say once a day from the start time, lets say you started the model at 04012005 210000 you would set this:
RECORD_FINAL: YES
RECORD_FREQUENCY: 240000
RECORD_REF_DATE: 04012005
RECORD_REF_TIME: 210000
if you want them at other frequencies you can add to the list. Note, I also think you can omit the REF_DATE/REF_TIME and it would use the start time of the model, but I'm not 100% sure about this.
For what you want, the RECORD_FREQUENCY would be the model heartbeat that is set in the CAP.rc
Thanks! This situation is getting more complicated by the minute haha.
I started at day 5 hour 4 and started outputting intermediate checkpoints every 10mins (so I increased the heartbeat from default 900 to 600). This resulted in non-zero diffs beginning to show up after 1hr. This matches the day 5 hour 5 shown before (yay).
However, when I increased the heartbeat to 300 so I could see if the problem started perhaps between 50min to the hour mark, it resulted in non-zero diffs starting from 30mins into the run.
I should also mention, that I also performed a new test after doing @tclune suggestion of writing the small program to see the differences in constants. Doing that resulted in the following variables being different (MAPL_EARTH_ECCENTRICITY
, MAPL_DEG_PER_KM
, MAPL_KM_PER_DEG
, MAPL_OMEGA_R8
). Most if not all of them are isolated to the NominalOrbits module. Reverting these variables to their previous state and re-running a test case kept resulting in differences popping up at the same time of 5 days and 5hrs. Thus, it doesn't look like the physical constants are the culprit... My next thought is perhaps one of the UNDEFINED
internal constants.
NominalOrbits is not used so can't be the culprit as you showed.
5d 5h is so weird. I mean, I suppose 125 (hours) is...5 cubed? So baffling. What is it about that point in time? Such a mystery.
Solved and closing!
So, this is interesting. Last night, my tests showed that SCM with MAPL
develop
(1726321964680cdfdc43eed418f502937076b649) is now non-zero-diff with MAPL v2.8.0. Odd. I did some more tests and found something interesting. The model itself is non-zero-diff as well, but you have to run more than one day. Indeed, it seems you have to run a while. I set up some 10-day runs (I was going to do 1 week but my makeoneday script has a 10day option) and:or with our friend prog:
Now, to check, I also did some experiments with the previous MAPL
develop
(6bf423d3ff4c9dee2a3e3e2cb908ccf5b82b01e5) and it is zero-diff.So something in https://github.com/GEOS-ESM/MAPL/compare/6bf423d3ff4c9dee2a3e3e2cb908ccf5b82b01e5...1726321964680cdfdc43eed418f502937076b649 seems to cause the model to be non-zero-diff after ... 5, 6 days?? And the only change in there that should affect GEOS is Constants refactor.
I'm assigning pretty much the whole team on this (@tclune , @bena-nasa , @atrayano) as well as @Gvilla1000-nasa as I have no idea what process in GEOS only happens after this long. ExtData??