E3SM-Project / E3SM

Energy Exascale Earth System Model source code. NOTE: use "maint" branches for your work. Head of master is not validated.
https://docs.e3sm.org/E3SM
Other
346 stars 353 forks source link

In F2010 and F1850, RUN_STARTDATE is a climate-changing variable #5897

Open mahf708 opened 1 year ago

mahf708 commented 1 year ago

Steps to reproduce in production (done on 78a5ea1):

  1. Take an F2010 run script (e.g., this)
  2. In one, set START_DATE to 0001-01-01
  3. In another, set START_DATE to 2009-01-01
  4. Verify diff (note WALLTIME was adjusted to 04:00:00 for the first experiment before submission)
    22c22
    < readonly CASE_NAME="F2010_test_no_volc_nml_year2009"
    ---
    > readonly CASE_NAME="F2010_test_no_volc_nml"
    32c32
    < readonly START_DATE="2009-01-01"
    ---
    > readonly START_DATE="0001-01-01"
    85c85
    <   readonly WALLTIME="04:00:00"
    ---
    >   readonly WALLTIME="01:00:00"
    654d653
    < 

The diff of the start of atm logs with a shorter snippet below: https://gist.github.com/mahf708/88627768a589ea4f118819c8141d56f9

diff of atm logs

```diff 34c234 < ********** CASE = F2010_test_no_volc_nml ********** --- > ********** CASE = F2010_test_no_volc_nml_year2009 ********** 250c250 < Start date (yr mon day tod): 1 1 1 --- > Start date (yr mon day tod): 2009 1 1 252c252 < Stop date (yr mon day tod): 12 1 1 --- > Stop date (yr mon day tod): 2020 1 1 254c254 < Reference date (yr mon day tod): 1 1 1 --- > Reference date (yr mon day tod): 2009 1 1 257c257 < Current date (yr mon day tod): 1 1 1 --- > Current date (yr mon day tod): 2009 1 1 3213c3213 < chlorine_loading_advance: date, loading : 0001-01-01-00000, 3.286484 --- > chlorine_loading_advance: date, loading : 2009-01-01-00000, 3.286484 10344c10344 < chem_surfvals_set: ncdate= 10101 co2vmr= 3.887170000000000E-004 --- > chem_surfvals_set: ncdate= 20090101 co2vmr= 3.887170000000000E-004 10678c10678 < WSHIST: writing time sample 0 to h-file 2 DATE=0001/01/01 NCSEC= 0 --- > WSHIST: writing time sample 0 to h-file 2 DATE=2009/01/01 NCSEC= 0 10680,10682c10680,10684 < WSHIST: nhfil( 2 )=F2010_test_no_volc_nml.eam.h1.0001-01-01-00000.nc < Opening netcdf history file F2010_test_no_volc_nml.eam.h1.0001-01-01-00000.nc < Opened file F2010_test_no_volc_nml.eam.h1.0001-01-01-00000.nc to write --- > WSHIST: nhfil( 2 )= > F2010_test_no_volc_nml_year2009.eam.h1.2009-01-01-00000.nc > Opening netcdf history file > F2010_test_no_volc_nml_year2009.eam.h1.2009-01-01-00000.nc > Opened file F2010_test_no_volc_nml_year2009.eam.h1.2009-01-01-00000.nc to write 10686c10688 < WSHIST: writing time sample 0 to h-file 3 DATE=0001/01/01 NCSEC= 0 --- > WSHIST: writing time sample 0 to h-file 3 DATE=2009/01/01 NCSEC= 0 10688,10690c10690,10694 < WSHIST: nhfil( 3 )=F2010_test_no_volc_nml.eam.h2.0001-01-01-00000.nc < Opening netcdf history file F2010_test_no_volc_nml.eam.h2.0001-01-01-00000.nc < Opened file F2010_test_no_volc_nml.eam.h2.0001-01-01-00000.nc to write --- > WSHIST: nhfil( 3 )= > F2010_test_no_volc_nml_year2009.eam.h2.2009-01-01-00000.nc > Opening netcdf history file > F2010_test_no_volc_nml_year2009.eam.h2.2009-01-01-00000.nc > Opened file F2010_test_no_volc_nml_year2009.eam.h2.2009-01-01-00000.nc to write 10693c10697 < WSHIST: writing time sample 0 to h-file 4 DATE=0001/01/01 NCSEC= 0 --- > WSHIST: writing time sample 0 to h-file 4 DATE=2009/01/01 NCSEC= 0 10695,10697c10699,10703 < WSHIST: nhfil( 4 )=F2010_test_no_volc_nml.eam.h3.0001-01-01-00000.nc < Opening netcdf history file F2010_test_no_volc_nml.eam.h3.0001-01-01-00000.nc < Opened file F2010_test_no_volc_nml.eam.h3.0001-01-01-00000.nc to write --- > WSHIST: nhfil( 4 )= > F2010_test_no_volc_nml_year2009.eam.h3.2009-01-01-00000.nc > Opening netcdf history file > F2010_test_no_volc_nml_year2009.eam.h3.2009-01-01-00000.nc > Opened file F2010_test_no_volc_nml_year2009.eam.h3.2009-01-01-00000.nc to write 10700c10706 < WSHIST: writing time sample 0 to h-file 5 DATE=0001/01/01 NCSEC= 0 --- > WSHIST: writing time sample 0 to h-file 5 DATE=2009/01/01 NCSEC= 0 10702,10704c10708,10712 < WSHIST: nhfil( 5 )=F2010_test_no_volc_nml.eam.h4.0001-01-01-00000.nc < Opening netcdf history file F2010_test_no_volc_nml.eam.h4.0001-01-01-00000.nc < Opened file F2010_test_no_volc_nml.eam.h4.0001-01-01-00000.nc to write --- > WSHIST: nhfil( 5 )= > F2010_test_no_volc_nml_year2009.eam.h4.2009-01-01-00000.nc > Opening netcdf history file > F2010_test_no_volc_nml_year2009.eam.h4.2009-01-01-00000.nc > Opened file F2010_test_no_volc_nml_year2009.eam.h4.2009-01-01-00000.nc to write 10736c10744 < qv( 17)= 0.0000000000000000E+00 0.1915780446814151E-06 0.1834820218425010E-03 --- > qv( 17)= 0.0000000000000000E+00 0.1915780446814151E-06 0.1834820218425011E-03 10744,10746c10752,10754 < qv( 25)= 0.2690039768086822E-15 0.1724747405141781E-07 0.5063169060104195E-03 < qv( 26)= 0.4455513862534932E-16 0.1592551714958896E-07 0.1255561218290996E-03 < qv( 27)= 0.0000000000000000E+00 0.2956462582015114E-07 0.4475410065504301E-04 --- > qv( 25)= 0.2690039768086822E-15 0.1724747405141785E-07 0.5063169060104195E-03 > qv( 26)= 0.4455513862534932E-16 0.1592551714958900E-07 0.1255561218290996E-03 > qv( 27)= 0.0000000000000000E+00 0.2956462582015133E-07 0.4475410065504302E-04 10751c10759 < qv( 32)= 0.1869166552741963E-16 0.1505877065235611E-07 0.9959232728169855E-04 --- > qv( 32)= 0.1869166552741963E-16 0.1505877065235573E-07 0.9959232728169855E-04 10753,10754c10761,10762 < qv( 34)= 0.0000000000000000E+00 0.1238322082219041E-06 0.7284766557667819E-04 < qv( 35)= 0.0000000000000000E+00 0.1943175807384527E-06 0.6787432057585720E-04 --- > qv( 34)= 0.0000000000000000E+00 0.1238322082218666E-06 0.7284766557667793E-04 > qv( 35)= 0.0000000000000000E+00 0.1943175807384439E-06 0.6787432057585737E-04 10756c10764 < qv( 37)= 0.0000000000000000E+00 0.1474878429382859E-07 0.7517829343309258E-05 --- > qv( 37)= 0.0000000000000000E+00 0.1474878429382813E-07 0.7517829343309341E-05 10764,10776c10772,10784 < qv( 45)= 0.0000000000000000E+00 0.5068296109988129E-08 0.2759228652183142E-04 < qv( 46)= 0.2191339308137178E-20 0.1963561000976356E-06 0.3243422512034228E-03 < qv( 47)= 0.0000000000000000E+00 0.4638231528358731E-09 0.4822434525974767E-06 < qv( 48)= 0.0000000000000000E+00 0.3158315480417403E-06 0.1917266828724892E-03 < qv( 49)= 0.2315127743658109E-23 0.6006248269595660E-07 0.1204828322182552E-03 < qv( 50)= 0.7572197587567722E-22 0.2037030797047816E-07 0.6811729842309602E-04 < qv( 51)= 0.0000000000000000E+00 0.2825454884955962E-09 0.4560204117498490E-05 < qv( 52)= 0.0000000000000000E+00 0.1111626274560953E-08 0.1243698046534500E-04 < qv( 53)= 0.3129344538902286E-26 0.7476074247379550E-08 0.5144196759571508E-04 < qv( 54)= 0.5969168906385217E-20 0.1600635694817379E-07 0.8501734607568417E-04 < qv( 55)= 0.1224888580310986E-15 0.2481713009214718E-07 0.1939361380241828E-03 < qv( 56)= 0.4811573798262957E-18 0.1285645310793218E-07 0.3005787393812242E-03 < qv( 57)= 0.6729270980557783E-18 0.1815429696434390E-08 0.2931609932237145E-04 --- > qv( 45)= 0.0000000000000000E+00 0.5068296109988118E-08 0.2759228652183142E-04 > qv( 46)= 0.2191339308137178E-20 0.1963561000976356E-06 0.3243422512034227E-03 > qv( 47)= 0.0000000000000000E+00 0.4638231528358731E-09 0.4822434525974766E-06 > qv( 48)= 0.0000000000000000E+00 0.3158315480417385E-06 0.1917266828724892E-03 > qv( 49)= 0.2315127743658109E-23 0.6006248269595643E-07 0.1204828322183100E-03 > qv( 50)= 0.7572197587567722E-22 0.2037030797047816E-07 0.6811729842303346E-04 > qv( 51)= 0.0000000000000000E+00 0.2825454884955962E-09 0.4560204117498489E-05 > qv( 52)= 0.0000000000000000E+00 0.1111626274560953E-08 0.1243698046534483E-04 > qv( 53)= 0.3129344538902286E-26 0.7476074247379579E-08 0.5144196759571649E-04 > qv( 54)= 0.5969168906385217E-20 0.1600635694817341E-07 0.8501734607568424E-04 > qv( 55)= 0.1224888580310986E-15 0.2481713009214718E-07 0.1939361380241852E-03 > qv( 56)= 0.4811573798262957E-18 0.1285645310793218E-07 0.3005787393812239E-03 > qv( 57)= 0.6729270980557783E-18 0.1815429696434393E-08 0.2931609932237174E-04 10779c10787 < qv( 60)= 0.1377066822007143E-18 0.4917096222287206E-07 0.1657727773871236E-03 --- > qv( 60)= 0.1377066822007143E-18 0.4917096222287200E-07 0.1657727773871115E-03 10781,10785c10789,10793 < qv( 62)= 0.5786130801713453E-18 0.1875702607016171E-07 0.2084449676500296E-04 < qv( 63)= 0.2626338928148987E-19 0.1658764775014480E-06 0.6799434488051050E-03 < qv( 64)= 0.2711853761383372E-18 0.7313312249696860E-09 0.2452553175566627E-04 < qv( 65)= 0.9999983749122426E-36 0.9749741354215763E-08 0.3209747852630549E-04 < qv( 66)= 0.3008065825741959E-19 0.1781263385073317E-07 0.3550445119086209E-04 --- > qv( 62)= 0.5786130801713453E-18 0.1875702607016263E-07 0.2084449676501500E-04 > qv( 63)= 0.2626338928148987E-19 0.1658764775014477E-06 0.6799434488051084E-03 > qv( 64)= 0.2711853761383372E-18 0.7313312249696860E-09 0.2452553175566601E-04 > qv( 65)= 0.9999983749122426E-36 0.9749741354215763E-08 0.3209747852630602E-04 > qv( 66)= 0.3008065825741959E-19 0.1781263385071268E-07 0.3550445119085760E-04 10787c10795 < qv( 68)= 0.6124473061726041E-19 0.7507250287753063E-08 0.4964390737576436E-05 --- > qv( 68)= 0.6124473061726041E-19 0.7507250287753494E-08 0.4964390737580927E-05 10790,10791c10798,10799 < qv( 71)= 0.9331914728252885E-20 0.3576705100879151E-08 0.7775653887357698E-04 < qv( 72)= 0.6860296583400418E-19 0.2085632852451868E-10 0.4142201493590905E-06 --- > qv( 71)= 0.9331914728252885E-20 0.3576705100879154E-08 0.7775653887357699E-04 > qv( 72)= 0.6860296583400418E-19 0.2085632852451868E-10 0.4142201493590860E-06 10794c10802 < qv( 75)= 0.5035456861836775E-20 0.5973533416995098E-10 0.1572182392332986E-06 --- > qv( 75)= 0.5035456861836775E-20 0.5973533416995098E-10 0.1572182392332972E-06 10796,10798c10804,10806 < qv( 77)= 0.4064329953736765E-27 0.5787709536043546E-13 0.2526605671365271E-09 < qv( 78)= 0.2390293873498410E+01 0.1674681283137759E+11 0.2898491366155993E+15 < qv( 79)= 0.2752446462633644E+04 0.4405035291420130E+11 0.1160881272021377E+16 --- > qv( 77)= 0.4064329953736765E-27 0.5787709536043546E-13 0.2526605671365197E-09 > qv( 78)= 0.2390293873498410E+01 0.1674681283133666E+11 0.2898491366155886E+15 > qv( 79)= 0.2752446462633644E+04 0.4405035291420130E+11 0.1160881272021378E+16 10800c10808 < qv( 81)= 0.1160979429936086E+03 0.1431532082705426E+11 0.1673615045986110E+14 --- > qv( 81)= 0.1160979429936086E+03 0.1431532082705502E+11 0.1673615045987088E+14 10821c10829 < Q 11,Q diss, dQ^2/dt: 0.28768100873134E-09 kg/m^2 0.0000000E+00 -0.5528370E-28 --- > Q 11,Q diss, dQ^2/dt: 0.28768100873134E-09 kg/m^2 -0.2872155E-28 -0.5528370E-28 10827c10835 < Q 17,Q diss, dQ^2/dt: 0.14889656473446E-05 kg/m^2 -0.2352869E-24 -0.1360634E-19 --- > Q 17,Q diss, dQ^2/dt: 0.14889656473446E-05 kg/m^2 0.0000000E+00 -0.1360634E-19 10844c10852 < Q 34,Q diss, dQ^2/dt: 0.59116239533281E-06 kg/m^2 -0.5882173E-25 -0.2098049E-19 --- > Q 34,Q diss, dQ^2/dt: 0.59116239533281E-06 kg/m^2 0.0000000E+00 -0.2098049E-19 10858,10861c10866,10869 < Q 48,Q diss, dQ^2/dt: 0.15558714778144E-05 kg/m^2 0.0000000E+00 -0.5898807E-19 < Q 49,Q diss, dQ^2/dt: 0.97772411959660E-06 kg/m^2 0.0000000E+00 -0.7434098E-20 < Q 50,Q diss, dQ^2/dt: 0.55277523281802E-06 kg/m^2 0.0000000E+00 -0.1121537E-20 < Q 51,Q diss, dQ^2/dt: 0.37006281093104E-07 kg/m^2 0.0000000E+00 -0.3682996E-23 --- > Q 48,Q diss, dQ^2/dt: 0.15558714778144E-05 kg/m^2 0.1176435E-24 -0.5898807E-19 > Q 49,Q diss, dQ^2/dt: 0.97772411959704E-06 kg/m^2 0.0000000E+00 -0.7434098E-20 > Q 50,Q diss, dQ^2/dt: 0.55277523281751E-06 kg/m^2 0.0000000E+00 -0.1121537E-20 > Q 51,Q diss, dQ^2/dt: 0.37006281093104E-07 kg/m^2 0.7352717E-26 -0.3682996E-23 10863c10871 < Q 53,Q diss, dQ^2/dt: 0.41745410156632E-06 kg/m^2 0.0000000E+00 -0.1609623E-21 --- > Q 53,Q diss, dQ^2/dt: 0.41745410156633E-06 kg/m^2 0.0000000E+00 -0.1609623E-21 10865c10873 < Q 55,Q diss, dQ^2/dt: 0.15738013152294E-05 kg/m^2 0.0000000E+00 -0.2612722E-20 --- > Q 55,Q diss, dQ^2/dt: 0.15738013152295E-05 kg/m^2 0.0000000E+00 -0.2612722E-20 10870c10878 < Q 60,Q diss, dQ^2/dt: 0.13452542560611E-05 kg/m^2 0.0000000E+00 -0.6055291E-20 --- > Q 60,Q diss, dQ^2/dt: 0.13452542560610E-05 kg/m^2 0.0000000E+00 -0.6055291E-20 10872,10873c10880,10881 < Q 62,Q diss, dQ^2/dt: 0.16915411824880E-06 kg/m^2 0.0000000E+00 -0.3071824E-21 < Q 63,Q diss, dQ^2/dt: 0.55177745876206E-05 kg/m^2 0.0000000E+00 -0.1186406E-18 --- > Q 62,Q diss, dQ^2/dt: 0.16915411824889E-06 kg/m^2 0.0000000E+00 -0.3071824E-21 > Q 63,Q diss, dQ^2/dt: 0.55177745876206E-05 kg/m^2 -0.4705739E-24 -0.1186406E-18 10876c10884 < Q 66,Q diss, dQ^2/dt: 0.28812037070528E-06 kg/m^2 0.0000000E+00 -0.5253375E-21 --- > Q 66,Q diss, dQ^2/dt: 0.28812037070525E-06 kg/m^2 0.0000000E+00 -0.5253375E-21 10878c10886 < Q 68,Q diss, dQ^2/dt: 0.40286275429164E-07 kg/m^2 0.0000000E+00 -0.2238563E-22 --- > Q 68,Q diss, dQ^2/dt: 0.40286275429201E-07 kg/m^2 0.0000000E+00 -0.2238563E-22 10885c10893 < Q 75,Q diss, dQ^2/dt: 0.12758337574639E-08 kg/m^2 0.0000000E+00 -0.1953562E-25 --- > Q 75,Q diss, dQ^2/dt: 0.12758337574638E-08 kg/m^2 0.0000000E+00 -0.1953562E-25 10887,10888c10895,10896 < Q 77,Q diss, dQ^2/dt: 0.20503529508075E-11 kg/m^2 0.0000000E+00 -0.5432745E-31 < Q 78,Q diss, dQ^2/dt: 0.23521400243975E+13 kg/m^2 0.0000000E+00 -0.7970347E+15 --- > Q 77,Q diss, dQ^2/dt: 0.20503529508074E-11 kg/m^2 0.0000000E+00 -0.5432745E-31 > Q 78,Q diss, dQ^2/dt: 0.23521400243974E+13 kg/m^2 0.0000000E+00 -0.7970347E+15 10891c10899 < Q 81,Q diss, dQ^2/dt: 0.13581468556584E+12 kg/m^2 0.0000000E+00 -0.1948560E+15 --- > Q 81,Q diss, dQ^2/dt: 0.13581468556592E+12 kg/m^2 0.0000000E+00 -0.1948560E+15 10900c10908 < WSHIST: writing time sample 1 to h-file 5 DATE=0001/01/01 NCSEC= 3600 --- > WSHIST: writing time sample 1 to h-file 5 DATE=2009/01/01 NCSEC= 3600 10902,10904c10910,10912 < nstep, te 3 0.26322667852254233E+10 0.26322698364603581E+10 0.16870302957907709E-03 0.98542439725779899E+05 < nstep, te 4 0.26322488537125998E+10 0.26322511745672755E+10 0.12832034949092754E-03 0.98542359649151069E+05 < WSHIST: writing time sample 2 to h-file 5 DATE=0001/01/01 NCSEC= 7200 --- > nstep, te 3 0.26322667854541326E+10 0.26322698365469265E+10 0.16869517058992173E-03 0.98542439731073900E+05 > nstep, te 4 0.26322487869156752E+10 0.26322511035693631E+10 0.12808807614960881E-03 0.98542359796819539E+05 > WSHIST: writing time sample 2 to h-file 5 DATE=2009/01/01 NCSEC= 7200 10906,10908c10914,10916 ```


Steps to reproduce in a simple ne4 test (done on 78a5ea1):

  1. Get the 78a5ea1 code
  2. In cime/scripts, do ./create_test SMS.ne4_oQU480.F2010 -t F2010_test_0001 -p e3sm --debug --compiler=gnu or similar
  3. In cime/scripts, do ./create_test SMS.ne4_oQU480.F2010 -t F2010_test_2009 -p e3sm --debug --compiler=gnu or similar
  4. Go to the latter test's dir (F2010_test_2009) and do ./xmlchange RUN_STARTDATE=2009-01-01, then ./case.submit
  5. Compare the atm logs in the run directory (note to compare the second one in F2010_test_2009 because the first defaults to 0001-01-01)

The diff in the ne4 atm logs: https://gist.github.com/mahf708/16cbb8c127751c79a4160b66b321aecd

mahf708 commented 1 year ago
  1. Maybe "climate changing" is not accurate here because it is just a short F2010, but I expect it to be climate changing if tested rigorously.
  2. Bug exists on maint-2.0 (tested on 9dfef8b)
wlin7 commented 1 year ago

Can use single test REP_Ln9.ne4_oQU240.F2010.pm-cpu_intel.eam-outfrq9s to re-produce the problem (need to reset RUN_STARTDATE=2009-01-01 for case2)

Even FIDEAL has this STARTDATE sensitivity. FIDEAL is with the following configuration

[Corrected. The FIDEAL test is misleading. Though there is FAIL REP_Ld9.ne4_ne4.FIDEAL.pm-cpu_intel.allactive-pioroot1 COMPARE_base_rep2 from the test report, the FAIL is for other reason. The stats in atm.log are identical between 1st and 2nd REP runs with 0001-01-01 and 2009-01-09, respectively for RUN_STARTDATE. The test, on the other hand, suggests the issue is with the standard physics/chemistry.]

golaz commented 1 year ago

@mahf708 : did you check whether F1850 suffers from the same issue?

mahf708 commented 1 year ago

@mahf708 : did you check whether F1850 suffers from the same issue?

Yes, F1850 suffers from this issue in very similar ways as F2010. We are working behind the scenes to better understand it (as Wuyin pointed out, we currently think this might be in chemini)

wlin7 commented 1 year ago

Quick update: The issue has been isolated. The finding is after joint debugging efforts by @wlin7 , @singhbalwinder and @mahf708 . All are welcome to chime in whether we need a solution to eliminate such sensitivity.

The issue comes down to round-off errors in representing irrational numbers for model time and the ensuing calculations of interpolation weights. Below is a comparison of the calculations at model time step 1. fact1 is the 1st linear interpolation weight in subroutine interpolate_trcdata, computed as (datatime - modeltime)/deltat . deltat is the time interval between two source data points.

STARTDATE=2009-01-01   fact1 = (380.000000000000-365.083333333333)/31. = 0.481182795698925
STARTDATE=0001-01-01   fact1 = (733300.000000000-733285.083333333)/31. = 0.481182795697673

The floating numbers are taken from model printout as is. Offline calculation using shell's bc matches the results up to the10th decimal point. The data time and model time are relative to a reference time (STARTDATE). The model time is cast in the year over which the forcing is cycled -- which is 2010 in this case.

The representation of the model time is limited to the significant digits for the data type. The two start dates are far apart, causing the number of decimal points for the model time (base day + 2 / 24, time step is 2 hours in the test) to be different. If the start dates are less than 3 years apart, the simulations can still be BFB.

wlin7 commented 1 year ago

Correction: Closer start dates (to seemingly maintain consistent significant digits) still can not guarantee BFB. Tests with 2008-01-01 vs 2009-01-01, and the results differ. A simple Fortran code confirms this, emulating the case of 2008 vs 2009 with forcing data cycling over 2010.

  real*8 dt1,mt1,dt2,mt2,deltat
  deltat=31.0
  dt1=745.0
  dt2=380.0
  mt1=730.0+2./24.
  mt2=365.0+2./24.
  write(6,*)'(dt1-mt1)/deltat',(dt1-mt1)/deltat
  write(6,*)'(dt2-mt2)/deltat',(dt2-mt2)/deltat

With pm-cpu_intel, it gave

  (dt1-mt1)/deltat  0.481183451990927
  (dt2-mt2)/deltat  0.481182467552923
ndkeen commented 1 year ago

I'm not sure if this could explain all of your issues, but I've noticed that constants can behave differently if you don't specify d0. Try experimenting with:

program doop
  implicit none
  real(kind=8) ::  dt1,mt1,dt2,mt2,deltat
  real(kind=8) ::  q,r,s,t, nt1,nt2

  deltat=31.0d0
  dt1=745.0d0
  dt2=380.0d0
  mt1=730.0d0+2.0d0/24.0d0
  mt2=365.0d0+2.0d0/24.0d0

  q=2.0d0
  r=24.0d0
  s=730.0d0
  t=365.0d0
  nt1 = s + q/r
  nt2 = t + q/r

  write(*,'(5(a,es25.16))') " dt1=", dt1, " mt1=", mt1, " dt1-mt1=", dt1-mt1, " 2/24=", 2.0d0/24.0d0, " (dt1-mt1)/deltat=", (dt1-mt1)/deltat
  write(*,'(5(a,es25.16))') " dt2=", dt2, " mt2=", mt2, " dt2-mt2=", dt2-mt2, " 2/24=", 2.0d0/24.0d0, " (dt2-mt2)/deltat=", (dt2-mt2)/deltat
  write(*,*)
  write(*,'(5(a,es25.16))') " dt1=", dt1, " nt1=", nt1, " dt1-nt1=", dt1-nt1, " q/r= ", q/r,    " (dt1-nt1)/deltat=", (dt1-nt1)/deltat
  write(*,'(5(a,es25.16))') " dt2=", dt2, " nt2=", nt2, " dt2-nt2=", dt2-nt2, " q/r= ", q/r,    " (dt2-nt2)/deltat=", (dt2-nt2)/deltat

  !dt1=   7.4500000000000000E+02 mt1=   7.3008333333333337E+02 dt1-mt1=   1.4916666666666629E+01 2/24=   8.3333333333333329E-02 (dt1-mt1)/deltat=   4.8118279569892353E-01                                                                                                                                           
  !dt2=   3.8000000000000000E+02 mt2=   3.6508333333333331E+02 dt2-mt2=   1.4916666666666686E+01 2/24=   8.3333333333333329E-02 (dt2-mt2)/deltat=   4.8118279569892536E-01                                                                                                                                           

  !dt1=   7.4500000000000000E+02 nt1=   7.3008333333333337E+02 dt1-nt1=   1.4916666666666629E+01 q/r=    8.3333333333333329E-02 (dt1-nt1)/deltat=   4.8118279569892353E-01                                                                                                                                           
  !dt2=   3.8000000000000000E+02 nt2=   3.6508333333333331E+02 dt2-nt2=   1.4916666666666686E+01 q/r=    8.3333333333333329E-02 (dt2-nt2)/deltat=   4.8118279569892536E-01                                                                                                                                           

end program doop

I think there are flags that can control how FP constants are treated, but I've been trying some without much luck in this case. I know that some codes (like MPAS I think), are pedantic -- always using something like d0 for constants.

rljacob commented 1 year ago

precision of constants should always be specified in any of the E3SM Fortran code.

When you say "The model time is cast in the year over which the forcing is cycled" that means no matter what year "startdate" is, it will be sometime in 2010 as far as the external forcing (gases, SST, solar cycle) is concerned? I was never clear on how that worked.

wlin7 commented 1 year ago

Sorry, @ndkeen , @rljacob . The simple Fortran example may have become a distraction from what was intended to be conveyed. Thanks for the refined test with more consistent treatment of data type and constant. I did neglect that in the example. The message is unchanged: the round-off errors can lead to NBFB results in otherwise mathematically BFB expressions, with only difference in the base integer part of the expression.

And certainly, all the numbers shown on the initial update message are computed using proper data type (all r8) during the model run.

ndkeen commented 1 year ago

I experimented more and see that I was able to get something like 2./24. to yield same as 2.0d0/24.d0 if I did:

gfortran -Wconversion-extra -ffree-line-length-none -fdefault-double-8 -fdefault-real-8 deltat-FP-issue.f90                                                                                                                                                  
ifort -double-size 64 -real-size 64 deltat-FP-issue.f90                                                                                                                                                                                                      

Do we know which fortran files might contain the code that does this computation?

mahf708 commented 1 year ago

We mostly looked at components/eam/src/chemistry/utils/tracer_data.F90 and in particular this subroutine: https://github.com/E3SM-Project/E3SM/blob/908f55d9399d43f00d89dd619478618974f549c3/components/eam/src/chemistry/utils/tracer_data.F90#L1713C4-L1713C4

wlin7 commented 1 year ago

When you say "The model time is cast in the year over which the forcing is cycled" that means no matter what year "startdate" is, it will be sometime in 2010 as far as the external forcing (gases, SST, solar cycle) is concerned? I was never clear on how that worked.

@rljacob , effectively yes; and it is only for the model time viewed by the forcing file (file%curr_mod_time in traracer_data.F90). It is done only for cyclical forcing type. It make linear interpolation straightforward by having data times and this relative current model time in the same frame, regardless of actual model time (dictated by RUN_STARTDATE) and how far apart of the time from the CYCLE year for the forcing file. All those time values, are in units of days w.r.t. a reference time.

wlin7 commented 1 year ago
  > I experimented more and see that I was able to get something like `2./24.` to yield same as `2.0d0/24.d0` if I did:
gfortran -Wconversion-extra -ffree-line-length-none -fdefault-double-8 -fdefault-real-8 deltat-FP-issue.f90                                                                                                                                                  
ifort -double-size 64 -real-size 64 deltat-FP-issue.f90                                                                                                                                                                                                      

Cool you were able to dig out all these compilation options, @ndkeen . Thanks. In tracer_data.F90, one such calculation is file%next_mod_time = file%curr_mod_time + get_step_size()/86400._r8, to advance one step.

golaz commented 1 year ago

Thank you @mahf708 and @wlin7 for isolating this problem. It's very reassuring that it is not directly a problem with a time-varying dataset that should not have been time-varying. That was my major concern. We should not delay the next configuration of v3alpha because of this.

One possible fix would be to use integer arithmetic to keep track of time. Maybe express the time in seconds with long integer. But we may need finer time resolution. I also assume that this type of constructs are wide-spread throughout the code, so it would require extensive modifications.

I wonder if EAMxx suffers from the same issue; @ndkeen might know. If EAMxx doesn't have this problem, maybe we wait until we all transition to that new code base.