NOAA-GFDL / icebergs

GFDL Climate Model Icebergs

CM4 does not reproduce across a change in ice_layout, unless icebergs are off #13

Closed nikizadehgfdl closed 9 years ago

nikizadehgfdl commented 9 years ago

This is a very old issue which was first seen in ESM2 years ago.

The CM4 coupled model (using SIS2 and its old icebergs module) does not produce the same answers when ice_layout is changed. When I turn off the icebergs, the answers are bitwise identical across the ice_layout change.

This is in repro mode with make_exchange_reproduce=.true., though I think neither has an effect here.

I believe this issue persists if I swap SIS2 for SIS1, and I see no reason for it to go away with the new icebergs module either.

Here are the two configs that do not reproduce (ALL restart files differ) unless I turn off the bergs.
They differ only in ice_layout (72,4 vs 96,3) and the matching ice_io_layout (1,4 vs 1,3).

else if ( "$npes" == "2560" ) then
  set atmos_npes = "288"
  set atmos_nthreads = "2"
  set nxblocks = "4" ; set nyblocks = "2" 
  set fv_layout    =   "4,12";  set fv_io_layout    =  "1,4"
  set land_layout  =   "4,12";  set land_io_layout  =  "1,4"
  set ice_layout   =   "72,4";  set ice_io_layout   =  "1,4"
  set ocn_layout   =   "36,72"; set ocn_io_layout   =  "1,4"; set ocn_mask_table = "mask_table.622.36x72"
  set ocean_npes = "1970"

else if ( "$npes" == "2561" ) then
  set atmos_npes = "288"
  set atmos_nthreads = "2"
  set nxblocks = "4" ; set nyblocks = "2" 
  set fv_layout    =   "4,12";  set fv_io_layout    =  "1,4"
  set land_layout  =   "4,12";  set land_io_layout  =  "1,4"
  set ice_layout   =   "96,3";  set ice_io_layout   =  "1,3"
  set ocn_layout   =   "36,72"; set ocn_io_layout   =  "1,4"; set ocn_mask_table = "mask_table.622.36x72"
  set ocean_npes = "1970"
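
Both ice layouts use the same number of ice PEs (72 x 4 = 96 x 3 = 288), so only the shape of each PE's subdomain changes. A minimal sketch of that, assuming a hypothetical 1440 x 1080 ice grid and a simple even split. The grid size and the `block_extents` and `describe` helpers are illustrative assumptions; they are not taken from the run script or from the FMS decomposition code.

```python
def block_extents(n_cells, n_blocks):
    """Split n_cells into n_blocks contiguous half-open ranges whose sizes differ by at most one."""
    base, extra = divmod(n_cells, n_blocks)
    extents, start = [], 0
    for block in range(n_blocks):
        size = base + (1 if block < extra else 0)
        extents.append((start, start + size))
        start += size
    return extents


def describe(layout, nx, ny):
    """Summarize PE count and the size of the first block for a (px, py) layout."""
    px, py = layout
    xs = block_extents(nx, px)
    ys = block_extents(ny, py)
    return {
        "npes": px * py,
        "block_nx": xs[0][1] - xs[0][0],
        "block_ny": ys[0][1] - ys[0][0],
    }


NX, NY = 1440, 1080  # hypothetical grid size, not from the thread
a = describe((72, 4), NX, NY)
b = describe((96, 3), NX, NY)
print(a)
print(b)
```

Under these assumptions the same 288 PEs hold 20 x 270 blocks in one case and 15 x 360 blocks in the other, so any per-PE initialization sees a different set of cells.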

The experiments I tried are:

CM4_c96L32_am4g5r2_2000_sis2 which has the issue

/// /lustre/f1/Niki.Zadeh/ulm_201505_awg_v20150702_mom6sis2_2015.08.06b/CM4_c96L32_am4g5r2_2000_sis2/ncrc2.intel-repro-openmp/archive/1x0m10d_2560pe/restart/00010111.tar
\\\ /lustre/f1/Niki.Zadeh/ulm_201505_awg_v20150702_mom6sis2_2015.08.06b/CM4_c96L32_am4g5r2_2000_sis2/ncrc2.intel-repro-openmp/archive/1x0m10d_2561pe/restart/00010111.tar
DIFFER : ALL
    CROSSOVER   FAILED: CM4_c96L32_am4g5r2_2000_sis2

CM4_c96L32_am4g5r2_2000_sis2_nobergs which does not have the issue

/// /lustre/f1/Niki.Zadeh/ulm_201505_awg_v20150702_mom6sis2_2015.08.06b/CM4_c96L32_am4g5r2_2000_sis2_nobergs/ncrc2.intel-repro-openmp/archive/1x0m10d_2560pe/restart/00010111.tar
\\\ /lustre/f1/Niki.Zadeh/ulm_201505_awg_v20150702_mom6sis2_2015.08.06b/CM4_c96L32_am4g5r2_2000_sis2_nobergs/ncrc2.intel-repro-openmp/archive/1x0m10d_2561pe/restart/00010111.tar

    CROSSOVER   PASSED: CM4_c96L32_am4g5r2_2000_sis2_nobergs
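
For anyone without access to the comparison tool that produced the output above, a rough stand-in is to hash every member of the two restart tars. This is only a sketch: `member_digests` and `differing_members` are hypothetical helpers, and a byte-level hash is stricter than a per-variable NetCDF comparison, so it may flag files that differ only in metadata.

```python
import hashlib
import tarfile


def member_digests(tar_path):
    """Map each regular file in a tar archive to the SHA-256 of its contents."""
    digests = {}
    with tarfile.open(tar_path) as tf:
        for member in tf.getmembers():
            if member.isfile():
                # Reads the whole member into memory; fine for a sketch,
                # but large restart files would be better hashed in chunks.
                data = tf.extractfile(member).read()
                digests[member.name] = hashlib.sha256(data).hexdigest()
    return digests


def differing_members(tar_a, tar_b):
    """Return sorted member names that differ or exist in only one archive."""
    a = member_digests(tar_a)
    b = member_digests(tar_b)
    return sorted(name for name in a.keys() | b.keys() if a.get(name) != b.get(name))
```

Calling `differing_members` on the 2560pe and 2561pe `00010111.tar` files would list which restart files to inspect first.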
nikizadehgfdl commented 9 years ago

BTW the model does reproduce across an fv_layout change, an atmos_nthreads change, or an ocn_layout change (with no mask_table).

nikizadehgfdl commented 9 years ago

@underwoo wrote: "Looking at the stdout's from the CM4_c96L32_am4g5r2_2000_sis2 runs, the "Total Ice Mass|Salt|Heat" are all slightly different from the first print out. Points to something in the iceberg initialization. (The noberg runs do not show the difference in "Total Ice ..".)

Also, I don't see any iceberg restart files in the initCond file. Could you please run a test that uses iceberg restart files to see if that will reproduce across layout changes."

So, I did try that and the answers indeed reproduced across the same ice_layout change!

/// /lustre/f1/Niki.Zadeh/ulm_201505_awg_v20150702_mom6sis2_2015.08.06b/CM4_c96L32_am4g5r2_2000_sis2_withBergsInInitcond/ncrc2.intel-repro-openmp/archive/1x0m10d_2560pe/restart/00010121.tar
\\\ /lustre/f1/Niki.Zadeh/ulm_201505_awg_v20150702_mom6sis2_2015.08.06b/CM4_c96L32_am4g5r2_2000_sis2_withBergsInInitcond/ncrc2.intel-repro-openmp/archive/1x0m10d_2561pe/restart/00010121.tar

The only difference was:
Comparing icebergs.res.nc...
DIFFER : VARIABLE : lon : POSITION : 0 : VALUES : -265.452 <> -267.218

All I did was use one of the restart tars from a 10-day experiment (the 2560 one) as the initCond and repeat the runs.

Here are the stdouts:

/lustre/f1/Niki.Zadeh/ulm_201505_awg_v20150702_mom6sis2_2015.08.06b/CM4_c96L32_am4g5r2_2000_sis2_withBergsInInitcond/ncrc2.intel-repro-openmp/stdout/run/CM4_c96L32_am4g5r2_2000_sis2_withBergsInInitcond_1x0m10d_2560pe.o5041290 

/lustre/f1/Niki.Zadeh/ulm_201505_awg_v20150702_mom6sis2_2015.08.06b/CM4_c96L32_am4g5r2_2000_sis2_withBergsInInitcond/ncrc2.intel-repro-openmp/stdout/run/CM4_c96L32_am4g5r2_2000_sis2_withBergsInInitcond_1x0m10d_2561pe.o5041289

So, Seth, what do you make of this?

jwdGFDL commented 9 years ago

If I recall correctly, the initialization algorithm picks out the icebergs that a given rank owns. One possibility is a layout-dependent flaw that attributes an iceberg to multiple ranks, or to no rank at all, which would leave it out of the rest of the simulation.
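
A toy check for that kind of flaw, reduced to one dimension. It is a sketch only: `owners` and `unowned_or_shared` are hypothetical helpers, the rank boundaries are made up, and nothing here reflects how the icebergs module actually assigns bergs.

```python
def owners(x, boundaries, closed_right=False):
    """Return the ranks whose interval contains position x.

    Rank r spans boundaries[r] to boundaries[r + 1]. The default test is
    half-open, [lo, hi); closed_right=True uses [lo, hi] instead.
    """
    found = []
    for rank in range(len(boundaries) - 1):
        lo, hi = boundaries[rank], boundaries[rank + 1]
        inside = (lo <= x <= hi) if closed_right else (lo <= x < hi)
        if inside:
            found.append(rank)
    return found


def unowned_or_shared(bergs, boundaries, closed_right=False):
    """Map each berg position that is not owned by exactly one rank to its owner list."""
    problems = {}
    for x in bergs:
        ranks = owners(x, boundaries, closed_right)
        if len(ranks) != 1:
            problems[x] = ranks
    return problems


boundaries = [0.0, 90.0, 180.0, 270.0, 360.0]  # four hypothetical ranks
bergs = [10.0, 90.0, 360.0]                    # two bergs sit exactly on edges

half_open = unowned_or_shared(bergs, boundaries)
closed = unowned_or_shared(bergs, boundaries, closed_right=True)
print(half_open)  # the berg at 360.0 has no owner
print(closed)     # the berg at 90.0 has two owners
```

Either failure changes with the layout, because the set of edges a berg can land on moves when the decomposition does.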



adcroft commented 9 years ago

I've been unable to make an ice-ocean configuration fail reproducibility tests in which I seed every model cell with four bergs moving in the cardinal directions.

Looking at the logs @nikizadehgfdl provided, it looks like there is a difference in the calving restart checksum. How does this happen?

> grep restart_calv /lustre/f1/Niki.Zadeh/ulm_201505_awg_v20150702_mom6sis2_2015.08.06b/CM4_c96L32_am4g5r2_2000_sis2/ncrc2.intel-repro-openmp/stdout/run/CM4_c96L32_am4g5r2_2000_sis2_1x0m10d_2560pe.o7149337 /lustre/f1/Niki.Zadeh/ulm_201505_awg_v20150702_mom6sis2_2015.08.06b/CM4_c96L32_am4g5r2_2000_sis2/ncrc2.intel-repro-openmp/stdout/run/CM4_c96L32_am4g5r2_2000_sis2_1x0m10d_2561pe.o7149320
CM4_c96L32_am4g5r2_2000_sis2_1x0m10d_2560pe.o7149337:diamonds, grd_chksum3: read_restart_calvi chksum=           -1896008147 chksum2=           -1545844752 min= 0.000000000E+00 max= 7.399996075E+11 mean= 9.751374716E+10 rms= 1.634493874E+11 sd= 1.311745835E+11
CM4_c96L32_am4g5r2_2000_sis2_1x0m10d_2561pe.o7149320:diamonds, grd_chksum3: read_restart_calvi chksum=            -185424424 chksum2=            1140423430 min= 0.000000000E+00 max= 7.399992858E+11 mean= 9.750156607E+10 rms= 1.634237307E+11 sd= 1.311516694E+11

There is also this line:

< OCN(ATMOCNLND)=  0.354793438964402       0.354793438964402    0.354793438964402
> OCN(ATMOCNLND)=  0.354433472151885       0.354433472151885    0.354433472151885

which has nothing to do with icebergs.
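
To see which statistics disagree between the two grd_chksum3 lines above, the fields can be pulled apart and compared. A small sketch; `parse_stats` is a hypothetical helper, and the two strings are the log lines quoted above with the file-name prefix dropped.

```python
import re

# Matches "name= value" pairs such as "chksum=   -1896008147" or "max= 7.39E+11".
FIELD = re.compile(r"(\w+)=\s*(\S+)")


def parse_stats(line):
    """Extract name/value pairs (as strings) from one grd_chksum3 log line."""
    return dict(FIELD.findall(line))


line_2560 = (
    "diamonds, grd_chksum3: read_restart_calvi chksum=           -1896008147 "
    "chksum2=           -1545844752 min= 0.000000000E+00 max= 7.399996075E+11 "
    "mean= 9.751374716E+10 rms= 1.634493874E+11 sd= 1.311745835E+11"
)
line_2561 = (
    "diamonds, grd_chksum3: read_restart_calvi chksum=            -185424424 "
    "chksum2=            1140423430 min= 0.000000000E+00 max= 7.399992858E+11 "
    "mean= 9.750156607E+10 rms= 1.634237307E+11 sd= 1.311516694E+11"
)

a = parse_stats(line_2560)
b = parse_stats(line_2561)
differing = [name for name in a if a[name] != b[name]]
rel_mean = abs(float(a["mean"]) - float(b["mean"])) / abs(float(a["mean"]))
print(differing)
print(rel_mean)
```

Everything except `min` differs, and the relative difference in `mean` is on the order of 1e-4. That is far larger than double-precision round-off, which suggests the field contents themselves differ rather than only the order of summation.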

Zhi-Liang commented 9 years ago

Hi Niki,

< OCN(ATMOCNLND)=  0.354793438964402       0.354793438964402    0.354793438964402
> OCN(ATMOCNLND)=  0.354433472151885       0.354433472151885    0.354433472151885

This printout is from xgrid.F90. The calculation is based on random numbers, so it cannot reproduce across processor counts.

Zhi
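
A generic illustration of why a random-number-based calculation may not reproduce across processor counts, and one way to make it layout-independent. This is a sketch with hypothetical helper names; it is not how xgrid.F90 or the icebergs module generate their numbers.

```python
import random


def field_per_rank_seed(n_cells, n_ranks):
    """Each rank draws from its own stream seeded by rank number.

    The value a cell receives depends on which rank owns it, so the
    field changes when the decomposition changes.
    """
    base, extra = divmod(n_cells, n_ranks)
    values = []
    for rank in range(n_ranks):
        rng = random.Random(rank)
        count = base + (1 if rank < extra else 0)
        values.extend(rng.random() for _ in range(count))
    return values


def field_per_cell_seed(n_cells, n_ranks):
    """Each cell seeds its own stream from its global index.

    n_ranks is accepted only to mirror the signature above; it is unused,
    which is exactly why the result is independent of the layout.
    """
    return [random.Random(cell).random() for cell in range(n_cells)]


print(field_per_rank_seed(12, 3) == field_per_rank_seed(12, 4))  # False
print(field_per_cell_seed(12, 3) == field_per_cell_seed(12, 4))  # True
```

The second form trades one stream per rank for one seed per cell, which costs more but gives the same field for any layout.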


underwoo commented 9 years ago

There is a namelist option, 'make_calving_reproduce', in the ice_sis version of ice_bergs. Niki, please check whether this option exists in the new icebergs, and whether it is set to .true. in your namelists.



nikizadehgfdl commented 9 years ago

Thanks, that was the problem. The model reproduced across the ice_layout change after I set make_calving_reproduce = .true. in the iceberg namelist.
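
A quick way to confirm the flag before launching a layout test is to scan the namelist text for it. This is a sketch under assumptions: the group name `icebergs_nml` is assumed rather than confirmed in this thread, `namelist_flag` is a hypothetical helper, and the parser expects the closing `/` on its own line and ignores `!` comments.

```python
import re


def namelist_flag(text, group, name):
    """Return True or False for logical `name` in `&group`, or None if it is not set."""
    block = re.search(
        rf"&{re.escape(group)}\b(.*?)^\s*/", text, re.S | re.M | re.I
    )
    if block is None:
        return None
    # Accepts .true./.false. as well as the short forms T/F.
    match = re.search(
        rf"\b{re.escape(name)}\s*=\s*\.?(t|f)", block.group(1), re.I
    )
    if match is None:
        return None
    return match.group(1).lower() == "t"


sample = """
 &icebergs_nml
     make_calving_reproduce = .true.
 /
"""
print(namelist_flag(sample, "icebergs_nml", "make_calving_reproduce"))
```

A result of `None` means the option is absent from that group, in which case the model falls back to whatever default the code defines.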