Closed nikizadehgfdl closed 9 years ago
BTW the model does reproduce across a fv_layout change, atmos_threads change or ocean_layout change (with no mask_table).
@underwoo wrote: "Looking at the stdout's from the CM4_c96L32_am4g5r2_2000_sis2 runs, the "Total Ice Mass|Salt|Heat" are all slightly different from the first print out. Points to something in the iceberg initialization. (The noberg runs do not show the difference in "Total Ice ..".)
Also, I don't see any iceberg restart files in the initCond file. Could you please run a test that uses iceberg restart files to see if that will reproduce across layout changes."
So, I did try that and the answers indeed reproduced across the same ice_layout change!
/// /lustre/f1/Niki.Zadeh/ulm_201505_awg_v20150702_mom6sis2_2015.08.06b/CM4_c96L32_am4g5r2_2000_sis2_withBergsInInitcond/ncrc2.intel-repro-openmp/archive/1x0m10d_2560pe/restart/00010121.tar
\\\ /lustre/f1/Niki.Zadeh/ulm_201505_awg_v20150702_mom6sis2_2015.08.06b/CM4_c96L32_am4g5r2_2000_sis2_withBergsInInitcond/ncrc2.intel-repro-openmp/archive/1x0m10d_2561pe/restart/00010121.tar
the only difference was
Comparing icebergs.res.nc...
DIFFER : VARIABLE : lon : POSITION : 0 : VALUES : -265.452 <> -267.218
All I did was to use one of the restart tars from a 10 days experiment (the 2560 one) as the initCond and repeat the runs.
Here are the stdouts:
/lustre/f1/Niki.Zadeh/ulm_201505_awg_v20150702_mom6sis2_2015.08.06b/CM4_c96L32_am4g5r2_2000_sis2_withBergsInInitcond/ncrc2.intel-repro-openmp/stdout/run/CM4_c96L32_am4g5r2_2000_sis2_withBergsInInitcond_1x0m10d_2560pe.o5041290
/lustre/f1/Niki.Zadeh/ulm_201505_awg_v20150702_mom6sis2_2015.08.06b/CM4_c96L32_am4g5r2_2000_sis2_withBergsInInitcond/ncrc2.intel-repro-openmp/stdout/run/CM4_c96L32_am4g5r2_2000_sis2_withBergsInInitcond_1x0m10d_2561pe.o5041289
So, Seth, what do you make of this?
If I recall correctly, the initialization algorithm picks out the icebergs a given rank owns. One possibility is that there is a layout dependent flaw that may attribute an iceberg to multiple ranks or to no rank thus leaving it out of further simulation.
Jeff Durachta Engineering Lead for Modeling Services NOAA Geophysical Fluid Dynamics Lab Forrestal Campus, Princeton University 201 Forrestal Road Princeton, NJ 08540 Office: +1-609-987-5054
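The hypothesis above (a berg attributed to several ranks, or to none) can be illustrated with a small sketch. This is not taken from the icebergs code: the function names, the 1440x1080 grid size, and the tile-splitting rule are assumptions made purely for illustration. It contrasts a closed-interval ownership test, which double-counts a berg sitting on a tile edge, with a half-open test, which gives each interior berg exactly one owner for any layout.

```python
from typing import List, Tuple


def tile_edges(n_cells: int, n_tiles: int) -> List[int]:
    """Split n_cells into n_tiles contiguous tiles; return the n_tiles + 1 edges."""
    return [(k * n_cells) // n_tiles for k in range(n_tiles + 1)]


def owners(x: float, y: float, layout: Tuple[int, int],
           nx: int, ny: int, closed: bool = False) -> List[int]:
    """Return every rank whose tile claims the point (x, y).

    closed=True  uses lo <= x <= hi (edges belong to both neighbours).
    closed=False uses lo <= x <  hi (each edge belongs to one tile only).
    """
    ex = tile_edges(nx, layout[0])
    ey = tile_edges(ny, layout[1])
    found = []
    for j in range(layout[1]):
        for i in range(layout[0]):
            if closed:
                inside = ex[i] <= x <= ex[i + 1] and ey[j] <= y <= ey[j + 1]
            else:
                inside = ex[i] <= x < ex[i + 1] and ey[j] <= y < ey[j + 1]
            if inside:
                found.append(j * layout[0] + i)
    return found


if __name__ == "__main__":
    # Hypothetical berg position that lies exactly on tile edges.
    berg = (60.0, 270.0)
    for layout in [(72, 4), (96, 3)]:
        print(layout,
              "closed:", owners(*berg, layout, 1440, 1080, closed=True),
              "half-open:", owners(*berg, layout, 1440, 1080))
```

With the closed test the number of claiming ranks depends on the layout, which is the kind of layout-dependent flaw described above. With the half-open test a berg exactly on the upper domain boundary would have no owner at all, so that edge case needs separate handling.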
I've been unable to make an ice-ocean configuration fail reproducibility tests in which I seed every model cell with four bergs moving in the cardinal directions.
Looking at the logs @nikizadehgfdl provided it looks like there is a difference in the calving restart checksum. How does this happen?
> grep restart_calv /lustre/f1/Niki.Zadeh/ulm_201505_awg_v20150702_mom6sis2_2015.08.06b/CM4_c96L32_am4g5r2_2000_sis2/ncrc2.intel-repro-openmp/stdout/run/CM4_c96L32_am4g5r2_2000_sis2_1x0m10d_2560pe.o7149337 /lustre/f1/Niki.Zadeh/ulm_201505_awg_v20150702_mom6sis2_2015.08.06b/CM4_c96L32_am4g5r2_2000_sis2/ncrc2.intel-repro-openmp/stdout/run/CM4_c96L32_am4g5r2_2000_sis2_1x0m10d_2561pe.o7149320
CM4_c96L32_am4g5r2_2000_sis2_1x0m10d_2560pe.o7149337:diamonds, grd_chksum3: read_restart_calvi chksum= -1896008147 chksum2= -1545844752 min= 0.000000000E+00 max= 7.399996075E+11 mean= 9.751374716E+10 rms= 1.634493874E+11 sd= 1.311745835E+11
CM4_c96L32_am4g5r2_2000_sis2_1x0m10d_2561pe.o7149320:diamonds, grd_chksum3: read_restart_calvi chksum= -185424424 chksum2= 1140423430 min= 0.000000000E+00 max= 7.399992858E+11 mean= 9.750156607E+10 rms= 1.634237307E+11 sd= 1.311516694E+11
There is also this line:
< OCN(ATMOCNLND)= 0.354793438964402 0.354793438964402 0.354793438964402
> OCN(ATMOCNLND)= 0.354433472151885 0.354433472151885 0.354433472151885
which has nothing to do with icebergs.
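For anyone repeating the checksum comparison above on other stdout files, here is a minimal sketch that extracts the `grd_chksum3` lines and reports which fields differ between two runs. The regular expression is inferred only from the two log lines quoted in this thread, and the helper names are hypothetical; other log formats may need a different pattern.

```python
import re
from typing import Dict, List, Tuple

# Pattern inferred from the log lines quoted in this thread (an assumption).
CHKSUM_RE = re.compile(
    r"grd_chksum3:\s*(.*?)\s+chksum=\s*(-?\d+)\s+chksum2=\s*(-?\d+)"
)


def parse_checksums(text: str) -> Dict[str, Tuple[int, int]]:
    """Map each checksum label to (chksum, chksum2), keeping the first occurrence."""
    found: Dict[str, Tuple[int, int]] = {}
    for line in text.splitlines():
        match = CHKSUM_RE.search(line)
        if match:
            label = match.group(1)
            found.setdefault(label, (int(match.group(2)), int(match.group(3))))
    return found


def differing_fields(a: Dict[str, Tuple[int, int]],
                     b: Dict[str, Tuple[int, int]]) -> List[str]:
    """Return the labels present in both runs whose checksums disagree."""
    return sorted(k for k in a.keys() & b.keys() if a[k] != b[k])


if __name__ == "__main__":
    run_a = ("diamonds, grd_chksum3: read_restart_calvi chksum= -1896008147 "
             "chksum2= -1545844752 min= 0.000000000E+00 max= 7.399996075E+11")
    run_b = ("diamonds, grd_chksum3: read_restart_calvi chksum= -185424424 "
             "chksum2= 1140423430 min= 0.000000000E+00 max= 7.399992858E+11")
    print(differing_fields(parse_checksums(run_a), parse_checksums(run_b)))
```

Only the first occurrence of each label is kept, since the question here is whether the runs already differ when the restart is read.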
Hi Niki,
< OCN(ATMOCNLND)= 0.354793438964402 0.354793438964402 0.354793438964402
> OCN(ATMOCNLND)= 0.354433472151885 0.354433472151885 0.354433472151885
This printout is from xgrid.F90. This calculation is based on random numbers, so it cannot reproduce across processor counts.
Zhi
There is a namelist option 'make_calving_reproduce' in the ice_sis version of ice_bergs. Niki, please check if this option is in the new icebergs, and if it is set to .true. in your namelists.
Seth Underwood Engility
Modeling Systems Group GFDL/NOAA/DOC 201 Forrestal Road Princeton, NJ 08540-6649
(609) 452-5847 Office (304) 376-9002 Cell (609) 987-5063 Fax Seth.Underwood@noaa.gov
Thanks, that was the problem. The model reproduced across ice_layout change after I set the iceberg namelist make_calving_reproduce = .true.
This is a very old issue which was first seen in ESM2 years ago.
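As general background on why a layout change can alter bits at all: floating-point addition is not associative, so summing the same field in per-rank chunks gives results that depend on how the chunks are cut. The sketch below is only an illustration of that effect. It does not describe how make_calving_reproduce works internally, which this thread does not say, and the function names and values are made up.

```python
import math
from typing import Sequence


def naive_sum(values: Sequence[float]) -> float:
    """Plain left-to-right accumulation (explicit loop, no compensation)."""
    total = 0.0
    for v in values:
        total += v
    return total


def decomposed_sum(values: Sequence[float], n_chunks: int) -> float:
    """Sum each contiguous chunk (one per hypothetical rank), then add the partials."""
    n = len(values)
    edges = [(k * n) // n_chunks for k in range(n_chunks + 1)]
    partials = [naive_sum(values[edges[k]:edges[k + 1]]) for k in range(n_chunks)]
    return naive_sum(partials)


if __name__ == "__main__":
    # Made-up values chosen so that rounding depends on the grouping.
    field = [1e16, 1.0, -1e16, 1.0]
    print("1 chunk :", decomposed_sum(field, 1))
    print("2 chunks:", decomposed_sum(field, 2))
    # math.fsum returns the correctly rounded sum, independent of input order.
    print("fsum    :", math.fsum(field))
```

The explicit loop is deliberate: newer Python versions use compensated summation in the built-in `sum` for floats, which would hide the effect being shown.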
The CM4 coupled model (using SIS2 and its old icebergs module) does not produce the same answers when ice_layout is changed. When I turn off the icebergs the answers are bitwise identical across ice_layout change.
This is with repro mode and with make_exchange_reproduce=.true., but I think neither has an effect here.
I believe this issue persists if I swap SIS2 for SIS1, and there is no reason for it to go away with the new icebergs module either.
Here are the two configs that do not reproduce (ALL restart files differ) unless I turn off the bergs.
They differ only in ice_layout: 72,4 vs 96,3.
The experiments I tried are:
CM4_c96L32_am4g5r2_2000_sis2 which has the issue
CM4_c96L32_am4g5r2_2000_sis2_nobergs which does not have the issue