ESMG / SIS2

NOAA-GFDL's Sea Ice Simulator version 2

New Ice ridging #3

Closed by kshedstrom 3 years ago

kshedstrom commented 3 years ago

Old ice concentration:

cice_old

Ridged ice concentration:

cice_ridge

Old ice thickness:

hice_old

Ridged ice thickness:

hice_ridge

Is this what it is supposed to look like? These are all ten days into a run. Especially annoying is that it is extremely slow compared to the old code. @MJHarrison-GFDL @Hallberg-NOAA

MJHarrison-GFDL commented 3 years ago

Fascinating difference after only 10 days. We may be able to get the cost down. Can we estimate how much of the ridging overhead is time spent in the Icepack routines themselves (over which we have little control)? The work may also be poorly load balanced as a result of the ridging.

The cost of the sea-ice should be small in comparison to the ocean model. Is that not the case with the ridging?

kshedstrom commented 3 years ago

The old run took 52 hours to run for six months. The current run has gone that long and is only 17 days into it. Yes, I can imagine that there could be load-balancing issues.

Hallberg-NOAA commented 3 years ago

This certainly is a dramatic difference, and it was moving in the direction we were looking for, with increased wintertime leads.

The performance difference is shocking. It is hard to imagine why the case with ridging would be three times as expensive as the entire model was before. However, we do have timers that we can use (or add) to track down where all that time is going.

kshedstrom commented 3 years ago

It's more like a factor of ten more costly. How would you go about turning on all the timers?

MJHarrison-GFDL commented 3 years ago

What are the timings reported at the coupler level? At the end of stdout, you will see min/max timings for ATM/ICE/OCN.

kshedstrom commented 3 years ago

The timings for one day aren't that scary. The six-month run is still going and will run out of cpu time before it finishes. One day:

Total runtime                      1402.299702   1402.406452   1402.346276      0.020759  1.000     0     0   311  
Initialization                       48.930966     49.025146     48.976529      0.020481  0.035     0     0   311  
Ice                                 186.555966    191.133310    188.974347      1.116731  0.135     1     0   311  
Ice Fast                              0.202755      7.066391      1.883114      1.906944  0.001    11     0   311  
Ice Slow                            183.853468    190.013848    187.091070      1.649962  0.133    11     0   311  
Ice Fast/Slow Exchange                0.000006      0.001423      0.000118      0.000358  0.000    11     0   311  
Ocean Initialization                 76.210434     76.370668     76.260599      0.025714  0.054    11     0   311  
Ocean                               942.653283    942.712896    942.699539      0.012268  0.672     1     0   311  
Ocean dynamics                      464.256399    823.788685    550.768139     58.162590  0.393    11     0   311  
Ocean thermodynamics and tracers     97.058084    453.055836    359.518532     55.929550  0.256    11     0   311  
Ocean Other                          15.253065     20.663595     17.301125      1.142110  0.012    11     0   311  
ATM                                 242.163030    246.963261    245.869279      1.293262  0.175     0     0   311

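The min/max/mean columns in the coupler report above give a quick read on load imbalance across ranks. As a rough illustration (values transcribed from the table; the relative-spread metric here is just one common choice, not something the coupler itself reports):

```python
# Rough load-imbalance check from the one-day coupler timing table.
# Each entry: (min_seconds, max_seconds, mean_seconds) across the 312 ranks.
timings = {
    "Ice Slow":       (183.853468, 190.013848, 187.091070),
    "Ocean dynamics": (464.256399, 823.788685, 550.768139),
}

for name, (tmin, tmax, tmean) in timings.items():
    # Fractional spread between the slowest and fastest rank, relative to the mean.
    imbalance = (tmax - tmin) / tmean
    print(f"{name}: {imbalance:.1%} spread between fastest and slowest rank")
```

Ice Slow comes out around a 3% spread, so the ridging work itself looks reasonably balanced; the ocean dynamics spread is large, but the Ocean total is nearly identical across ranks, so dynamics and thermodynamics are evidently trading off work per rank.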
MJHarrison-GFDL commented 3 years ago

Well, that looks pretty darn good in terms of load balance. So it should be taking ~70 hours/6 months with 312 cores. How long was a 6-month run taking before the ridging? We can't be talking about more than a 20% increase from the ridging, unless there is something else going on that degrades the performance over time ... Do you want me to try a test on Gaea?
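The ~70 hours/6 months estimate can be reproduced directly from the one-day coupler total; a back-of-the-envelope sketch (the 182.5-day half-year length is an assumption):

```python
# Extrapolate the one-day coupler total to a six-month run.
one_day_seconds = 1402.35       # mean "Total runtime" from the one-day table above
days_in_six_months = 182.5      # assumed half-year length

projected_hours = one_day_seconds * days_in_six_months / 3600.0
print(f"Projected six-month wall time: {projected_hours:.0f} hours")  # ~71 hours
```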

MJHarrison-GFDL commented 3 years ago

Are you using a different number of cores for your production run?

kshedstrom commented 3 years ago

I'm using 312 cores for the production run, same as the short run. We don't have a huge cluster.

MJHarrison-GFDL commented 3 years ago

Memory leak?

kshedstrom commented 3 years ago

Seems like a likely explanation. Can you tell?

MJHarrison-GFDL commented 3 years ago

No. Sounds like we should run a longer test on Gaea.

kshedstrom commented 3 years ago

I'm running a five-day test now. I've asked the guys if they can tell how much memory is being used - no answer yet.

kshedstrom commented 3 years ago

The supercomputer guy says there's no memory leak. The five-day timings are:

Total runtime                      6463.915109   6464.058773   6463.972340      0.025573  1.000     0     0   311  
Initialization                      112.303364    112.379557    112.338285      0.019646  0.017     0     0   311  
Ice                                1060.661444   1103.794030   1085.367246     10.971294  0.168     1     0   311  
Ice Fast                              1.193733     58.242926     14.872363     16.672456  0.002    11     0   311  
Ice Slow                           1038.778091   1102.290151   1070.494408     16.074602  0.166    11     0   311  
Ice Fast/Slow Exchange                0.000034      0.001280      0.000239      0.000376  0.000    11     0   311  
Ocean Initialization                 51.533066     51.632027     51.578070      0.019749  0.008    11     0   311  
Ocean                              5107.741648   5107.912303   5107.797255      0.020502  0.790     1     0   311  
Ocean dynamics                     2487.252234   4439.233302   2918.404205    300.557352  0.451    11     0   311  
Ocean thermodynamics and tracers    520.744444   2445.444956   1986.981465    288.647034  0.307    11     0   311
ATM                                1129.775461   1171.085470   1161.544225     11.805406  0.180     0     0   311

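A progressive slowdown, as a memory leak would eventually cause, should show up as a higher per-day cost in the longer run. Comparing the mean "Total runtime" figures from the one-day and five-day tables:

```python
# Compare per-day wall time between the one-day and five-day tests.
one_day_total = 1402.35     # seconds, one-day run (mean "Total runtime")
five_day_total = 6463.97    # seconds, five-day run (mean "Total runtime")

per_day_five = five_day_total / 5.0
print(f"one-day run:  {one_day_total:.0f} s/day")
print(f"five-day run: {per_day_five:.0f} s/day")

# If cost grew over time (e.g. a leak eventually causing swapping), the
# five-day per-day figure would exceed the one-day figure; here it is
# actually slightly lower, consistent with no leak.
assert per_day_five < one_day_total
```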
kshedstrom commented 3 years ago

I now think I was on a bad node or something. This is one old supercomputer.

Hallberg-NOAA commented 3 years ago

Given that we have a plausible attribution of the slow timing of the one run to a bad compute node, I think that this new ridging code is ready for a PR to dev/gfdl.