kshedstrom closed 3 years ago
Fascinating difference after only 10 days. We may be able to get the cost down. Can we estimate how much of the ridging overhead is time spent in the Icepack routines themselves (over which we have little control)? The ridging may also be leaving the ice model poorly load balanced.
The cost of the sea-ice should be small in comparison to the ocean model. Is that not the case with the ridging?
The old run took 52 hours to run for six months. The current run has gone that long and is only 17 days into it. Yes, I can imagine that there could be load-balancing issues.
This certainly is a dramatic difference, and it was moving in the direction we were looking for, with increased wintertime leads.
The performance difference is shocking. It is hard to imagine why the case with ridging would be 3 times as expensive as the entire model was before. However, we do have timers that we can use (or add) to track down where all that time is going.
It's more like a factor of ten more costly. How would you go about turning on all the timers?
What are the timings reported from the coupler level? At the end of stdout, you will see min/max timings for ATM/ICE/OCN.
The timings for one day aren't that scary. The six-month run is still going and will run out of CPU time before it finishes. One day:
Clock                                     tmin         tmax         tavg        tstd  tfrac grain pemin pemax
Total runtime                      1402.299702  1402.406452  1402.346276    0.020759  1.000     0     0   311
Initialization                       48.930966    49.025146    48.976529    0.020481  0.035     0     0   311
Ice                                 186.555966   191.133310   188.974347    1.116731  0.135     1     0   311
Ice Fast                              0.202755     7.066391     1.883114    1.906944  0.001    11     0   311
Ice Slow                            183.853468   190.013848   187.091070    1.649962  0.133    11     0   311
Ice Fast/Slow Exchange                0.000006     0.001423     0.000118    0.000358  0.000    11     0   311
Ocean Initialization                 76.210434    76.370668    76.260599    0.025714  0.054    11     0   311
Ocean                               942.653283   942.712896   942.699539    0.012268  0.672     1     0   311
Ocean dynamics                      464.256399   823.788685   550.768139   58.162590  0.393    11     0   311
Ocean thermodynamics and tracers     97.058084   453.055836   359.518532   55.929550  0.256    11     0   311
Ocean Other                          15.253065    20.663595    17.301125    1.142110  0.012    11     0   311
ATM                                 242.163030   246.963261   245.869279    1.293262  0.175     0     0   311
Well, that looks pretty darn good in terms of being load balanced. So it should be taking ~70 hours per 6 months with 312 cores. How long was a 6-month run taking before using the ridging? We can't be talking more than a 20% increase from the ridging, unless there is something else going on which degrades the performance over time ... Do you want me to try a test on Gaea?
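The ~70 hour estimate and the load-balance reading can be reproduced from the one-day table with a few lines of Python. This is only a sketch of the arithmetic: the numbers are copied from the table above, the 180-day length of "six months" is an assumption, and the helper names are made up for illustration.

```python
# (tmin, tmax, tavg) in seconds, copied from the one-day timing table above.
one_day = {
    "Total runtime": (1402.299702, 1402.406452, 1402.346276),
    "Ice": (186.555966, 191.133310, 188.974347),
    "Ocean": (942.653283, 942.712896, 942.699539),
    "ATM": (242.163030, 246.963261, 245.869279),
}

DAYS_IN_SIX_MONTHS = 180  # assumption: six 30-day months


def imbalance(tmin, tmax, tavg):
    """Slowest rank's time relative to the mean (1.0 means perfectly balanced)."""
    return tmax / tavg


def projected_hours(seconds_per_day, days=DAYS_IN_SIX_MONTHS):
    """Wall-clock hours for `days` model days at a constant cost per day."""
    return seconds_per_day * days / 3600.0


total_avg = one_day["Total runtime"][2]
hours = projected_hours(total_avg)
ice_fraction = one_day["Ice"][2] / total_avg

print(f"Projected six-month wall time: {hours:.1f} hours")
print(f"Ice share of total runtime:    {ice_fraction:.1%}")
for name, stats in one_day.items():
    print(f"{name:14s} max/mean = {imbalance(*stats):.3f}")
```

If the per-day cost really stayed constant, the ice share printed here is an upper bound on what the ridging alone could add, which is where the "no more than 20%" argument comes from.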
Are you using a different number of cores for your production run?
I'm using 312 cores for the production run, same as the short run. We don't have a huge cluster.
Memory leak?
Seems like a likely explanation. Can you tell?
No. Sounds like we should run a longer test on Gaea.
I'm running a five-day test now. I've asked the guys if they can tell how much memory is being used - no answer yet.
The supercomputer guy says there's no memory leak. The five-day timings are:
Clock                                     tmin         tmax         tavg        tstd  tfrac grain pemin pemax
Total runtime                      6463.915109  6464.058773  6463.972340    0.025573  1.000     0     0   311
Initialization                      112.303364   112.379557   112.338285    0.019646  0.017     0     0   311
Ice                                1060.661444  1103.794030  1085.367246   10.971294  0.168     1     0   311
Ice Fast                              1.193733    58.242926    14.872363   16.672456  0.002    11     0   311
Ice Slow                           1038.778091  1102.290151  1070.494408   16.074602  0.166    11     0   311
Ice Fast/Slow Exchange                0.000034     0.001280     0.000239    0.000376  0.000    11     0   311
Ocean Initialization                 51.533066    51.632027    51.578070    0.019749  0.008    11     0   311
Ocean                              5107.741648  5107.912303  5107.797255    0.020502  0.790     1     0   311
Ocean dynamics                     2487.252234  4439.233302  2918.404205  300.557352  0.451    11     0   311
Ocean thermodynamics and tracers    520.744444  2445.444956  1986.981465  288.647034  0.307    11     0   311
ATM                                1129.775461  1171.085470  1161.544225   11.805406  0.180     0     0   311
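One way to check whether performance degrades as the run gets longer is to compare the cost per model day in the one-day and five-day tables. The Python sketch below does that for the top-level clocks; it assumes each timing line is a clock name followed by eight numeric columns (min, max and mean seconds first), and `parse_clock_line` is a hypothetical helper, not part of the model.

```python
def parse_clock_line(line):
    """Split one timing line into (name, tmin, tmax, tavg).

    Assumes a clock name (which may contain spaces) followed by eight
    numeric columns, the first three being min, max and mean seconds.
    """
    parts = line.split()
    name = " ".join(parts[:-8])
    tmin, tmax, tavg = (float(x) for x in parts[-8:-5])
    return name, tmin, tmax, tavg


def mean_seconds(lines):
    """Map clock name -> mean seconds across ranks."""
    return {name: tavg for name, _tmin, _tmax, tavg in map(parse_clock_line, lines)}


# Top-level clocks copied from the two tables above.
one_day = mean_seconds([
    "Total runtime 1402.299702 1402.406452 1402.346276 0.020759 1.000 0 0 311",
    "Ice 186.555966 191.133310 188.974347 1.116731 0.135 1 0 311",
    "Ocean 942.653283 942.712896 942.699539 0.012268 0.672 1 0 311",
    "ATM 242.163030 246.963261 245.869279 1.293262 0.175 0 0 311",
])
five_day = mean_seconds([
    "Total runtime 6463.915109 6464.058773 6463.972340 0.025573 1.000 0 0 311",
    "Ice 1060.661444 1103.794030 1085.367246 10.971294 0.168 1 0 311",
    "Ocean 5107.741648 5107.912303 5107.797255 0.020502 0.790 1 0 311",
    "ATM 1129.775461 1171.085470 1161.544225 11.805406 0.180 0 0 311",
])

# Cost per model day in the five-day run relative to the one-day run.
growth = {name: (five_day[name] / 5.0) / one_day[name] for name in one_day}
for name, ratio in growth.items():
    print(f"{name:14s} per-day cost, 5-day vs 1-day run: {ratio:.2f}x")
```

Ratios close to 1 argue against a slowdown that builds up with run length. Note that the total clock includes one-time initialization, which weighs more heavily on the one-day run, so the per-component ratios are the more telling ones.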
I now think I was on a bad node or something. This is one old supercomputer.
Given that we have a plausible attribution of the slow timing of the one run to a bad compute node, I think that this new ridging code is ready for a PR to dev/gfdl.
Old ice concentration:
Ridged ice concentration:
Old ice thickness:
Ridged ice thickness:
Is this what it is supposed to look like? These are all ten days into a run. Especially annoying is that it is extremely slow compared to the old code. @MJHarrison-GFDL @Hallberg-NOAA