Are we positive the memory leak is due to replay? Could you make successful long runs in forecast mode? There is nothing "ERA5"-specific in mkIAU. True, we do horizontal regridding and vertical remapping. We will take a look and check whether we create a new regridder (i.e., an ESMF route handle) every time we call mkIAU.
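For illustration, the pattern to check for would look roughly like the sketch below (module and routine names are invented, this is not the actual mkIAU code): build the ESMF route handle once and reuse it on every call, instead of recreating the regridder each time.

```fortran
! Illustrative sketch only: cache the ESMF route handle so the regridder
! is built once and reused, rather than recreated on every call.
module regrid_cache_sketch
  use ESMF
  implicit none
  private
  public :: do_regrid

  type(ESMF_RouteHandle), save :: rh            ! cached regridder
  logical,                save :: rh_ready = .false.

contains

  subroutine do_regrid(srcField, dstField, rc)
    type(ESMF_Field), intent(inout) :: srcField, dstField
    integer,          intent(out)   :: rc

    ! Build the route handle only on the first call ...
    if (.not. rh_ready) then
       call ESMF_FieldRegridStore(srcField=srcField, dstField=dstField, &
            routehandle=rh, regridmethod=ESMF_REGRIDMETHOD_BILINEAR, rc=rc)
       if (rc /= ESMF_SUCCESS) return
       rh_ready = .true.
    end if

    ! ... and only apply it on subsequent calls.
    call ESMF_FieldRegrid(srcField, dstField, routehandle=rh, rc=rc)
  end subroutine do_regrid

end module regrid_cache_sketch
```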
Another (actually quite likely) possibility is that the bundle holding the variables we are replaying to needs to be destroyed more carefully. Although we do call MAPL_FieldBundleDestroy, internally (in MAPL) we do not pass noGarbage=.true., and without that flag ESMF performs the full destroy only in ESMF_Finalize, during the garbage-collection phase of ESMF_VmFinalize. For that matter, a similar destroy needs to happen inside MAPL_CFIOReadBundle ...
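Again purely as a sketch at the ESMF level (the subroutine name is invented, and in practice MAPL_FieldBundleDestroy would have to pass the flag through), the immediate destroy would look like:

```fortran
! Illustrative sketch only: destroying the bundle with noGarbage=.true.
! frees the ESMF object immediately; without it, ESMF keeps the object
! until the garbage-collection phase at ESMF_Finalize, which looks like
! a leak over a long run.
subroutine destroy_replay_bundle(bundle, rc)
  use ESMF
  implicit none
  type(ESMF_FieldBundle), intent(inout) :: bundle
  integer,                intent(out)   :: rc

  call ESMF_FieldBundleDestroy(bundle, noGarbage=.true., rc=rc)
end subroutine destroy_replay_bundle
```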
Time permitting, I'll try to address these issues early next week...
In the meantime, I would suggest disabling WINDFIX. Try to run with REPLAY_WINDFIX: NO in your AGCM.rc. WINDFIX uses global-size arrays, and I've seen memory-related issues with it in my early tests on SCU17. While this will not fix any memory leaks, it might help reduce the overall memory usage. If this does not help your runs, your best bet (at the moment) is to make shorter runs.
Thanks @atrayano, you want to see 2 runs:
Did I get that correct?
Both. You can do it in a single run. The "no-windfix" was an attempt to help you so that your runs might be able to finish. As for the no-replay run, it would be a good baseline to see whether anywhere else in the coupled system we might have memory leaks. I already know what to change in mkIAU/MAPL to fix, or at least reduce, the memory leaks there. I am simply not convinced that those are the only two places with memory leaks.
@atrayano,
As far as the no-replay run, it would be a good baseline to see if anywhere else in the coupled system we might have memory leaks.
Is done ✅, see /discover/nobackup/sakella/v11_4_0_e1/slurm-26473439.out. As I see it, the forecast (or no-replay) run also has a growing memory footprint. See this plot of used memory; both runs are 15 days long with a 7.5-minute time step. I'm plotting what comes from (the example below is for the no-replay experiment):
grep 'Mem Comm:Used' /discover/nobackup/sakella/v11_4_0_e1/slurm-26473439.out | cut -c 154-159 > mem_used_26473439
With that ⬆️, one may quickly come to the conclusion that:
I'm guessing it is number 2 above; see the plots below. I found @Jcampbell-8's no-replay and ERA5 replay experiments:
CAP.rc in the path below: /discover/nobackup/projects/gmao/g6dev/jjcampb2/Replay_Experiments/KM_v11.0.0-rc.1_BACM_HY_L072_C360_RepE5/
⬆️ is the reason for my conclusion: we always have a leak, forecast or replay!
Anyway, FWIW, a plot comparing used memory for coupled/uncoupled replay runs is here.
Both. You can do it in a single run. The "no-windfix" was an attempt to help you so that your runs might be able to finish.
I didn't, because the experiment duration can be cut and the number of segments increased, as suggested by @wmputman and as @Jcampbell-8 is running; clearly, if the experiments are run long enough in one segment, all will crash! Indeed there is a leak...!
@atrayano Any comments?
@sanAkel Thanks for doing the experiments, and the plots! I can clearly see that we have memory leaks. The "dominant" leak seems to be caused by the coupled model, but we cannot ignore the leak caused by replay either. We need to find and fix them. I already found and fixed small leaks in mkIAU (none of my changes is committed yet). Once I have quiet time (hopefully next week) I'll replace MAPL_CFIORead (which definitely causes memory leaks) inside mkIAU with the much cleaner layer used pretty much everywhere else in MAPL. @mathomp4 is aware of the leak in the coupled model and is looking into it too.
Thanks @atrayano, looking forward to your and @mathomp4's fixes.
(I'm on AL; will be turning off GitHub notifications next week.)
@sanAkel Since @mathomp4 is off the week of Christmas and since @atrayano is quite busy, I'll take a look at this. I've already confirmed I could reproduce the leak with no replay.
Thanks @bena-nasa, Jan 2024 first week is fine, I'm also off next week.
I'll let @atrayano say what he saw in his parallel efforts at intercepting malloc and using gfortran. Here are my observations from the system memory-reporting utilities.
So what can I conclude?
@bena-nasa Thanks for these extensive diagnostics and notes.
Glad to note:
This leak is real. I've been running v11.4.0 on the Milan nodes with the Intel compiler. At, say, a c180 AMIP run (so no need to do replay or run coupled to see it), on the same number of cores @sanAkel was using (30x36), after 10 days the root memory has grown about 4%, though I did disable History and ExtData for this test.
Is there any progress/update from you .or. @atrayano .or. @mathomp4: any PR?
After I saw your and @atrayano's replies, I wanted to check some of my old experiments replaying to MERRA-2. I had used GEOSgcm-v11.1.1 and ran on sky/cas nodes (as opposed to the mil nodes that I use now).
To revisit the earlier question from @atrayano regarding forecast/replay:
As far as the no-replay run, it would be a good baseline to see if anywhere else in the coupled system we might have memory leaks.
Here is a plot from forecast .and. replay experiments.
With GEOSgcm-v11.1.1, whether in forecast or in replay to MERRA-2, the memory growth is at a comparable pace.
@sanAkel I have no updates. Got sucked into some other things. I think we are dealing with MULTIPLE leaks, which makes this even worse.
Even with no replay the model clearly has a leak; as far as I can tell the biggest cause is the radiation and chemistry components. More investigation into these is needed.
Replay just makes it worse, it seems; it could be mkIAU that is causing it, or simply that you are running the same components 1.5x as much, or both.
One thought for the exacerbated leak in replay is that mkIAU is still using the old CFIO layer. Atanas says that is leaking and I 100% believe it; it's something I desperately want to get rid of because it's just bad code. It was written before you could use allocatable in derived types, so it's just an absolute mess of pointers that are probably chewing up memory. I have a drop-in replacement for CFIO utilizing the new IO framework used by History/ExtData. I've confirmed that I can indeed just drop it in, but have not evaluated its impact on memory yet. However, it is non-zero-diff due to some subtle issues with the transformations; these are just numerical differences in how the coordinates for the input lat-lon grid are generated. The new code is the correct thing, so no one can tell me with a straight face that the old code is in any way more correct than the new, and I would lose no sleep over the non-zero diff.
Unfortunately, others are more sensitive to this. In particular, the DAS workflow also uses mkIAU in a different context, so I need to test that, as there is one subtle issue I need to make sure we don't trip over besides the simple, explainable non-zero diff. (This should be easy since it's mkiau.x, a little program with few inputs, but there's no way to test it, or at least get an input set, without running the whole DAS workflow, so I'll have to figure this out and test.) That's on my radar.
As far as a more precise detection of the leak goes, the signal is so small that any sort of in-code memory diagnostics are rather useless on a timestep-to-timestep basis. You can only see this by looking at the longer signal. At that point the only tool I have is to eliminate code during execution and see if the leak goes away. That's tedious, and there's only so much you can turn off before the model goes haywire. The other tool is the malloc interception Atanas has been looking at, but that's not a magic bullet either.
@mathomp4 I accidentally deleted your comment regarding the cmake option.
Why can't we do that?
@sanAkel Well, it's not that elegant for one thing. But I don't really know how the code currently is or how it would be after changing. An ifdef around a single line is okay, but around 100 lines it becomes less useful!
Thanks for rephrasing and adding (back) your comments. Sorry about that!
All,
Shall I close this issue, since replay would be redesigned?
@sanAkel Please do not close the issue. This is a good reminder that we have memory leaks. By the way, I am not sure that replay is the "real" culprit. It simply makes the leaks more visible because we run the components more (about 50%) due to the extra cycles in the predictor. I see a leak in mkIAU too, but we have memory leaks inside many of the components. I am developing/working with a tool that intercepts the malloc(s) and the free(s), and so far it has helped me identify a few of the leaks. Unfortunately these were small. I'll post updates here...
Ok, @atrayano, I tend to keep my issues/PR list active; stale ones get closed. Will not close this one.
@sanAkel I have been chasing memory leaks for the past two months and found many small ones. But today I hit gold. This is a major one, most likely responsible for your issue(s). Ultimately it could be traced to MAPL, and ExtData2G in particular, but it was exposed by CO_GridCompMod.F90. If you run GEOS with nbins_CO=10 (the default setting, as per gcm_setup), the leak is very visible. But if you run with nbins_CO=1, there is no obvious leak. By the way, this comes from RC/Chem_Registry.rc. I've already fixed my local MAPL sandbox, and the test shows no leak. I'll create a hotfix PR in MAPL in the next day or so, but until then try lowering nbins_CO to 1 in your experiment. Once we/you confirm that this (or the upcoming MAPL change) helps, you can close the ticket.
I have a run that I started about an hour ago (6 PM).
Will try your fix tomorrow.
Thanks a lot for this. Appreciate it.
The proper fix will be in MAPL. But it might take some time before it goes through the testing, review, release, etc. Until then, nbins_CO=1 is your friend. Luckily, Tom and I met with Bill today to share some of our findings, and he told us that nbins_CO=1 tracks the CO globally. Anything else could track CO for specific geographic regions, and this involves masking; this is where the bug was.
In the context of coupled modeling, nbins_CO=1 (tracking CO globally) seems to suffice.
@atrayano the run I was talking about ⬆️ is giving me a scare!
See /discover/nobackup/sakella/free_run_gold_per/slurm-28665381.out for the log file.
Start date is 2021/06/01, see these 😱 numbers!
AGCM Date: 2021/06/01 Time: 21:07:30 Throughput(days/day)[Avg Tot Run]: 10.2 10.2 44.1 TimeRemaining(Est) 075:25:23 54.9% : 34.7% Mem Comm:Used
AGCM Date: 2021/06/01 Time: 21:15:00 Throughput(days/day)[Avg Tot Run]: 303.0 151.5 156.0 TimeRemaining(Est) 002:32:02 55.2% : 35.1% Mem Comm:Used
AGCM Date: 2021/06/01 Time: 21:22:30 Throughput(days/day)[Avg Tot Run]: 244.5 176.5 182.5 TimeRemaining(Est) 003:08:23 55.2% : 35.2% Mem Comm:Used
AGCM Date: 2021/06/01 Time: 21:30:00 Throughput(days/day)[Avg Tot Run]: 215.8 159.7 164.3 TimeRemaining(Est) 003:33:25 55.2% : 35.2% Mem Comm:Used
AGCM Date: 2021/06/01 Time: 21:37:30 Throughput(days/day)[Avg Tot Run]: 211.3 195.5 202.3 TimeRemaining(Est) 003:37:52 55.2% : 35.3% Mem Comm:Used
AGCM Date: 2021/06/01 Time: 21:45:00 Throughput(days/day)[Avg Tot Run]: 207.7 191.6 198.3 TimeRemaining(Est) 003:41:36 55.2% : 35.3% Mem Comm:Used
AGCM Date: 2021/06/01 Time: 21:52:30 Throughput(days/day)[Avg Tot Run]: 204.3 186.2 192.5 TimeRemaining(Est) 003:45:15 55.2% : 35.3% Mem Comm:Used
and towards the end, 🤞 that it does not crash!
AGCM Date: 2021/07/02 Time: 12:00:00 Throughput(days/day)[Avg Tot Run]: 166.8 117.7 155.8 TimeRemaining(Est) 000:11:52 98.9% : 85.6% Mem Comm:Used
AGCM Date: 2021/07/02 Time: 12:07:30 Throughput(days/day)[Avg Tot Run]: 166.7 47.6 48.6 TimeRemaining(Est) 000:11:49 98.8% : 85.6% Mem Comm:Used
AGCM Date: 2021/07/02 Time: 12:15:00 Throughput(days/day)[Avg Tot Run]: 166.7 198.7 217.8 TimeRemaining(Est) 000:11:47 98.8% : 85.6% Mem Comm:Used
AGCM Date: 2021/07/02 Time: 12:22:30 Throughput(days/day)[Avg Tot Run]: 166.7 198.2 217.0 TimeRemaining(Est) 000:11:44 98.8% : 85.6% Mem Comm:Used
AGCM Date: 2021/07/02 Time: 12:30:00 Throughput(days/day)[Avg Tot Run]: 166.7 198.9 217.7 TimeRemaining(Est) 000:11:41 98.8% : 85.6% Mem Comm:Used
AGCM Date: 2021/07/02 Time: 12:37:30 Throughput(days/day)[Avg Tot Run]: 166.7 198.3 217.3 TimeRemaining(Est) 000:11:39 98.8% : 85.6% Mem Comm:Used
AGCM Date: 2021/07/02 Time: 12:45:00 Throughput(days/day)[Avg Tot Run]: 166.7 198.6 217.4 TimeRemaining(Est) 000:11:36 98.8% : 85.6% Mem Comm:Used
AGCM Date: 2021/07/02 Time: 12:52:30 Throughput(days/day)[Avg Tot Run]: 166.7 199.2 218.0 TimeRemaining(Est) 000:11:33 98.8% : 85.6% Mem Comm:Used
AGCM Date: 2021/07/02 Time: 13:00:00 Throughput(days/day)[Avg Tot Run]: 166.7 146.1 156.0 TimeRemaining(Est) 000:11:30 99.1% : 85.6% Mem Comm:Used
AGCM Date: 2021/07/02 Time: 13:07:30 Throughput(days/day)[Avg Tot Run]: 166.7 156.3 167.6 TimeRemaining(Est) 000:11:28 98.8% : 85.6% Mem Comm:Used
AGCM Date: 2021/07/02 Time: 13:15:00 Throughput(days/day)[Avg Tot Run]: 166.7 198.9 218.0 TimeRemaining(Est) 000:11:25 98.8% : 85.6% Mem Comm:Used
AGCM Date: 2021/07/02 Time: 13:22:30 Throughput(days/day)[Avg Tot Run]: 166.7 198.4 217.4 TimeRemaining(Est) 000:11:22 98.8% : 85.6% Mem Comm:Used
AGCM Date: 2021/07/02 Time: 13:30:00 Throughput(days/day)[Avg Tot Run]: 166.7 198.4 217.1 TimeRemaining(Est) 000:11:20 98.8% : 85.6% Mem Comm:Used
AGCM Date: 2021/07/02 Time: 13:37:30 Throughput(days/day)[Avg Tot Run]: 166.7 198.3 217.0 TimeRemaining(Est) 000:11:17 98.8% : 85.6% Mem Comm:Used
AGCM Date: 2021/07/02 Time: 13:45:00 Throughput(days/day)[Avg Tot Run]: 166.8 198.8 217.8 TimeRemaining(Est) 000:11:14 98.8% : 85.6% Mem Comm:Used
AGCM Date: 2021/07/02 Time: 13:52:30 Throughput(days/day)[Avg Tot Run]: 166.8 198.2 217.2 TimeRemaining(Est) 000:11:11 98.8% : 85.6% Mem Comm:Used
AGCM Date: 2021/07/02 Time: 14:00:00 Throughput(days/day)[Avg Tot Run]: 166.8 146.5 156.7 TimeRemaining(Est) 000:11:09 99.1% : 85.7% Mem Comm:Used
AGCM Date: 2021/07/02 Time: 14:07:30 Throughput(days/day)[Avg Tot Run]: 166.8 155.7 167.0 TimeRemaining(Est) 000:11:06 98.8% : 85.7% Mem Comm:Used
AGCM Date: 2021/07/02 Time: 14:15:00 Throughput(days/day)[Avg Tot Run]: 166.8 198.8 217.8 TimeRemaining(Est) 000:11:03 98.9% : 85.7% Mem Comm:Used
AGCM Date: 2021/07/02 Time: 14:22:30 Throughput(days/day)[Avg Tot Run]: 166.8 198.1 216.9 TimeRemaining(Est) 000:11:01 98.9% : 85.7% Mem Comm:Used
AGCM Date: 2021/07/02 Time: 14:30:00 Throughput(days/day)[Avg Tot Run]: 166.8 197.8 216.6 TimeRemaining(Est) 000:10:58 98.9% : 85.7% Mem Comm:Used
AGCM Date: 2021/07/02 Time: 14:37:30 Throughput(days/day)[Avg Tot Run]: 166.8 198.2 217.2 TimeRemaining(Est) 000:10:55 98.9% : 85.7% Mem Comm:Used
AGCM Date: 2021/07/02 Time: 14:45:00 Throughput(days/day)[Avg Tot Run]: 166.8 198.5 217.4 TimeRemaining(Est) 000:10:52 98.9% : 85.7% Mem Comm:Used
AGCM Date: 2021/07/02 Time: 14:52:30 Throughput(days/day)[Avg Tot Run]: 166.8 198.9 217.9 TimeRemaining(Est) 000:10:50 98.9% : 85.7% Mem Comm:Used
I'm making my own hotfix:
discover34.sakella 573> grep 'nbins_CO' RC/Chem_Registry.rc
nbins_CO: 10 # carbon monoxide
nbins_CO: 1 # carbon monoxide
@sanAkel Matt just released a MAPL that has the bug fix (2.43.2), so even if you wanted to run with 10 bins, it won't leak memory. But there is a performance penalty running with 10 bins. And by the way, according to Arlindo, nbins_CO=10 was there to support field campaigns, but "normally" everybody should be using 1 bin.
but "normally" everybody should be using 1 bin
Then that's what the default should be.
cc: @amdasilva
Description:
I set up a high-res coupled experiment to replay to ERA-5, path on discover:
/discover/nobackup/sakella/v11_4_0_e1
It was built/run on SCU17.
Problem:
I had 2 runs that crashed with an identical problem: they ran out of memory. See:
Proof of memory leak:
grep 'Mem Comm:Used' slurm-26229571.out | less
The message before exit was:
I don't recall having this problem a few months ago when I replayed to ERA5. Unfortunately, I purged that experiment!
Per michael.w.donovan@nasa.gov:
Copying: @GEOS-ESM/gmao-si-team, @GEOS-ESM/adas-gatekeepers, @wmputman.