GEOS-ESM / GEOSgcm

GEOS Earth System Model GEOSgcm Fixture

Memory leak when replaying to ERA-5 #723

Closed. sanAkel closed this issue 2 months ago.

sanAkel commented 11 months ago

Description:

I set up a high-res coupled experiment to replay to ERA-5, path on discover: /discover/nobackup/sakella/v11_4_0_e1

It was built and run on SCU17.

Problem:

I had 2 runs that crashed with the identical problem: they ran out of memory. See:

Proof of memory leak: grep 'Mem Comm:Used' slurm-26229571.out | less

AGCM Date: 2021/11/30  Time: 21:07:30  Throughput(days/day)[Avg Tot Run]:          6.0          6.0          8.8  TimeRemaining(Est) 120:52:42   41.7% :  30.8% Mem Comm:Used

...
...
AGCM Date: 2021/12/30  Time: 08:52:30  Throughput(days/day)[Avg Tot Run]:        141.1        179.9        200.6  TimeRemaining(Est) 000:05:09  111.9% : 100.0% Mem Comm:Used
AGCM Date: 2021/12/30  Time: 09:00:00  Throughput(days/day)[Avg Tot Run]:        141.1        109.2        127.2  TimeRemaining(Est) 000:05:06  112.1% : 100.0% Mem Comm:Used

The message before exit was:

ORTE has lost communication with a remote daemon.

  HNP daemon   : [[30485,0],0] on node borgm067
  Remote daemon: [[30485,0],2] on node borgm069

This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.

I don't recall having this problem a few months ago when I replayed to ERA-5. Unfortunately, I purged that experiment!

Per michael.w.donovan@nasa.gov:

> and just to confirm, in the example above, borgm069 crashed and rebooted as a result of the out-of-memory condition (as designed), so the TCP error is a bit misleading - it's because the node was no longer there.

Copying: @GEOS-ESM/gmao-si-team, @GEOS-ESM/adas-gatekeepers, @wmputman.

atrayano commented 11 months ago

Are we positive the memory leak is due to replay? Could you make successful long runs in forecast mode? There is nothing "ERA5"-specific in mkIAU. True, we do horizontal regridding and vertical remapping. We will take a look and check whether we create a new regridder (i.e., an ESMF route handle) every time we call mkIAU.
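
A minimal sketch of that pattern (assuming standard ESMF calls; this is not the actual mkIAU code): compute the route handle once, reuse it on every subsequent call, and release it only at finalize, so each invocation does not allocate a fresh regridder.

```fortran
! Hedged sketch, not the actual GEOS/mkIAU code: cache the ESMF route handle
! instead of recreating it on every call.
subroutine regrid_cached(srcField, dstField, rc)
   use ESMF
   implicit none
   type(ESMF_Field), intent(inout) :: srcField, dstField
   integer, intent(out) :: rc

   type(ESMF_RouteHandle), save :: routeHandle
   logical, save :: haveRouteHandle = .false.

   if (.not. haveRouteHandle) then
      ! Expensive: computes and stores the interpolation weights; do it only once.
      call ESMF_FieldRegridStore(srcField=srcField, dstField=dstField, &
                                 routehandle=routeHandle, rc=rc)
      haveRouteHandle = .true.
   end if

   ! Cheap: apply the precomputed weights.
   call ESMF_FieldRegrid(srcField, dstField, routehandle=routeHandle, rc=rc)

   ! ESMF_FieldRegridRelease(routeHandle) belongs in a finalize step; otherwise
   ! the weights (and their memory) are never freed.
end subroutine regrid_cached
```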

Another (actually quite likely) possibility is that the bundle with the variables we are replaying to needs to be destroyed more carefully. Although we do call MAPL_FieldBundleDestroy internally (in MAPL), we do not pass noGarbage=.true., and without that flag ESMF performs the full destroy in ESMF_Finalize, during the garbage-collection phase of ESMF_VmFinalize. For that matter, a similar destroy needs to happen inside MAPL_CFIOReadBundle ...
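
For reference, a minimal sketch of the kind of explicit destroy being described (assumed ESMF API usage, not the MAPL internals): passing noGarbage=.true. makes ESMF free the objects immediately rather than deferring them to the garbage-collection phase of ESMF_Finalize.

```fortran
! Hedged sketch: destroy the member fields and the bundle with noGarbage=.true.
! so the memory is released now, not at ESMF_Finalize.
subroutine destroy_bundle_now(bundle, rc)
   use ESMF
   implicit none
   type(ESMF_FieldBundle), intent(inout) :: bundle
   integer, intent(out) :: rc

   type(ESMF_Field), allocatable :: fieldList(:)
   integer :: i, fieldCount

   call ESMF_FieldBundleGet(bundle, fieldCount=fieldCount, rc=rc)
   allocate(fieldList(fieldCount))
   call ESMF_FieldBundleGet(bundle, fieldList=fieldList, rc=rc)

   do i = 1, fieldCount
      call ESMF_FieldDestroy(fieldList(i), noGarbage=.true., rc=rc)
   end do
   call ESMF_FieldBundleDestroy(bundle, noGarbage=.true., rc=rc)
   deallocate(fieldList)
end subroutine destroy_bundle_now
```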

Time permitting, I'll try to address these issues early next week...

In the meantime, I would suggest disabling WINDFIX: try running with REPLAY_WINDFIX: NO in your AGCM.rc. WINDFIX uses global-size arrays, and I've seen memory-related issues with it in my early tests on SCU17. While this will not fix any memory leaks, it might help reduce the overall memory usage. If this does not help your runs, your best bet (at the moment) is to make shorter runs.

sanAkel commented 11 months ago

Thanks @atrayano, you want to see 2 runs:

  1. No Wind fix?
  2. No replay?

Did I get that correct?

atrayano commented 11 months ago

Both. You can do it in a single run. The "no-windfix" was an attempt to help you so that your runs might be able to finish. As far as the no-replay run, it would be a good baseline to see if anywhere else in the coupled system we might have memory leaks. I already know what to change in mkIAU/MAPL to fix or at least reduce the memory leaks there. I am simply not convinced that those are the only two places with memory leaks

sanAkel commented 11 months ago

@atrayano,

> As far as the no-replay run, it would be a good baseline to see if anywhere else in the coupled system we might have memory leaks.

Is done ✅; see /discover/nobackup/sakella/v11_4_0_e1/slurm-26473439.out. As I see it, the forecast (no-replay) run also has its memory footprint growing. See the plot of used memory below; both runs are 15 days long with a time step of 7.5 min. I'm plotting what comes from the following (the example below is for the no-replay experiment):

grep 'Mem Comm:Used' /discover/nobackup/sakella/v11_4_0_e1/slurm-26473439.out | cut -c 154-159 > mem_used_26473439

[Plot: used memory vs. time for the no-replay and replay coupled runs]

With that ⬆️, one may quickly come to a conclusion that:

  1. Coupled model, even in forecast mode has a memory leak!
  2. In other words, we always have a leak, whether forecast or replay!

I'm guessing it is number 2 above; see the plots below. I found @Jcampbell-8's no-replay and ERA5 replay experiments:

Coupled and uncoupled no-replay

[Plot: used memory, coupled and uncoupled no-replay (forecast) runs]

⬆️ is the reason for my conclusion: we always have a leak, whether forecast or replay!

Coupled and uncoupled replay

Anyway, FWIW, here is a plot comparing used memory for the coupled/uncoupled replay runs: [Plot: used memory, coupled and uncoupled replay runs]

Regarding the second run:

> Both. You can do it in a single run. The "no-windfix" was an attempt to help you so that your runs might be able to finish.

I didn't, because the experiment duration can be cut and the number of segments increased, as suggested by @wmputman and as @Jcampbell-8 is running; clearly, if the experiments are run long enough in one segment, all will crash! Indeed there is a leak ...!

sanAkel commented 11 months ago

@atrayano Any comments?

atrayano commented 11 months ago

@sanAkel Thanks for doing the experiments, and the plots! I can clearly see that we have memory leaks. The "dominant" leak seems to be caused by the coupled model, but we cannot ignore the leak caused by replay either. We need to find and fix them. I already found and fixed small leaks in mkIAU (none of my changes is committed yet). Once I have quiet time (hopefully next week), I'll replace MAPL_CFIORead (which definitely causes memory leaks) inside mkIAU with the much cleaner layer used pretty much everywhere else in MAPL. @mathomp4 is aware of the leak in the coupled model and is looking into it too.

sanAkel commented 11 months ago

> @sanAkel Thanks for doing the experiments, and the plots! I can clearly see that we have memory leaks. The "dominant" leak seems to be caused by the coupled model, but we cannot ignore the leak caused by replay either. We need to find and fix them. I already found and fixed small leaks in mkIAU (none of my changes is committed yet). Once I have quiet time (hopefully next week), I'll replace MAPL_CFIORead (which definitely causes memory leaks) inside mkIAU with the much cleaner layer used pretty much everywhere else in MAPL. @mathomp4 is aware of the leak in the coupled model and is looking into it too.

Thanks @atrayano, looking forward to your and @mathomp4's fixes.

(I'm on AL- will be turning off GitHub notifications next week.)

bena-nasa commented 11 months ago

@sanAkel Since @mathomp4 is off the week of Christmas and @atrayano is quite busy, I'll take a look at this. I've already confirmed I can reproduce the leak with no replay.

sanAkel commented 11 months ago

Thanks @bena-nasa, Jan 2024 first week is fine, I'm also off next week.

bena-nasa commented 11 months ago

I'll let @atrayano say what he saw in his parallel efforts at intercepting malloc and using gfortran. Here are my observations, based on what I can see from the system memory-reporting utilities.

  1. This leak is real. I've been running v11.4.0 on the Milan nodes with the Intel compiler, in a c180 AMIP run (so no need for replay or coupling to see it) on the same number of cores @sanAkel was using (30x36). After 10 days the root memory has grown about 4%, though I did disable History and ExtData for this test.
  2. First I tried FV3 standalone to see if that leaked. The answer is no. I ran for about 20 days at c360 and the memory didn't budge an inch. I truly mean that: the number reported by the "AGCM Date" print did not change, AT ALL, so this seems to tentatively absolve FV3 and the MAPL generic layers.
  3. I also tried running ExtDataDriver.x configured to match what the model is doing. Ran that for about 20 days; the memory didn't budge. That seems to absolve ExtData.
  4. So, what can we do? The signal is small enough that timestep-to-timestep variations are useless, so we have to run longer to see the trend. All I can think of is to see what else we can disable. First I tried disabling the OGCM with BYPASS_OGCM. Memory use after 10 days was slightly lower (~0.5%) but still up about 4%.
  5. So then I thought, well, we've cut out the ocean; we can try disabling physics components. And here we have some luck. Disable chemistry and radiation and, poof, the memory only jumped by 0.6% over the 10 days. Unfortunately, disabling anything beyond radiation and chemistry seems to cause bad things in the model, and I was hit by various segfaults depending on what I turned off. Disabling just chemistry resulted in a 1.7% jump; disabling just radiation resulted in a 3.7% jump.
  6. Then I tried disabling phase 2 of FV3 (in addition to everything above); it had no effect.
  7. Then I did the opposite: is there a smoking gun in the physics that we can see by exercising each component more? I added the ability to run each physics component's run method multiple times per timestep, in a loop, and report the system memory before and after; maybe exercising the physics components a lot will show some signal (a rough sketch of this loop-and-report idea appears after this list). I did something crazy and set it to execute the run method of the physics components 200 times per timestep! Even then I could not see consistent timestep-to-timestep increases in the per-loop memory reports, despite the fact that that's like running a day in a single step. Yes, after 6 hours the memory had increased 4% according to the "AGCM Date" print from CAP, but the only thing I noticed is that sometimes the memory would increase by 0.1% after moist ran. Unfortunately, moist can't be disabled without the model being very unhappy...
  8. One other odd bit: the memory leak seems to be independent of the layout. When I ran the "stock" configuration (nothing disabled) on 216 cores rather than 1080, I saw the same ~4% memory leak.
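
A rough sketch of the loop-and-report idea from item 7 above (hedged: report_sys_mem() is a hypothetical stand-in for whatever system-memory query is used, and this is a generic illustration rather than the actual GEOS code):

```fortran
! Hedged sketch: exercise a component's Run method N times per model timestep
! and report system memory before and after, to amplify any per-call leak.
subroutine exercise_run(gc, importState, exportState, clock, nRepeat, rc)
   use ESMF
   implicit none
   type(ESMF_GridComp), intent(inout) :: gc
   type(ESMF_State),    intent(inout) :: importState, exportState
   type(ESMF_Clock),    intent(inout) :: clock
   integer,             intent(in)    :: nRepeat
   integer,             intent(out)   :: rc

   integer :: i
   real    :: memBefore, memAfter
   real, external :: report_sys_mem   ! hypothetical helper: % node memory in use

   memBefore = report_sys_mem()
   do i = 1, nRepeat                  ! e.g. nRepeat = 200 per timestep
      call ESMF_GridCompRun(gc, importState=importState, exportState=exportState, &
                            clock=clock, rc=rc)
   end do
   memAfter = report_sys_mem()

   print '(a,f6.2,a,f6.2)', 'mem before: ', memBefore, '  after: ', memAfter
end subroutine exercise_run
```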

So what can I conclude?

  1. Since FV3 standalone has a flat memory profile, it is not totally unreasonable to expect that a hierarchy driven by the MAPL Cap should not leak memory.
  2. ExtData seems an unlikely culprit.
  3. Bypassing the ocean does something, but not much. Bypassing radiation and chemistry clearly has an effect (more so chemistry); bypassing FV3 phase 2 does nothing.
  4. I was really hoping that running the physics components multiple times each timestep would show some signal, since disabling them all is not realistic; that would let us say "ah hah, that's the component chewing up memory". But I didn't see a strong signal: maybe moist, maybe radiation.
  5. The size of the leak appears independent of layout, i.e., running the same experiment on fewer cores showed roughly the same leak as a percentage of node memory reported by the system.
  6. Bottom line: the biggest thing I can turn off that reduces the leak is chemistry, then radiation. Turning those off eliminates most of the leak (but not all).

sanAkel commented 11 months ago

@bena-nasa Thanks for these extensive diagnostics and notes.

Glad to note

> This leak is real. I've been running v11.4.0 on the Milan nodes with the Intel compiler, in a c180 AMIP run (so no need for replay or coupling to see it) on the same number of cores @sanAkel was using (30x36). After 10 days the root memory has grown about 4%, though I did disable History and ExtData for this test.

Is there any progress/update from you .or. @atrayano .or. @mathomp4: any PR?

After I saw your and @atrayano's replies, I wanted to check some of my old experiments replaying to MERRA-2. I had used GEOSgcm v11.1.1 on sky/cas (as opposed to mil, which I use now).

To revisit the earlier question from @atrayano regarding forecast/replay:

> As far as the no-replay run, it would be a good baseline to see if anywhere else in the coupled system we might have memory leaks.

Here are plots from the forecast .and. replay experiments.

[Plot: used memory, forecast experiment]

[Plot: used memory, replay experiment]

Clearly

bena-nasa commented 11 months ago

@sanAkel I have no updates. I got sucked into some other things. I think we are dealing with MULTIPLE leaks, which makes this even worse.

Even with no replay the model clearly has a leak; as far as I can tell, the biggest causes are the radiation and chemistry components. More investigation into these is needed.

Replay just makes it worse, it seems; it could be mkiau that is causing it, or the fact that you are running the same components 1.5x as much, or both.

One thought on the exacerbated leak in replay is that mkiau is still using the old CFIO layer. Atanas says that is leaking and I 100% believe it; it's something I desperately want to get rid of because it's just bad code. It was written before you could use allocatables in derived types, so it's just an absolute mess of pointers that are probably chewing up memory. I have a drop-in replacement for CFIO utilizing the new IO framework used by History/ExtData. I've confirmed that I can indeed just drop it in, but I have not evaluated its impact on memory yet. However, it is non-zero-diff due to some subtle issues with the transformations; these are just numerical differences in how the coordinates for the input lat-lon grid are generated. The new code is the correct thing, so no one can tell me with a straight face that the old code is in any way more correct than the new, and I would lose no sleep over the non-zero diff.
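
Purely as an illustration of how such coordinate-generation differences arise (this is hypothetical, not the actual CFIO or replacement code): two mathematically equivalent ways of generating the latitudes of a lat-lon grid typically agree only to rounding error, which is already enough to make the regridded fields non-zero-diff.

```fortran
! Hedged illustration: latitudes built by accumulating the spacing vs. computed
! directly from a formula. They agree to rounding error but are generally not
! bit-identical.
program latlon_coord_diff
   implicit none
   integer, parameter :: nj = 601                  ! hypothetical 0.3-degree grid
   real(8), parameter :: dlat = 180.0d0 / (nj - 1)
   real(8) :: lat_acc(nj), lat_dir(nj)
   integer :: j

   lat_acc(1) = -90.0d0
   do j = 2, nj
      lat_acc(j) = lat_acc(j-1) + dlat              ! accumulate the spacing
   end do
   do j = 1, nj
      lat_dir(j) = -90.0d0 + (j - 1) * dlat         ! compute each point directly
   end do

   ! Rounding-level difference, usually nonzero for an inexact spacing like this.
   print *, 'max |acc - direct| =', maxval(abs(lat_acc - lat_dir))
end program latlon_coord_diff
```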

Unfortunately, others are more sensitive to this. In particular, the DAS workflow also uses mkiau in a different context, so I need to test that, as there is one subtle issue I need to make sure we don't trip over besides the simple and explainable non-zero diff. (This should be easy since it's mkiau.x, a little program with few inputs, but there's no way to test it, or at least get an input set, without running the whole DAS workflow, so I'll have to figure this out and test.) That's on my radar.

As far as more precise detection of the leak goes, the signal is so small that any sort of in-code memory diagnostics are rather useless on a timestep-to-timestep basis. You can only see this by looking at the longer signal. At that point the only tool I have is to eliminate code during execution and see if the leak goes away. That's tedious, and there's only so much you can turn off before the model goes haywire. The other tool is that Atanas has been looking at intercepting malloc, but that's not a magic bullet either.

sanAkel commented 10 months ago

@mathomp4 I accidentally deleted your comment regarding the cmake option.

Why can't we do that?

mathomp4 commented 10 months ago

@sanAkel Well, it's not that elegant, for one thing. But I don't really know how the code currently is or how it would be after changing. An ifdef around a single line is okay, but around 100 lines it becomes less useful!

sanAkel commented 10 months ago

> @sanAkel Well, it's not that elegant, for one thing. But I don't really know how the code currently is or how it would be after changing. An ifdef around a single line is okay, but around 100 lines it becomes less useful!

Thanks for rephrasing and adding (back) your comments. Sorry about that!

sanAkel commented 10 months ago

All,

Shall I close this issue?

Since replay would be redesigned?

atrayano commented 10 months ago

@sanAkel Please do not close the issue. This is a good reminder that we have memory leaks. By the way, I am not sure if replay is the "real" culprit. It simply makes the leaks more visible because we run the components more (about 50% more) due to the extra cycles in the predictor. I see a leak in mkIAU too, but we have memory leaks inside many of the components. I am developing/working with a tool that intercepts the malloc(s) and the free(s), and so far it has helped me identify a few of the leaks. Unfortunately, those were small. I'll post updates here...

sanAkel commented 10 months ago

Ok, @atrayano. I tend to keep my issues/PR list active; stale ones get closed. Will not close this one.

atrayano commented 10 months ago

@sanAkel I've been chasing memory leaks for the past two months and found many small ones. But today I hit gold. This is a major one, and it is most likely responsible for your issue(s). Ultimately it could be traced to MAPL, and ExtData2G in particular, but it was exposed by CO_GridCompMod.F90. If you run GEOS with nbins_CO=10 (the default setting, as per gcm_setup), the leak is very visible. But if you run with nbins_CO=1, there is no obvious leak. (By the way, this setting comes from RC/Chem_Registry.rc.) I've already fixed my local MAPL sandbox, and the test shows no leak. I'll create a hotfix PR in MAPL in the next day or so, but until then try lowering nbins_CO to 1 in your experiment. Once we/you confirm that this (or the upcoming MAPL change) helps, you can close the ticket.

sanAkel commented 10 months ago

I have a run that I started about an hour ago (6 PM).

Will try your fix tomorrow.

Thanks a lot for this. Appreciate it.

atrayano commented 10 months ago

The proper fix will be in MAPL, but it might take some time before it goes through the testing, review, release, etc. Until then, nbins_CO=1 is your friend. Luckily, Tom and I met with Bill today to share some of our findings, and he told us that nbins_CO=1 tracks CO globally. Anything else could track CO for specific geographic regions, and this involves masking. That is where the bug was.

sanAkel commented 10 months ago

In the context of coupled modeling, "nbins_CO=1 tracks the CO globally" seems to suffice.

sanAkel commented 10 months ago

@atrayano the run I was talking about ⬆️ is giving me a scare!

See /discover/nobackup/sakella/free_run_gold_per/slurm-28665381.out for the log file.

The start date is 2021/06/01; see these 😱 numbers!

AGCM Date: 2021/06/01  Time: 21:07:30  Throughput(days/day)[Avg Tot Run]:         10.2         10.2         44.1  TimeRemaining(Est) 075:25:23   54.9% :  34.7% Mem Comm:Used
 AGCM Date: 2021/06/01  Time: 21:15:00  Throughput(days/day)[Avg Tot Run]:        303.0        151.5        156.0  TimeRemaining(Est) 002:32:02   55.2% :  35.1% Mem Comm:Used
 AGCM Date: 2021/06/01  Time: 21:22:30  Throughput(days/day)[Avg Tot Run]:        244.5        176.5        182.5  TimeRemaining(Est) 003:08:23   55.2% :  35.2% Mem Comm:Used
 AGCM Date: 2021/06/01  Time: 21:30:00  Throughput(days/day)[Avg Tot Run]:        215.8        159.7        164.3  TimeRemaining(Est) 003:33:25   55.2% :  35.2% Mem Comm:Used
 AGCM Date: 2021/06/01  Time: 21:37:30  Throughput(days/day)[Avg Tot Run]:        211.3        195.5        202.3  TimeRemaining(Est) 003:37:52   55.2% :  35.3% Mem Comm:Used
 AGCM Date: 2021/06/01  Time: 21:45:00  Throughput(days/day)[Avg Tot Run]:        207.7        191.6        198.3  TimeRemaining(Est) 003:41:36   55.2% :  35.3% Mem Comm:Used
 AGCM Date: 2021/06/01  Time: 21:52:30  Throughput(days/day)[Avg Tot Run]:        204.3        186.2        192.5  TimeRemaining(Est) 003:45:15   55.2% :  35.3% Mem Comm:Used

and towards the end (🤞 that it doesn't crash!):

 AGCM Date: 2021/07/02  Time: 12:00:00  Throughput(days/day)[Avg Tot Run]:        166.8        117.7        155.8  TimeRemaining(Est) 000:11:52   98.9% :  85.6% Mem Comm:Used
 AGCM Date: 2021/07/02  Time: 12:07:30  Throughput(days/day)[Avg Tot Run]:        166.7         47.6         48.6  TimeRemaining(Est) 000:11:49   98.8% :  85.6% Mem Comm:Used
 AGCM Date: 2021/07/02  Time: 12:15:00  Throughput(days/day)[Avg Tot Run]:        166.7        198.7        217.8  TimeRemaining(Est) 000:11:47   98.8% :  85.6% Mem Comm:Used
 AGCM Date: 2021/07/02  Time: 12:22:30  Throughput(days/day)[Avg Tot Run]:        166.7        198.2        217.0  TimeRemaining(Est) 000:11:44   98.8% :  85.6% Mem Comm:Used
 AGCM Date: 2021/07/02  Time: 12:30:00  Throughput(days/day)[Avg Tot Run]:        166.7        198.9        217.7  TimeRemaining(Est) 000:11:41   98.8% :  85.6% Mem Comm:Used
 AGCM Date: 2021/07/02  Time: 12:37:30  Throughput(days/day)[Avg Tot Run]:        166.7        198.3        217.3  TimeRemaining(Est) 000:11:39   98.8% :  85.6% Mem Comm:Used
 AGCM Date: 2021/07/02  Time: 12:45:00  Throughput(days/day)[Avg Tot Run]:        166.7        198.6        217.4  TimeRemaining(Est) 000:11:36   98.8% :  85.6% Mem Comm:Used
 AGCM Date: 2021/07/02  Time: 12:52:30  Throughput(days/day)[Avg Tot Run]:        166.7        199.2        218.0  TimeRemaining(Est) 000:11:33   98.8% :  85.6% Mem Comm:Used
 AGCM Date: 2021/07/02  Time: 13:00:00  Throughput(days/day)[Avg Tot Run]:        166.7        146.1        156.0  TimeRemaining(Est) 000:11:30   99.1% :  85.6% Mem Comm:Used
 AGCM Date: 2021/07/02  Time: 13:07:30  Throughput(days/day)[Avg Tot Run]:        166.7        156.3        167.6  TimeRemaining(Est) 000:11:28   98.8% :  85.6% Mem Comm:Used
 AGCM Date: 2021/07/02  Time: 13:15:00  Throughput(days/day)[Avg Tot Run]:        166.7        198.9        218.0  TimeRemaining(Est) 000:11:25   98.8% :  85.6% Mem Comm:Used
 AGCM Date: 2021/07/02  Time: 13:22:30  Throughput(days/day)[Avg Tot Run]:        166.7        198.4        217.4  TimeRemaining(Est) 000:11:22   98.8% :  85.6% Mem Comm:Used
 AGCM Date: 2021/07/02  Time: 13:30:00  Throughput(days/day)[Avg Tot Run]:        166.7        198.4        217.1  TimeRemaining(Est) 000:11:20   98.8% :  85.6% Mem Comm:Used
 AGCM Date: 2021/07/02  Time: 13:37:30  Throughput(days/day)[Avg Tot Run]:        166.7        198.3        217.0  TimeRemaining(Est) 000:11:17   98.8% :  85.6% Mem Comm:Used
 AGCM Date: 2021/07/02  Time: 13:45:00  Throughput(days/day)[Avg Tot Run]:        166.8        198.8        217.8  TimeRemaining(Est) 000:11:14   98.8% :  85.6% Mem Comm:Used
 AGCM Date: 2021/07/02  Time: 13:52:30  Throughput(days/day)[Avg Tot Run]:        166.8        198.2        217.2  TimeRemaining(Est) 000:11:11   98.8% :  85.6% Mem Comm:Used
 AGCM Date: 2021/07/02  Time: 14:00:00  Throughput(days/day)[Avg Tot Run]:        166.8        146.5        156.7  TimeRemaining(Est) 000:11:09   99.1% :  85.7% Mem Comm:Used
 AGCM Date: 2021/07/02  Time: 14:07:30  Throughput(days/day)[Avg Tot Run]:        166.8        155.7        167.0  TimeRemaining(Est) 000:11:06   98.8% :  85.7% Mem Comm:Used
 AGCM Date: 2021/07/02  Time: 14:15:00  Throughput(days/day)[Avg Tot Run]:        166.8        198.8        217.8  TimeRemaining(Est) 000:11:03   98.9% :  85.7% Mem Comm:Used
 AGCM Date: 2021/07/02  Time: 14:22:30  Throughput(days/day)[Avg Tot Run]:        166.8        198.1        216.9  TimeRemaining(Est) 000:11:01   98.9% :  85.7% Mem Comm:Used
 AGCM Date: 2021/07/02  Time: 14:30:00  Throughput(days/day)[Avg Tot Run]:        166.8        197.8        216.6  TimeRemaining(Est) 000:10:58   98.9% :  85.7% Mem Comm:Used
 AGCM Date: 2021/07/02  Time: 14:37:30  Throughput(days/day)[Avg Tot Run]:        166.8        198.2        217.2  TimeRemaining(Est) 000:10:55   98.9% :  85.7% Mem Comm:Used
 AGCM Date: 2021/07/02  Time: 14:45:00  Throughput(days/day)[Avg Tot Run]:        166.8        198.5        217.4  TimeRemaining(Est) 000:10:52   98.9% :  85.7% Mem Comm:Used
 AGCM Date: 2021/07/02  Time: 14:52:30  Throughput(days/day)[Avg Tot Run]:        166.8        198.9        217.9  TimeRemaining(Est) 000:10:50   98.9% :  85.7% Mem Comm:Used

I'm making my own hotfix:

discover34.sakella 573> grep 'nbins_CO' RC/Chem_Registry.rc

nbins_CO: 10 # carbon monoxide

changed to:

nbins_CO: 1 # carbon monoxide

atrayano commented 10 months ago

@sanAkel Matt just released a MAPL with the bug fix (2.43.2), so even if you wanted to run with 10 bins, it won't leak memory. But there is a performance penalty for running with 10 bins. And by the way, according to Arlindo, nbins_CO=10 was there to support field campaigns; "normally" everybody should be using 1 bin.

sanAkel commented 10 months ago

but "normally" everybody should be using 1 bin

Then that's what the default should be.

cc: @amdasilva