AMReX-Codes / amrex

AMReX: Software Framework for Block Structured AMR
https://amrex-codes.github.io/amrex

Potential Memory Leak during RemakeLevel with AmrCore #1170

Closed: rhouim closed this issue 4 years ago

rhouim commented 4 years ago

Hi,

We have built a multiphase reactive flow solver based on AmrCore that has been working great for us. A couple of our users doing larger production runs have run into memory issues. I fixed a few memory leaks and "trimmed the fat" on our MultiFabs by removing unnecessary ghost cells, but the larger memory leak issue still remains.

I investigated this and found that memory usage increases tremendously during the call to RemakeLevel in Regrid, much faster than a linear growth rate with the number of cells. Specifically, the FillPatch operation that transfers data from the old BoxArray and DistributionMapping to the new layout seems to be the culprit. The FillPatch and RemakeLevel routines that we implemented are based on those in MAESTROeX. In RemakeLevel we define a temporary MultiFab using the new grid, use FillPatch to transfer data to the temporary MultiFab from the old grid, then swap the MultiFabs.
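
Schematically, the pattern looks like this (a minimal sketch with simplified, hypothetical names: MyAmr, ncomp, nghost, and the FillPatch wrapper are stand-ins for our actual class and members):

    #include <utility>              // std::swap
    #include <AMReX_MultiFab.H>

    // MyAmr derives from amrex::AmrCore; ncomp, nghost, cons_old are
    // assumed members; FillPatch is our application-defined wrapper.
    void MyAmr::RemakeLevel (int lev, amrex::Real time,
                             const amrex::BoxArray& ba,
                             const amrex::DistributionMapping& dm)
    {
        // Temporary MultiFab defined on the new BoxArray/DistributionMapping.
        amrex::MultiFab tmp(ba, dm, ncomp, nghost);

        // FillPatch copies/interpolates data from the old grids into tmp.
        FillPatch(lev, time, tmp, 0, ncomp);

        // Swap so tmp's data becomes the level data; the old data is
        // released when tmp goes out of scope.
        std::swap(tmp, cons_old[lev]);
    }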

The growth rate of the memory seems to depend on the size of the data transferred between processors in the call to FillPatch in RemakeLevel. We have a 3D test case where we wrote a checkpoint file at step 699 and then compared the memory usage at several time steps (720, 735, and 750). The calculation used 5 levels of refinement with a refinement ratio of 2, and was run on a single node using pure MPI with 64 processors. We used intel/openmpi4 with debug=false. We checked the memory usage on that node crudely by sshing into it and running top. We did verify that the behavior is the same with gnu8, or with intel using mpich3 or openmpi4, in debug or production mode. We also ran an octree-like grid with a blocking factor of 32 and a max grid size of 16 to make analysis of the memory a little simpler.

The results of the total memory usage are shown in the table below. (CP is an abbreviation for checkpoint).

step | num cells | tot mem (GB) | tot mem with CP (GB) | mem/cell (kB/cell) | mem/cell with CP (kB/cell) | tot mem w/ ghost cells and CP (GB)
699  | 22.1M     | 20.3         | 8.72                 | 0.963              | 0.414                      | 8.67
720  | 24.7M     | 24.2         | 13.7                 | 1.03               | 0.581                      | 26.8
735  | 26.6M     | 29.3         | 19.0                 | 1.16               | 0.748                      | 32.0
750  | 28.2M     | 33.8         | 23.3                 | 1.26               | 0.868                      | 47.0

The table shows that memory usage decreases by over a factor of 2 after reading in the checkpoint data, and that the growth rate of the memory usage is larger than the increase in the number of cells. The jumps in memory occur only during calls to regrid; the memory does not permanently increase in any other portion of the code. We can run through a bazillion time steps, plot outputs, checkpoint file writes, ghost cell fills, etc., and the total memory usage only slightly fluctuates if the grid never changes (i.e., RemakeLevel is never called).

The growth in the memory seems to be tied to the amount of data transferred in RemakeLevel during the FillPatch operation. The only difference in the calculations between the 4th and 7th columns is the MultiFab used to transfer data to the new grid in RemakeLevel. The 4th column transferred a MultiFab ("old" conserved variables) that did not have ghost cells to the new grid, while the 7th column transferred a MultiFab ("new" conserved variables) that included 4 ghost cells. The jump in memory between steps 720 and 699 for the case that included ghost cells in RemakeLevel is ~3.5 times higher than for the case that did not include ghost cells. The factor of 3.5 is a little larger than the ratio of the memory of 16x16x16 boxes with 4 ghost cells to boxes without ghost cells.
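
For reference, that ratio works out to (16 + 2*4)^3 / 16^3 = 24^3 / 16^3 = 13824 / 4096 ≈ 3.4 per box.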

The growth rate in memory is also dependent on the number of processors. The growth rate is somewhat reduced if we use 16 processors rather than 64, and it follows the growth rate in the number of cells if the calculation is run serially.

This is a vexing problem for us, and we are not sure how to fix it since we are following what is done in MAESTROeX. Any help would be appreciated.

Thanks.

WeiqunZhang commented 4 years ago

Could you add something like the following inside your time stepping loop? You can also add it around RemakeLevel. (The #include should live at the top of the file, inside the same #ifdef, rather than inside the loop.)

#ifdef AMREX_MEM_PROFILING
#include <AMReX_MemProfiler.H>
#endif

// inside the time stepping loop:
#ifdef AMREX_MEM_PROFILING
MemProfiler::report("a string");
#endif

Could you then compile your code with MEM_PROFILE=TRUE? When you run, you should get a file named memlog. Could you post your memlog here?
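
Assuming the GNU make build system, that is just an extra variable on the make command line, with your other flags unchanged, e.g.:

    make -j8 MEM_PROFILE=TRUE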

rhouim commented 4 years ago

Thanks for taking a look. Attached please find the memlog. I placed calls to MemProfiler::report at the start of the main routine that integrates the governing equations for each time step ("time step"), at the start of RemakeLevel ("Remake Level Start"), and at the end of RemakeLevel ("Remake Level End").

The parameters of the calculation are the same as column 4 in the table: Read in the checkpoint at step 699 and ran the calculation to 750 steps.

Please let me know if you need the memory reported at any other location in the code, or any other information.

memlog.txt

WeiqunZhang commented 4 years ago

It looks to me like the memory spike happens after the end of RemakeLevel. For example,

Remake Level Start Memory Profile Report Across Processes:
...
 * Proc VmPeak          VmSize               VmHWM                VmRSS        
   [ 1285 ... 1344  MB] [ 1254 ... 1338  MB] [  185 ... 243   MB] [  155 ... 238   MB]
...
Remake Level End Memory Profile Report Across Processes:
 * Proc VmPeak          VmSize               VmHWM                VmRSS     
   [ 1285 ... 1344  MB] [ 1254 ... 1338  MB] [  185 ... 243   MB] [  155 ... 238   MB]
...
time step Memory Profile Report Across Processes:
...
 * Proc VmPeak          VmSize               VmHWM                VmRSS                     
   [ 1291 ... 1365  MB] [ 1291 ... 1365  MB] [  191 ... 264   MB] [  191 ... 264   MB]

Note that VmRSS increased from 238 to 264.

The memory in AMReX objects (e.g., MultiFab) seems normal to me. Could you look carefully between the end of RemakeLevel and the start of the next time step? Maybe the new level forced the recreation of some objects, but the old ones were not deleted.

WeiqunZhang commented 4 years ago

Have you run valgrind? Something like:

    mpiexec -n 4 valgrind --leak-check=yes --track-origins=yes --log-file=vallog.%p ./foo.exe ...

rhouim commented 4 years ago

I re-ran the memory profile closer to a regrid that produced a large spike. I also instrumented the heck out of the code.

I made an extra call to FillPatch after regrid and before the first time step. The memory profiles before and after the first FillPatch are below. I only show the peak VmRSS because the minimum increases similarly.

Before First FillPatch | After First FillPatch
125 MB                 | 155 MB

There is quite a jump, but I assume that some data structures are being set up with the first call to FillPatch.

The code then ran through a single time step. The memory before and after shows a slight increase, likely indicating a small memory leak. Note that each time step is a full Strang split with several FillPatch and AverageDown calls.

Before first time step | After first time step
155 MB                 | 156 MB

Then a Regrid where the grid structure changed was performed. I assume the small jump between Regrid Start and RemakeLevel Start is mainly due to the new BoxArray and DistributionMapping. There is a fairly large change in memory between the start and end of RemakeLevel. There is another jump after rebuilding the second copy of the conserved variable MultiFabs (consNew) at the end of Regrid, after the call to AmrCore::regrid. Note that the MultiFab being rebuilt in RemakeLevel does not have ghost cells while consNew has 4 ghost cells. Thus, the jump in memory due to larger MultiFabs after regrid should be ~3.3 times lower in RemakeLevel than in rebuilding consNew, because all boxes are 16x16x16 without ghost cells. However, the jump in memory usage during RemakeLevel is nearly double the jump while rebuilding consNew.

Regrid Start | RemakeLevel Start | RemakeLevel End | Before Rebuild consNew | After Rebuild consNew | Regrid End
156 MB       | 158 MB            | 190 MB          | 190 MB                 | 208 MB                | 208 MB

Then 3 more time steps, a second regrid, and a final time step were performed. The memory is essentially flat for all of these steps. The second regrid did not change the grid, so the memory was unchanged. There is likely a small memory leak in the time stepping routine, around a few hundred KB/step, but nowhere near what is occurring during Regrid when the grid changes.

time step 2 | time step 3 | time step 4 | regrid 2 | final time step
208 MB      | 208 MB      | 209 MB      | 209 MB   | 209 MB

Let me know if you need any other information.

I am running Valgrind right now, but that will take a while to run and go through the files. I'll get back to you tomorrow on that.

WeiqunZhang commented 4 years ago

@zingale @ajnonaka @doreenfan @harpolea Have you noticed any similar issues with MAESTROeX?

rhouim commented 4 years ago

Valgrind did not turn up anything useful, just some small stuff during initialization and finalization.

I did some more poking around and instrumented the code even more finely. I also added a call to setVal for the temporary MultiFab in RemakeLevel to separate the portion of the memory jump due to the temporary MultiFab that is set up in RemakeLevel from what is happening elsewhere. Again, the table shows the max VmRSS.

RemakeLevel Start | After setVal in temp MultiFab | Before FillPatch | After FillPatch | RemakeLevel End
157 MB            | 168 MB                        | 168 MB           | 190 MB          | 190 MB

The jump in VmRSS due to the new Fab and explicitly setting its values is roughly 11 MB, while the jump during FillPatch is 22 MB. Note: when I removed the call to setVal, the entirety of the memory jump in RemakeLevel was in FillPatch, presumably because it sets the data values in the new MultiFab.

I then instrumented FillPatch and found that the entirety of the memory jump is in FillPatchTwoLevels. Following the proverbial rabbit hole, I instrumented FillPatchTwoLevels and found that the memory jump is produced during the call to FillPatchSingleLevel within FillPatchTwoLevels. Finally, I instrumented FillPatchSingleLevel and found that the entirety of the memory jump is in the call to ParallelCopy, in the smf.size() == 1 branch, i.e.:

      mf.ParallelCopy(*smf[0], scomp, dcomp, ncomp, IntVect{0}, nghost, geom.periodicity());

It's almost as if some temporary data is allocated during ParallelCopy but not released from memory.

I also checked the memory jump from the first call to FillPatch in the above post, which operates on a MultiFab that contains ghost cells, prior to the main time step loop. The routine that produces the memory jump in that case is FillBoundary in FillPatchSingleLevel.

Below are the salient portions of RemakeLevel, FillPatch, and GetData from our code. Note that we do not subcycle in time, so GetData only returns a single MultiFab, and the value of the time vector is the value of time in the function argument.

RemakeLevel

    const int nc_c  = cons_old[lev].nComp();
    const int ng_o  = cons_old[lev].nGrow();

    MultiFab o_state(ba, dm, nc_c, ng_o);

    MemProfiler::report("Remake Level Before Set Val");
    o_state.setVal(1.); // Explicitly set the data in new Fab.  Remove after issue fixed. 

    MemProfiler::report("Remake Level FillPatch Start");
    FillPatch(lev, t_new, o_state, 0, 0, nVar, 0, fill_variable::cons, false ); // Fill o_state from cons_old
    MemProfiler::report("Remake Level FillPatch End");

    cons_old[lev].clear(); //This doesn't do anything for the Memory.  Remove after issue fixed. 
    std::swap(o_state, cons_old[lev]);

    MemProfiler::report("Remake Level After Swap");

FillPatch

 FillPatch (int lev, Real time,
            MultiFab& mf,
            int srccomp, int destcomp, int ncomp, int startbccomp,
            int variable_type, bool new_variable)
 {

       if (lev == 0) {
           // The usual lev == 0 code is here
       } else {

        Vector<MultiFab*> cmf, fmf;
        Vector<Real> ctime, ftime;

        // Get the data
        MemProfiler::report("FillPatch Start");
        GetData(lev-1, time, cmf, ctime, variable_type, new_variable);
        GetData(lev, time, fmf, ftime, variable_type, new_variable);

        PhysBCFunct<BndryFuncArray> cphysbc;
        PhysBCFunct<BndryFuncArray> fphysbc;
        Interpolater* mapper;

        MemProfiler::report("FillPatch Before Get physbcs");
        // Get the bcs and interpolation function on the coarse and fine levels
        switch(variable_type) {
        case fill_variable::cons :
            bcs = cons_bcs;
            cphysbc.define(geom[lev-1],bcs,BndryFuncArray(cons_fill));
            fphysbc.define(geom[lev  ],bcs,BndryFuncArray(cons_fill));
            mapper = &cell_cons_interp; //&cell_bilinear_interp;
            break;
        // ... similar cases for other variables ...
        }

        MemProfiler::report("FillPatch Before FillPatchTwoLevels");
        // Fill the ghost cells
        FillPatchTwoLevels(mf, time, cmf, ctime, fmf, ftime,
                        srccomp, destcomp, ncomp, geom[lev-1], geom[lev],
                        cphysbc, 0, fphysbc, 0, refRatio(lev-1),
                        mapper, bcs, 0); 
   }
}

GetData

 GetData (int lev, Real time,
          Vector<MultiFab*>& mf,
          Vector<Real>& mftime,
          int variable_type,
          bool new_variable)
{

    mf.clear();
    mftime.clear();
    mftime.push_back(time);

    switch(variable_type) {
    case fill_variable::cons :

        if (new_variable){
            mf.push_back(&cons_new[lev]);
        } else {
            mf.push_back(&cons_old[lev]);
        }
        break;
        // ... similar for other variables ...
    }
}

WeiqunZhang commented 4 years ago

Can you run with amrex.v=2 and show me the output on the Arenas at the end? I wonder if it's because of The_Pinned_Arena.
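
amrex.v is an ordinary ParmParse parameter, so it can go in the inputs file or be appended on the command line, e.g. (executable and inputs file names here are placeholders):

    ./main3d.ex inputs amrex.v=2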

rhouim commented 4 years ago

Here is the output of the Arenas:

[The  Pinned Arena] space (MB) allocated spread across MPI: [28 ... 74]
[The  Pinned Arena] space (MB) used      spread across MPI: [0 ... 0]

WeiqunZhang commented 4 years ago

Is this a big run or a small run? Communication routines use the pinned arena. It allocates memory from the system and holds on to it after the communication routines "free" the memory. If it is then asked for a size bigger than what it holds, it allocates from the system again. In principle, if the problem steadily increases in size, the pinned arena may have to keep asking the system for new memory while all the previous allocations go to waste. So can you do a big run?
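
Roughly, the failure mode is like this sketch (an illustration only, not the actual arena implementation):

    #include <cstdlib>
    #include <vector>
    #include <utility>

    // A caching allocator that never returns memory to the system.
    struct CachingArena {
        std::vector<std::pair<void*, std::size_t>> chunks; // cached blocks

        void* alloc (std::size_t nbytes) {
            for (auto& c : chunks) {
                if (c.second >= nbytes) return c.first; // reuse a cached block
            }
            // No cached block is big enough, so ask the system for more.
            // The smaller cached blocks are never released, so from now
            // on they are wasted.
            void* p = std::malloc(nbytes);
            chunks.emplace_back(p, nbytes);
            return p;
        }
    };

    // If each regrid of a growing problem needs a slightly larger buffer
    // (say 28 MB, then 30 MB, then 33 MB, ...), the footprint approaches
    // the sum of the requests rather than their maximum.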

rhouim commented 4 years ago

Is there a way to periodically flush the Pinned Arena?

This is an expanding blast problem whose size steadily increases. Initially almost none of the domain is refined, and the number of refined cells increases as the blast expands. So what you said fits. The major growth of memory, aside from the larger MultiFabs, seems to be within a call to FillPatch anytime the grid changes, as evidenced by the fact that the first call to FillPatch increases the memory but subsequent calls in the time step loop do not. Then, in the first call to Regrid after the checkpoint, the grid changes and the memory increases. Further calls to FillPatch do not increase the memory until the grid changes again.

Some of my users will always be running problems that steadily increase in size. One of our users was running a problem that ran out of memory on the compute nodes of one of the DoD machines. We looked at the processor with the largest number of cells and applied a worst-case scenario for memory (i.e., a bunch of little boxes with ghost cells). The memory usage should have been safe by a significant margin, even assuming a factor of 2 to 3 for AMReX overhead.

This problem isn't that large, but I can do a bigger version on one of the DoD machines. Might take me a few days. I'll let you know the results when I get them.

WeiqunZhang commented 4 years ago

Besides the run with the current version, could you do another run with the following change?

--- a/Src/Base/AMReX_FabArrayBase.cpp
+++ b/Src/Base/AMReX_FabArrayBase.cpp
@@ -119,7 +119,8 @@ FabArrayBase::Initialize ()
     if (ParallelDescriptor::UseGpuAwareMpi()) {
         the_fa_arena = The_Device_Arena();
     } else {
-        the_fa_arena = The_Pinned_Arena();
+//        the_fa_arena = The_Pinned_Arena();
+        the_fa_arena = The_Cpu_Arena();
     }

     amrex::ExecOnFinalize(FabArrayBase::Finalize);

If the pinned arena is the problem, this change will fix it. For CPU runs, we don't actually need to use pinned memory, but for GPUs we do want to. In that case we probably have to release small memory chunks back to the system to avoid this kind of problem.

rhouim commented 4 years ago

It looks like that change helped quite a lot. The memory usage after step 705 went from a max VmRSS of 208 MB using The_Pinned_Arena to 165 MB using The_Cpu_Arena.

I restarted the simulation from the start to check how the memory grew with time. The memory at step 699 is now 12.7 GB using The_Cpu_Arena and it was 20.3 GB using The_Pinned_Arena.

WeiqunZhang commented 4 years ago

Great! We will fix this for the CPU build first and think about how to handle it for the GPU build.

rhouim commented 4 years ago

Awesome! Thanks for looking into this!

I'll put the modified file in our code repo until it is fixed in the AMReX repo for the CPU build.