cmbant / CosmoMC

MCMC parameter sampling code
https://cosmologist.info/cosmomc/
82 stars 68 forks source link

Leak on Info pointer from class TTheoryIntermediateCache #35

Open eatdust opened 3 years ago

eatdust commented 3 years ago

Hi there, tested under gfortran-10.3.0, this "Info" pointer is leaking at each Cls evaluation.

From Calculator_CAMB.f90:

``

subroutine CAMBCalc_GetNewTransferData(this, CMB,Info,Theory,error)
...
class(TTheoryIntermediateCache), pointer :: Info
...
allocate(Info, source=this%DefaultInstance)

``

A possible fix is to keep the same all along the runs with:

if (.not.associated(Info)) allocate(Info, source=this%DefaultInstance)

or, maybe, writing an explicit deallocation in the Clear procedure, which is currently empty?

In GeneralType.f90

``

subroutine TTheoryIntermediateCache_Clear(Info)
class(TTheoryIntermediateCache) Info

end subroutine TTheoryIntermediateCache_Clear

``

Dunno what is best!

cmbant commented 3 years ago

Probably https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94109 ? Have you tried other gcc versions?

eatdust commented 3 years ago

I have just checked, gfortran-10.3.0 has the fix of #94109 included. Plus, I do not see any leak on CAMB, which suggests that #94109 is fixed for good, hopefully. I'll try to reduce it though, it is not very clear to me where it gets explicitly or implicitly freed.

cmbant commented 3 years ago

OK thanks. It should be freed in TTheoryIntermediateCache. CAMBdata stored in info only has allocatable components which should be freed automatically (in recent CAMB versions).

eatdust commented 3 years ago

Making progress, Info is in fact already associated when entering the CAMBCalc_GetNewTransferData routine the first time. This could explain the leak, the explicit allocate then create a new copy of the Info pointer. So, if I remove completely the allocation, everything works fine as well. Could it be that it is already allocated in CAMBcalc_SetCurrentPoint?

cmbant commented 3 years ago

Hmm not sure. The idea is that when sampling, you make a new point by coping the current one. This new point has a copy of the original data pointer. So in currentPoint, except at the start, Info should already be assigned to the same info as the old point. GetNewTransferData should then make a new instance for the new point's calculation. But it should then be freed consistently if the point is rejected. Do you find memory grows indefinitely?

eatdust commented 3 years ago

Do you find memory grows indefinitely?

Exactly! And quite fast, after a thousand calls to CAMB, this is game-over :)

cmbant commented 3 years ago

Can you reproduce e.g. with ifort or other gfortran version? I would have though a general issue like that would have been raised long ago (though I've not been using cosmomc myself).

eatdust commented 3 years ago

Yes, with gfortran-8.5.0. I was able to reproduce. With the unstable branch of gfortran-11.1.1 20210612, that's actually untestable as I am getting a SIGSEGV at start. This is unrelated though, the SIGSEV error in __fileutils_MOD_writeitemtxt (../camb/forutils/FileUtils.f90:1572).

With ifort 2021.2.0 20210228, it looks fine, memory grows for a while and then it is freed.

I'll be happy to add an explicit deallocate() of the leaking pointer for gfortran, but it is really not clear where it should be freed!

cmbant commented 3 years ago

Hmm, I wonder if a new gfortran memory leak bug has been introduced.

The CAMB mem test passes with trunk GNU Fortran (GCC) 12.0.0 20210613 https://travis-ci.com/github/cmbant/CAMB/jobs/515550950 so probably not directly a problem with CAMB dynamic allocatables.

If it were a CosmoMC leak, it should presumably also be there with ifort.

eatdust commented 3 years ago

Yes, I agree, I've passed CAMB under valgrind, it is clean with gfortran.

I'll try to understand in more details what is going on with this Info pointer. Removing the allocate stops the leak and I do not see any difference in the behaviour of the chains, with or without. It seems that gfortran makes a spurious extra copy each time its encounter this allocate statement. Tough to solve.

cmbant commented 3 years ago

I certainly can't reproduce a gfortran issue on trunk in simple test cases. Perhaps count calls to deallocate and allocate of Info and see if getting imbalanced?

cmbant commented 3 years ago

I don't see any very obvious memory issue in 8.5.0 using defaults with test.ini changed to action=0.

eatdust commented 3 years ago

Thank you for looking at it! I've just made the same test, with action=0 on test.ini and with 10.3.0, it looks also fine, the memory remains stable!

In my runs, I was using tensors and plik, so I've made another test with test.ini (and also test_planck.ini) having uncommented

compute_tensors = T param[r] = 0.03 0 2 0.04 0.04

And there, I do see the memory growing slowly (compiled with -fopenmp and running on 48 cores, no MPI). It grows slow, but once used with -DMPI, it became rapidly problematic on mpirun -np 64.

Anyway, good catch, I'll try to narrow it down if this is really related to tensors.

cmbant commented 3 years ago

No sure why tensors could be leaking memory - did you find anything else?

Why 64 processes - sounds way too much, or is this using polychord plugin or something else?

eatdust commented 3 years ago

I've been busy and did not have time to make more progress, but I will.

For 64, I am not looking for getting margestats only, I generate millions of sample to train a NN for other matters, and multiplying the number of chains, and running cosmomc a long time, is doing the job very well, provided I can keep the memory under control :)