FluidityProject / fluidity
http://fluidity-project.org

Running Fluidity with valgrind #281

Closed gnikit closed 4 years ago

gnikit commented 4 years ago

Issue description

Hi guys,

I recently had to track down an extremely annoying bug that only shows up on Jenkins, in parallel, with load balancing, on our OpenMP branch, in our spatial adapt algorithm (in FETCH, our spatial adapt looks a lot like the interpolation_metric module). I ended up running valgrind on unit tests to check for memory leaks, only to find the logs full of supposed errors. I went back to basics and ran a Fluidity unit test (test_qsort), configured with debugging enabled, only to discover the exact same thing.

Does anyone have any idea what all these errors in the log file are?

Steps to replicate

  1. Configure and build Fluidity with ./configure --enable-2d-adaptivity --enable-debugging && make clean && make all -j8
  2. Use the default PPA PETSc version (no debug symbols) <-- this probably does not help valgrind
  3. Run make build_unittest (FYI: gmsh2triangle breaks when converting cube_prismatic.msh and cube_unstructured.msh because gmsh exports the .msh files as binary instead of ASCII; an issue for another time!)
  4. cd bin/tests/ and run
    valgrind --tool=memcheck \
              --leak-check=full \
              --show-leak-kinds=all \
              --track-origins=yes \
              --verbose \
              --log-file=valgrind-qsort.log \
              ./test_qsort

It should produce a log similar to the attached valgrind-qsort.log

stephankramer commented 4 years ago

These appear to be mostly inside the python interpreter. Getting valgrind to play nicely with a linked python interpreter is hard, as python does its own memory management that sometimes circumvents malloc. I've never bothered, but I believe that to get a clean log you need to either rebuild your python with very specific options and/or use a very aggressive valgrind suppression file.

The way I use valgrind with fluidity is to just search for any .F90 in the backtrace (anything inside python usually has a long enough backtrace that it's cut off where it goes from fortran->c->python). Since there is no F90 in your valgrind log, I would conclude there's no (valgrind-detectable) memory leak in the fortran code.

There's also some stuff inside mpi, for which there is a similar story. Debian/Ubuntu seem to ship a /usr/share/openmpi/openmpi-valgrind.supp that might help suppress some of these (there is similar stuff for python in /usr/lib/valgrind/, but I think those are applied automatically already). Since it's all in mpi_init, again I wouldn't worry about it in terms of memory leaks.
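
For reference, a suppression file is passed to valgrind with the --suppressions option; a minimal sketch, reusing the command from the issue description and the Debian/Ubuntu path mentioned above (adjust the path to your install):

    valgrind --tool=memcheck \
             --leak-check=full \
             --suppressions=/usr/share/openmpi/openmpi-valgrind.supp \
             --log-file=valgrind-qsort.log \
             ./test_qsort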

gnikit commented 4 years ago

Thanks a lot for the feedback @stephankramer; I will do just that and ignore everything that is not in a *.F90 file. That should definitely help narrow things down in more realistic examples.
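
As a note for anyone else reading along, pulling the Fortran frames out of the log is just a grep away (log file name taken from the command in the issue description):

    grep -n '\.F90' valgrind-qsort.log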

Since we are on the topic of memory leaks, if an *.err-* file at the end of a run contains messages like the ones pasted below, does that mean we are leaking memory?

 tensor_field ErrorMetric has reference count 1
 and id 19
 mesh_type SurfaceCoordinateMesh has reference count 1
 and id 37
 element_type  has reference count 2
 and id 36
 element_type  has reference count 1
 and id 35
 quadrature_type  has reference count 3
 and id 17
 csr_matrix CoordinateMeshFaceList has reference count 1
 and id 25
 csr_sparsity EEListSparsity has reference count 2
 and id 57
 halo_type CoordinateMeshMaximalElementHalo has reference count 1
 and id 44
 halo_type CoordinateMeshMaximalElementHalo has reference count 1
 and id 43
 csr_sparsity NEListSparsity has reference count 1
 and id 55
 mesh_type CoordinateMesh has reference count 1
 and id 35
 halo_type CoordinateMeshLevel2Halo has reference count 1
 and id 41
 halo_type CoordinateMeshLevel1Halo has reference count 1
 and id 40
 tensor_field ErrorMetric has reference count 1
 and id 13
 mesh_type SurfaceCoordinateMesh has reference count 1
 and id 21
 element_type  has reference count 2
 and id 21
 element_type  has reference count 1
 and id 20
 quadrature_type  has reference count 3
 and id 10
 csr_matrix CoordinateMeshFaceList has reference count 1
 and id 13
 csr_sparsity EEListSparsity has reference count 2
 and id 30
 halo_type CoordinateMeshMaximalElementHalo has reference count 1
 and id 26
 halo_type CoordinateMeshMaximalElementHalo has reference count 1
 and id 25
 csr_sparsity NEListSparsity has reference count 1
 and id 28
 mesh_type CoordinateMesh has reference count 1
 and id 19
 halo_type CoordinateMeshLevel2Halo has reference count 1
 and id 23
 halo_type CoordinateMeshLevel1Halo has reference count 1
 and id 22
 tensor_field ErrorMetric has reference count 1
 and id 7
 tensor_field ErrorMetric has reference count 1
 and id 5
 csr_sparsity NEListSparsity has reference count 1
 and id 3
 halo_type CoordinateMeshMaximalElementHalo has reference count 2
 and id 3
 halo_type CoordinateMeshLevel2Halo has reference count 1
 and id 2
 halo_type CoordinateMeshLevel1Halo has reference count 1
 and id 1
 mesh_type SurfaceCoordinateMesh has reference count 1
 and id 2
 element_type  has reference count 2
 and id 3
 element_type  has reference count 1
 and id 2
 quadrature_type  has reference count 3
 and id 2
 csr_matrix CoordinateMeshFaceList has reference count 1
 and id 1
 csr_sparsity EEListSparsity has reference count 2
 and id 2
 csr_sparsity NEListSparsity has reference count 1
 and id 1
 mesh_type CoordinateMesh has reference count 2
 and id 1
 quadrature_type  has reference count 3
 and id 1
 Current memory usage in bytes:
          TotalMemory                 209740.
          MeshMemory                   83308.
          ScalarFieldMemory                0.
          VectorFieldMemory                0.
          TensorFieldMemory            47104.
          MatrixSparsityMemory         58880.
          MatrixMemory                 20448.
          TransformCacheMemory             0.

stephankramer commented 4 years ago

Potentially, in particular if the reference counts keep growing over multiple adapts.

So femtools implements reference counting for its main objects. For instance:

    call allocate(sfield, mesh)  ! sfield starts out with one reference
    call insert(state1, sfield)  ! it now has 2 references: sfield and state1
    call insert(state2, sfield)  ! refcount is now 3

If we now call deallocate(sfield) it doesn't actually deallocate the memory, it just decreases the refcount from 3 to 2, because it knows there are still two references to that field left. Only when the last reference is deallocated (for instance by calling deallocate_state on both states) is the memory actually freed.
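
To make that concrete, continuing the sfield/state1/state2 example above, the teardown would look roughly like this (call names as described above; the comments show the refcount):

    call deallocate(sfield)        ! refcount 3 -> 2, nothing is freed yet
    call deallocate_state(state1)  ! refcount 2 -> 1
    call deallocate_state(state2)  ! refcount 1 -> 0, the field's memory is now actually freed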

Now in fluidity the assumption is that all fields and meshes are stored in state, and not, for instance, in some (global) module variable. So whenever you write a subroutine that creates a new field/mesh/etc., what you're meant to do is insert it into state and then deallocate your own reference. At the end of a run we call deallocate_state() and at that point all objects should have no other references.
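
A minimal sketch of that convention (the subroutine name and field name are illustrative, and the exact allocate/insert argument lists may differ slightly from the real femtools interfaces):

    ! inside a module that uses the femtools fields and state_module modules
    subroutine add_error_metric(state, mesh)
      type(state_type), intent(inout) :: state
      type(mesh_type), intent(inout) :: mesh
      type(tensor_field) :: metric

      call allocate(metric, mesh, "ErrorMetric")  ! our local reference: refcount 1
      call insert(state, metric, "ErrorMetric")   ! state takes a reference: refcount 2
      call deallocate(metric)                     ! drop ours; the field lives on in state
    end subroutine add_error_metric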

During an adapt there is a similar check: at some point we want to deallocate the old states containing all the fields defined on the old meshes - this is also precisely why we need to have all fields and meshes stored centrally in state. However, at that point we have already interpolated some of these fields onto the new states on the new mesh. So what we do, before we allocate any new fields/meshes/etc., is call tag_references, which tags all current references associated with the old mesh, as we expect these to be deallocated when we eventually call deallocate_state on the old states. That is what gets printed at every adapt by calling print_tagged_references.
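
In pseudocode, the per-adapt bookkeeping described above boils down to the following order of events (the real calls take arguments and live in the adaptivity code; this only shows the sequence):

    call tag_references()             ! tag every reference currently tied to the old meshes
    ! ... build the new meshes and interpolate the old fields into the new states ...
    call deallocate_state(old_states) ! should drop the last reference to every tagged object
    call print_tagged_references()    ! whatever is still referenced is what ends up in the .err-* file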

gnikit commented 4 years ago

Once again, @stephankramer, thank you so much for taking the time to write such a detailed response. It has been a great help and I really appreciate it.

After your comment, I am pretty sure I know where to look in FETCH to sort out what is actually happening with all these printed reference tags.

I'll be closing this issue now.

Thanks!