Closed: gnikit closed this issue 4 years ago
These appear to be mostly inside the Python interpreter. Getting valgrind to play nicely with a linked Python interpreter is hard, as Python does its own memory management that sometimes circumvents malloc. I've never bothered, but I believe that to get a clean log you need to either rebuild your Python with very specific options and/or use a very aggressive valgrind suppression file. The way I use valgrind with Fluidity is to just search for any .F90 in the backtrace (anything inside Python usually has a long enough backtrace that it's cut off where it goes from Fortran to C to Python). Since there is no .F90 in your valgrind log, I would conclude there's no (valgrind-detectable) memory leak in the Fortran code.
There's also some stuff inside MPI, for which there is a similar story. Debian/Ubuntu seem to have a /usr/share/openmpi/openmpi-valgrind.supp that might help suppress some of these (there are similar files for Python in /usr/lib/valgrind/, but I think those are applied automatically already). Since it's all in mpi_init, again I wouldn't worry about it in terms of memory leaks.
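The ".F90 in the backtrace" filter above is easy to script. A minimal sketch, using a made-up two-error log (the file name and stack frames are illustrative only, not taken from the real valgrind-qsort.log):

```shell
# Build a tiny fake valgrind log with two leak reports:
# one ending inside CPython, one ending in a Fortran source file.
cat > /tmp/fake_valgrind.log <<'EOF'
==123== 16 bytes in 1 blocks are definitely lost
==123==    at 0x4C2FB0F: malloc (vg_replace_malloc.c:299)
==123==    by 0x5A1B2C3: PyObject_Malloc (obmalloc.c:713)
==123== 32 bytes in 1 blocks are definitely lost
==123==    at 0x4C2FB0F: malloc (vg_replace_malloc.c:299)
==123==    by 0x1A2B3C4: __qsort_MOD_sort (Quicksort.F90:42)
EOF

# Keep only frames that mention a Fortran source file; anything
# purely inside Python or MPI is filtered out.
grep '\.F90' /tmp/fake_valgrind.log
```

Only the Quicksort.F90 frame survives the filter, which is the signal worth investigating; an empty result suggests no valgrind-detectable leak in the Fortran code.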
Thanks a lot for the feedback @stephankramer, I will do just that and ignore everything that is not *.F90. It should definitely help narrow things down in more realistic examples.
Since we are on the topic of memory leaks: if an *.err-* file at the end of a run contains messages like the ones pasted below, does that mean we are leaking memory?
tensor_field ErrorMetric has reference count 1 and id 19
mesh_type SurfaceCoordinateMesh has reference count 1 and id 37
element_type has reference count 2 and id 36
element_type has reference count 1 and id 35
quadrature_type has reference count 3 and id 17
csr_matrix CoordinateMeshFaceList has reference count 1 and id 25
csr_sparsity EEListSparsity has reference count 2 and id 57
halo_type CoordinateMeshMaximalElementHalo has reference count 1 and id 44
halo_type CoordinateMeshMaximalElementHalo has reference count 1 and id 43
csr_sparsity NEListSparsity has reference count 1 and id 55
mesh_type CoordinateMesh has reference count 1 and id 35
halo_type CoordinateMeshLevel2Halo has reference count 1 and id 41
halo_type CoordinateMeshLevel1Halo has reference count 1 and id 40
tensor_field ErrorMetric has reference count 1 and id 13
mesh_type SurfaceCoordinateMesh has reference count 1 and id 21
element_type has reference count 2 and id 21
element_type has reference count 1 and id 20
quadrature_type has reference count 3 and id 10
csr_matrix CoordinateMeshFaceList has reference count 1 and id 13
csr_sparsity EEListSparsity has reference count 2 and id 30
halo_type CoordinateMeshMaximalElementHalo has reference count 1 and id 26
halo_type CoordinateMeshMaximalElementHalo has reference count 1 and id 25
csr_sparsity NEListSparsity has reference count 1 and id 28
mesh_type CoordinateMesh has reference count 1 and id 19
halo_type CoordinateMeshLevel2Halo has reference count 1 and id 23
halo_type CoordinateMeshLevel1Halo has reference count 1 and id 22
tensor_field ErrorMetric has reference count 1 and id 7
tensor_field ErrorMetric has reference count 1 and id 5
csr_sparsity NEListSparsity has reference count 1 and id 3
halo_type CoordinateMeshMaximalElementHalo has reference count 2 and id 3
halo_type CoordinateMeshLevel2Halo has reference count 1 and id 2
halo_type CoordinateMeshLevel1Halo has reference count 1 and id 1
mesh_type SurfaceCoordinateMesh has reference count 1 and id 2
element_type has reference count 2 and id 3
element_type has reference count 1 and id 2
quadrature_type has reference count 3 and id 2
csr_matrix CoordinateMeshFaceList has reference count 1 and id 1
csr_sparsity EEListSparsity has reference count 2 and id 2
csr_sparsity NEListSparsity has reference count 1 and id 1
mesh_type CoordinateMesh has reference count 2 and id 1
quadrature_type has reference count 3 and id 1
Current memory usage in bytes:
TotalMemory 209740.
MeshMemory 83308.
ScalarFieldMemory 0.
VectorFieldMemory 0.
TensorFieldMemory 47104.
MatrixSparsityMemory 58880.
MatrixMemory 20448.
TransformCacheMemory 0.
Potentially, in particular if the reference counts keep increasing over multiple adapts.
So femtools implements reference counting for its main objects. For instance:
call allocate(sfield, mesh) ! sfield starts out with one reference
call insert(state1, sfield) ! it now has 2 references: sfield and state1
call insert(state2, sfield) ! refcount is 3 now
If we now call deallocate(sfield), it doesn't actually deallocate the memory; it just decreases the refcount from 3 to 2, because it knows there are still two references to that field left. Only when the last reference is deallocated (for instance by calling deallocate_state on both states) is the memory actually deallocated.
Now, in Fluidity the assumption is that all fields and meshes are stored in state, and not, for instance, in some (global) module variable. So whenever you call a subroutine that creates a new field/mesh/etc., what you're meant to do is insert it in state and then deallocate your own reference. At the end of a run we call deallocate_state(), and at that point all objects should have no other references.
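The semantics described above can be mimicked in plain Python. This is a toy analogy of femtools' counting and of the insert-into-state pattern, not its actual (Fortran) implementation; all class and method names here are invented:

```python
class RefCounted:
    """Toy analogue of a femtools reference-counted object."""
    def __init__(self, name):
        self.name = name
        self.refcount = 1      # allocate() hands back one reference
        self.freed = False

    def incref(self):
        self.refcount += 1

    def deallocate(self):
        # Only dropping the *last* reference actually frees the memory.
        self.refcount -= 1
        if self.refcount == 0:
            self.freed = True

class State:
    """Toy analogue of a Fluidity state: the central store of fields."""
    def __init__(self):
        self.fields = {}

    def insert(self, field):
        field.incref()                    # state holds its own reference
        self.fields[field.name] = field

    def deallocate(self):                 # cf. deallocate_state
        for field in self.fields.values():
            field.deallocate()
        self.fields.clear()

# The recommended pattern: insert into state, then drop your own reference.
sfield = RefCounted("ErrorMetric")  # refcount 1
state1, state2 = State(), State()
state1.insert(sfield)               # refcount 2
state2.insert(sfield)               # refcount 3
sfield.deallocate()                 # refcount 2: memory still alive
state1.deallocate()                 # refcount 1
state2.deallocate()                 # refcount 0: actually freed now
```

If the final sfield.deallocate() were forgotten, the object would end the run with a refcount of 1, which is exactly the kind of leftover reference the *.err-* messages report.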
During an adapt there is a similar check: at some point we want to deallocate the old states containing all fields defined on the old meshes (this is also precisely why we need all fields and meshes stored centrally in state). However, at that point we have already interpolated some of these fields to the new states on the new mesh. So what we do, before we allocate any new fields/meshes/etc., is call tag_references, which tags all current references associated with the old mesh, as we expect these to be deallocated when we eventually call deallocate_state on the old states. That is what's printed every adapt by calling print_tagged_references.
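Continuing the toy Python analogy from above, the tag-then-print check could be sketched like this. Again, every name here is invented; femtools' real tag_references/print_tagged_references operate on its Fortran objects:

```python
class Obj:
    """Minimal reference-counted object that registers itself globally."""
    def __init__(self, registry, name):
        self.name, self.refcount, self.tagged = name, 1, False
        registry.objects.append(self)

    def deallocate(self):
        self.refcount -= 1

class Registry:
    def __init__(self):
        self.objects = []

    def tag_references(self):
        # Snapshot before the adapt allocates anything new: everything
        # alive now is expected to die with the old states.
        for obj in self.objects:
            obj.tagged = True

    def print_tagged_references(self):
        # After the old states are deallocated, any tagged object that
        # still holds references is a leak candidate.
        leaks = [o for o in self.objects if o.tagged and o.refcount > 0]
        for o in leaks:
            print(f"{o.name} has reference count {o.refcount}")
        return leaks

reg = Registry()
old_mesh = Obj(reg, "CoordinateMesh")
forgotten = Obj(reg, "ErrorMetric")
reg.tag_references()          # mark everything tied to the old mesh
old_mesh.deallocate()         # properly released with the old state
leaks = reg.print_tagged_references()  # ErrorMetric still referenced
```

In this sketch only ErrorMetric is reported, because its reference was never dropped, mirroring how the per-adapt printout flags objects that should have gone away with the old states.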
Once again, @stephankramer, thank you so much for taking the time to write such a detailed response. It has been of great help and I really appreciate it.
After your comment I am pretty sure I know where to look in FETCH to sort out what is actually happening with all these printed reference tags.
I'll be closing this issue now.
Thanks!
Issue description
Hi guys,
I recently had to track down an extremely annoying bug only showing on Jenkins, in parallel, with load balancing, on our OpenMP branch, in our spatial adapt algorithm (in FETCH, our spatial adapt looks a lot like the interpolation_metric module). I ended up using valgrind on unit tests to check for memory leaks, only to find the logs full of supposed errors. I went back to basics and ran a Fluidity unit test (test_qsort), with debugging enabled in the configure, only to discover the exact same thing. Does anyone have any idea what all these errors in the log file are?
Steps to replicate
1. ./configure --enable-2d-adaptivity --enable-debugging && make clean && make all -j8
2. make build_unittest
   (FYI: gmsh2triangle breaks when converting cube_prismatic.msh and cube_unstructured.msh because gmsh exports the .msh files in binary instead of ASCII; an issue for another time!)
3. cd bin/tests/ and run the unit test under valgrind.
Should produce a similar log: valgrind-qsort.log