Closed paskino closed 4 years ago
Try adding prints at this line to see if the destructor is called at each iteration.
…and sometimes not called at all!
We need to check for cyclic references and perhaps consider using weakref.
BTW which Python are you using?
PS. Ran into a problem with my B&B – had to return to London. Will work at UCLH/home the rest of the week.
From: Edoardo Pasca [mailto:notifications@github.com] Sent: 10 February 2020 18:32 To: CCPPETMR/SIRF Cc: Subscribed Subject: Re: [CCPPETMR/SIRF] memory leak in SIRFReg (#542)
The delhttps://docs.python.org/3/reference/datamodel.html#object.__del__ method is not a destructor in Python. It is a method that is called when the the garbage collector runs.
An interesting referencehttps://eli.thegreenplace.net/2009/06/12/safely-using-destructors-in-python/
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHubhttps://github.com/CCPPETMR/SIRF/issues/542?email_source=notifications&email_token=ADSZ6VF7XPS3CNRQ4OJXFJTRCGMS3A5CNFSM4KQ5CAVKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOELJXDVI#issuecomment-584282581, or unsubscribehttps://github.com/notifications/unsubscribe-auth/ADSZ6VARXWB7IFCGYVIL2N3RCGMS3ANCNFSM4KQ5CAVA.
I run a few tests and the __del__
method gets never called, hence the memory leak.
There may be some cyclic reference.... investigating.
This is Python 2.7 on a VM.
This might have been sorted out in Python 3 - could you try? I understand there are plans to stop Python 2.7 support.
Thanks for looking into this @paskino. Do you know which class is creating the memory leak? Presumably it's the NiftiImageData
or the NiftyResample
? Could be AffineTransformation
, but it seems unlikely that a 4x4 matrix would be eating that much memory.
I'm not sure. I added prints to all the relevant __del__
methods in the various ImageData
classes, i.e. SIRF.DataContainer STIR.ImageData Reg.NiftiImageData...
and made a much simpler test: create and delete an instance of an image.
import os, shutil
import sirf.STIR as pet
from sirf.Utilities import examples_data_path
import numpy
import sirf.Reg as Reg
os.chdir(examples_data_path('PET'))
if True:
shutil.rmtree('working_folder/memorytest',True)
shutil.copytree('brain','working_folder/memorytest')
os.chdir('working_folder/memorytest')
image = pet.ImageData('emission.hv')
i2 = image.copy()
i2 = 0
mat4 = numpy.eye(4)
tm = Reg.AffineTransformation(mat4)
resampler = Reg.NiftyResample()
resampler.set_reference_image(image)
resampler.set_floating_image(image)
resampler.add_transformation(tm)
resampler.set_padding_value(0)
resampler.set_interpolation_type_to_linear()
direct = resampler.direct(image)
direct = 0
adjoint = resampler.adjoint(image)
adjoint = 0
This outputs
STIR.ImageData __del__ with handle <Swig Object of type 'void *' at 0x7f4e24ac9750>.
STIR.ImageData __del__ with handle <Swig Object of type 'void *' at 0x7f4e2ed17570>.
STIR.ImageData __del__ with handle <Swig Object of type 'void *' at 0x7f4e2ed17570>.
2 things:
NiftiImageData
appears in the lines above.direct
and adjoint
instances have the same handle. In your output, there is only STIR.ImageData::__del__
, nothing from the registration stuff. Are you sure you put a print into the NiftyResample
and AffineTransformation
deletion methods?
I would also expect there to be 4 deletions: image
, i2
, direct
and adjoint
. Can you figure out which one is missing (or duplicate handle), by commenting them out one-by-one. I would also remove the i2
object, as it adds a confounding factor.
I have put that only in the NiftiImageData NiftiImageData3D
and all the series. Not in the NiftiResample
and AffineTransformation
In the above test, I expect only 3 deletions, as I'm running in a notebook so the interpreter is still alive. I expect the deletion of i2, direct
and adjoint
.
I added i2
to test if the deletion of the STIR.ImageData
was working as expected.
Now, I removed i2
and the output, as run in a script is
STIR.ImageData __del__ with handle <Swig Object of type 'void *' at 0x7ff890727c60>.
STIR.ImageData __del__ with handle <Swig Object of type 'void *' at 0x7ff890727c60>.
STIR.ImageData __del__ with handle <Swig Object of type 'void *' at 0x7ff8a479a870>.
The latter is the deletion of image
.
In your example __del__
won't be called for NiftiImageData
or NiftiImageData3D
because the conversion from STIR to NiftiImageData takes place on the C++ side.
I suppose the best thing to do if there's a memory leak is test on the C++ level first. If you're feeling motivated, you could stick a C++ version of your first code into test_cSynergistic.cpp
and run it with Valgrind's memcheck
. If not, I'll do it. Not sure when though.
could/should just run valgrind
on the current ctests. Best results when you'd build SIRF in Debug
or RelWithDebug
(or whatever it's called). Note however that due to https://github.com/CCPPETMR/SIRF-SuperBuild/issues/315, you'll have to force this.
I added prints to NiftyResample
and AffineTransformation
, this the output:
STIR.ImageData __del__ with handle <Swig Object of type 'void *' at 0x7fdbe379f390>.
STIR.ImageData __del__ with handle <Swig Object of type 'void *' at 0x7fdbe379f390>.
AffineTransformation __del__ handle <Swig Object of type 'void *' at 0x7fdbe36019f0>
NiftyResample __del__ handle <Swig Object of type 'void *' at 0x7fdbe9b88150>
STIR.ImageData __del__ with handle <Swig Object of type 'void *' at 0x7fdbe6c10990>.
I believe that the fact that the handle for direct
and adjoint
is the same is suspicious.
Guys, I have no clue on how to do this. I know names, but that's not my territory. @evgueni-ovtchinnikov ?
me neither
easy:
sudo apt install valgrind
valgrind test_whatever >& logfile
It'll report memory leaks and double deletes etc, some of them likely in the OS, but you don't care about those!
We could even add this in our test-suite. From this post, it seems we simply add
include(CTest)
and then (optionally) do ctest -T memcheck
.
I cannot connect to security.ubuntu.com from my VM. Is this happening on your VirtualBoxes?
sudo apt update
?
I rebuilt SIRF in Debug.
Valgring says:
==15880== LEAK SUMMARY:
==15880== definitely lost: 760 bytes in 10 blocks
==15880== indirectly lost: 11,098,740 bytes in 12,521 blocks
==15880== possibly lost: 517,202 bytes in 434 blocks
==15880== still reachable: 12,977,509 bytes in 6,693 blocks
==15880== suppressed: 0 bytes in 0 blocks
==15880== Reachable blocks (those to which a pointer was found) are not shown.
==15880== To see them, rerun with: --leak-check=full --show-leak-kinds=all
This is an extract of the very verbose output from valgrind that contains a million of lines regarding Python.
==15880== 2,823,440 (96 direct, 2,823,344 indirect) bytes in 1 blocks are definitely lost in loss record 1,040
of 1,043
==15880== at 0x4C3017F: operator new(unsigned long) (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==15880== by 0x19010902: stir::VoxelsOnCartesianGrid<float>::clone() const (in /home/sirfuser/devel/install/
python/sirf/_pystir.so)
==15880== by 0x18DDE8EF: sirf::STIRImageData::clone_impl() const (in /home/sirfuser/devel/install/python/sir
f/_pystir.so)
==15880== by 0x1C42F276: sirf::NiftyResample<float>::adjoint(std::shared_ptr<sirf::ImageData>, std::shared_ptr<sirf::ImageData const>) (in /home/sirfuser/devel/install/python/sirf/_pyreg.so)
==15880== by 0x1C40301D: cReg_NiftyResample_adjoint (in /home/sirfuser/devel/install/python/sirf/_pyreg.so)
==15880== by 0x1C3F2151: _wrap_cReg_NiftyResample_adjoint (in /home/sirfuser/devel/install/python/sirf/_pyreg.so)
==15880== 2,627,624 (96 direct, 2,627,528 indirect) bytes in 1 blocks are definitely lost in loss record 1,036
of 1,043
==15880== at 0x4C3017F: operator new(unsigned long) (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==15880== by 0x19010902: stir::VoxelsOnCartesianGrid<float>::clone() const (in /home/sirfuser/devel/install/
python/sirf/_pystir.so)
==15880== by 0x18DDE8EF: sirf::STIRImageData::clone_impl() const (in /home/sirfuser/devel/install/python/sir
f/_pystir.so)
==15880== by 0x18A0A957: cSIRF_clone (in /home/sirfuser/devel/install/python/sirf/_pysirf.so)
==15880== by 0x18A036A1: _wrap_cSIRF_clone (in /home/sirfuser/devel/install/python/sirf/_pysirf.so)
==15880== 2,823,440 (96 direct, 2,823,344 indirect) bytes in 1 blocks are definitely lost in loss record 1,040
of 1,043
==15880== at 0x4C3017F: operator new(unsigned long) (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==15880== by 0x19010902: stir::VoxelsOnCartesianGrid<float>::clone() const (in /home/sirfuser/devel/install/
python/sirf/_pystir.so)
==15880== by 0x18DDE8EF: sirf::STIRImageData::clone_impl() const (in /home/sirfuser/devel/install/python/sir
f/_pystir.so)
==15880== by 0x1C42F276: sirf::NiftyResample<float>::adjoint(std::shared_ptr<sirf::ImageData>, std::shared_p
tr<sirf::ImageData const>) (in /home/sirfuser/devel/install/python/sirf/_pyreg.so)
==15880== by 0x1C40301D: cReg_NiftyResample_adjoint (in /home/sirfuser/devel/install/python/sirf/_pyreg.so)
==15880== by 0x1C3F2151: _wrap_cReg_NiftyResample_adjoint (in /home/sirfuser/devel/install/python/sirf/_pyre
g.so)
==15880== 2,823,440 (96 direct, 2,823,344 indirect) bytes in 1 blocks are definitely lost in loss record 1,039
of 1,043
==15880== at 0x4C3017F: operator new(unsigned long) (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==15880== by 0x19010902: stir::VoxelsOnCartesianGrid<float>::clone() const (in /home/sirfuser/devel/install/
python/sirf/_pystir.so)
==15880== by 0x18DDE8EF: sirf::STIRImageData::clone_impl() const (in /home/sirfuser/devel/install/python/sir
f/_pystir.so)
==15880== by 0x1C430876: sirf::NiftyResample<float>::forward(std::shared_ptr<sirf::ImageData>, std::shared_p
tr<sirf::ImageData const>) (in /home/sirfuser/devel/install/python/sirf/_pyreg.so)
==15880== by 0x1C402B4D: cReg_NiftyResample_forward (in /home/sirfuser/devel/install/python/sirf/_pyreg.so)
==15880== by 0x1C3F22F1: _wrap_cReg_NiftyResample_forward (in /home/sirfuser/devel/install/python/sirf/_pyre
g.so)
==15880== 2,824,284 (96 direct, 2,824,188 indirect) bytes in 1 blocks are definitely lost in loss record 1,041
of 1,043
==15880== at 0x4C3017F: operator new(unsigned long) (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==15880== by 0x19010902: stir::VoxelsOnCartesianGrid<float>::clone() const (in /home/sirfuser/devel/install/
python/sirf/_pystir.so)
==15880== by 0x18DDE8EF: sirf::STIRImageData::clone_impl() const (in /home/sirfuser/devel/install/python/sir
f/_pystir.so)
==15880== by 0x18A0A957: cSIRF_clone (in /home/sirfuser/devel/install/python/sirf/_pysirf.so)
==15880== by 0x18A036A1: _wrap_cSIRF_clone (in /home/sirfuser/devel/install/python/sirf/_pysirf.so)
sudo apt update
?
it cannot reach security.ubuntu.com
and apt update
fails. For the sake of the exercise I just removed it from the apt sources.
what where you "valgrinding"? It'd be good to do this for the ctests. nice and isolated...
I cannot immediately see a leak on the C++ side of forward
.
I do see that it creates a output_nifti_sptr
which then never seems to be actually used (as the reg_resample
call uses this->_output_image_forward_nifti_sptr
), but that's an inefficiency, not a leak.
I also wonder why it saves the result in this->_output_image_sptr
(it's already in this->_output_image_forward_nifti_sptr
), but that's just another shared_ptr
.
so my initial impression that indeed this is somewhere in the Python interface, but again, I'd feel better if we run valgrind
on C++ test.
@paskino I don't think you're running the Debug
version by the way. you should see line-numbers in the valgrind
output
what where you "valgrinding"? It'd be good to do this for the ctests. nice and isolated...
I was valgringing valgrind --leak-check=full --show-leak-kinds=all python test_memory.py >& logfile_leak_debug_all
I do see that it creates a
output_nifti_sptr
which then never seems to be actually used (as the reg_resample call usesthis->_output_image_forward_nifti_sptr
), but that's an inefficiency, not a leak.
@KrisThielemans , I create output_nifti_sptr
so that I can check that the metadata of image that has been passed to the forward
method matches the metadata of the image that was used for the set_up
(this->_output_image_forward_nifti_sptr
). Then, since the output image is only used for metadata, it then doesn't matter whether output_nifti_sptr
or this->_output_image_forward_nifti_sptr
gets passed to NiftyReg.
I also wonder why it saves the result in
this->_output_image_sptr
(it's already inthis->_output_image_forward_nifti_sptr
), but that's just anothershared_ptr
.
I had to do this to keep backwards compatibility of get_output
(requires the most recent output to be stored in memory).
currently testing on sirf_resample
with the following parameters
valgrind --leak-check=full --show-leak-kinds=all sirf_resample -algo niftyreg -ref ~/devel/buildVM/sources/SIRF/data/examples/PET/brain/emission.hv -flo ~/devel/buildVM/sources/SIRF/data/examples/PET/brain/emission.hv -eng_ref STIR -eng_flo STIR -adj >& memory_leak_output_adjoint.log
valgrind --leak-check=full --show-leak-kinds=all sirf_resample -algo niftyreg -ref ~/devel/buildVM/sources/SIRF/data/examples/PET/brain/emission.hv -flo ~/devel/buildVM/sources/SIRF/data/examples/PET/brain/emission.hv -eng_ref STIR -eng_flo STIR >& memory_leak_output_direct.log
I think I stumbled upon a memory leak in
SIRFReg
Basically I'm running a
NiftyResample
forward and backward 2000 times. My machine has 100Gb of RAM and the memory occupancy goes up like this (first column is number of iteration, second is the % of memory occupied, 3rd is memory occupied every iteration)Below a minimal code to reproduce the above.