SyneRBI / SIRF

Main repository for the CCP SynerBI software
http://www.ccpsynerbi.ac.uk
Other
59 stars 29 forks source link

memory leak in SIRFReg #542

Closed paskino closed 4 years ago

paskino commented 4 years ago

I think I stumbled upon a memory leak in SIRFReg

Basically I'm running a NiftyResample forward and backward 2000 times. My machine has 100Gb of RAM and the memory occupancy goes up like this (first column is number of iteration, second is the % of memory occupied, 3rd is memory occupied every iteration)

(100, 1.3999170911171037, 0.013999170911171038)
(200, 2.6157330165015225, 0.013078665082507612)
(300, 3.831273096153702, 0.012770910320512339)
(400, 5.047093331627687, 0.012617733329069219)
(500, 6.262560139757241, 0.012525120279514482)
(600, 7.478363134872961, 0.012463938558121602)
(700, 8.69390321452514, 0.012419861735035913)
(800, 9.90972776008869, 0.012387159700110863)
(900, 11.125263529651304, 0.012361403921834782)
(1000, 12.34104928440876, 0.01234104928440876)
(1100, 13.55657643379224, 0.012324160394356581)
(1200, 14.77240097935579, 0.012310334149463158)
(1300, 15.987932438828839, 0.012298409568329876)
(1400, 17.203731123854993, 0.012288379374182139)
(1500, 18.41934016494023, 0.012279560109960154)
(1600, 19.63494920602547, 0.012271843253765919)
(1700, 20.85055824711071, 0.012265034263006298)
(1800, 22.06616728819595, 0.012258981826775527)
(1900, 23.28206079519256, 0.012253716207996083)

Below a minimal code to reproduce the above.

import numpy
from numpy.linalg import norm
import matplotlib.pyplot as plt
import matplotlib.animation as animation
import os
import sys
import shutil

import sirf.STIR as pet
from sirf.Utilities import examples_data_path

import os
import psutil
process = psutil.Process(os.getpid())

# schemes: file, memory
pet.AcquisitionData.set_storage_scheme('file')

os.chdir(examples_data_path('PET'))

if True:
    shutil.rmtree('working_folder/memorytest',True)
    shutil.copytree('brain','working_folder/memorytest')
os.chdir('working_folder/memorytest')

image = pet.ImageData('emission.hv')

import sirf.Reg as Reg
from scipy.spatial.transform import Rotation as R

def get_resampler(directions, angles, image, degrees=True ):
    '''example input 'zy', [87,13], degrees=True'''
    r = R.from_euler(directions, angles, degrees=degrees)

    mat = r.as_dcm()

    tm = Reg.AffineTransformation()
    mat4 = tm.as_array()

    for i in range(3):
        for j in range(3):
            mat4[i][j] = mat[i][j]

    tm = Reg.AffineTransformation(mat4)

    mat = tm.as_array()

    resampler = Reg.NiftyResample()
    resampler.set_reference_image(image)
    resampler.set_floating_image(image)
    resampler.add_transformation(tm)
    resampler.set_padding_value(0)
    resampler.set_interpolation_type_to_linear()

    return resampler

resampler = get_resampler('zy', [-12.,30.], image, degrees=True) 
memocc = []

for i in range(2000):
    # new ImageData
    step = resampler.direct(image)
    # new AcquisitionData
    out = resampler.adjoint(step)
    if i % 100 == 0 and i > 0:
        memocc.append(process.memory_percent())
        print(i, memocc[-1], memocc[-1]/float(i))
# del out
paskino commented 4 years ago

Try adding prints at this line to see if the destructor is called at each iteration.

paskino commented 4 years ago

The __del__ method is not a destructor in Python. It is a method that is called when the the garbage collector runs.

An interesting reference

evgueni-ovtchinnikov commented 4 years ago

…and sometimes not called at all!

We need to check for cyclic references and perhaps consider using weakref.

BTW which Python are you using?

PS. Ran into a problem with my B&B – had to return to London. Will work at UCLH/home the rest of the week.

From: Edoardo Pasca [mailto:notifications@github.com] Sent: 10 February 2020 18:32 To: CCPPETMR/SIRF Cc: Subscribed Subject: Re: [CCPPETMR/SIRF] memory leak in SIRFReg (#542)

The delhttps://docs.python.org/3/reference/datamodel.html#object.__del__ method is not a destructor in Python. It is a method that is called when the the garbage collector runs.

An interesting referencehttps://eli.thegreenplace.net/2009/06/12/safely-using-destructors-in-python/

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHubhttps://github.com/CCPPETMR/SIRF/issues/542?email_source=notifications&email_token=ADSZ6VF7XPS3CNRQ4OJXFJTRCGMS3A5CNFSM4KQ5CAVKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOELJXDVI#issuecomment-584282581, or unsubscribehttps://github.com/notifications/unsubscribe-auth/ADSZ6VARXWB7IFCGYVIL2N3RCGMS3ANCNFSM4KQ5CAVA.

paskino commented 4 years ago

I run a few tests and the __del__ method gets never called, hence the memory leak. There may be some cyclic reference.... investigating.

paskino commented 4 years ago

This is Python 2.7 on a VM.

evgueni-ovtchinnikov commented 4 years ago

This might have been sorted out in Python 3 - could you try? I understand there are plans to stop Python 2.7 support.

rijobro commented 4 years ago

Thanks for looking into this @paskino. Do you know which class is creating the memory leak? Presumably it's the NiftiImageData or the NiftyResample? Could be AffineTransformation, but it seems unlikely that a 4x4 matrix would be eating that much memory.

paskino commented 4 years ago

I'm not sure. I added prints to all the relevant __del__ methods in the various ImageData classes, i.e. SIRF.DataContainer STIR.ImageData Reg.NiftiImageData... and made a much simpler test: create and delete an instance of an image.


import os, shutil
import sirf.STIR as pet
from sirf.Utilities import examples_data_path
import numpy
import sirf.Reg as Reg

os.chdir(examples_data_path('PET'))

if True:
    shutil.rmtree('working_folder/memorytest',True)
    shutil.copytree('brain','working_folder/memorytest')
os.chdir('working_folder/memorytest')

image = pet.ImageData('emission.hv')

i2 = image.copy()

i2 = 0

mat4 = numpy.eye(4)
tm = Reg.AffineTransformation(mat4)

resampler = Reg.NiftyResample()
resampler.set_reference_image(image)
resampler.set_floating_image(image)
resampler.add_transformation(tm)
resampler.set_padding_value(0)
resampler.set_interpolation_type_to_linear()

direct = resampler.direct(image)

direct = 0

adjoint = resampler.adjoint(image)
adjoint = 0

This outputs

STIR.ImageData __del__ with handle <Swig Object of type 'void *' at 0x7f4e24ac9750>.
STIR.ImageData __del__ with handle <Swig Object of type 'void *' at 0x7f4e2ed17570>.
STIR.ImageData __del__ with handle <Swig Object of type 'void *' at 0x7f4e2ed17570>.

2 things:

  1. No NiftiImageData appears in the lines above.
  2. The deleted direct and adjoint instances have the same handle.
rijobro commented 4 years ago

In your output, there is only STIR.ImageData::__del__, nothing from the registration stuff. Are you sure you put a print into the NiftyResample and AffineTransformation deletion methods?

rijobro commented 4 years ago

I would also expect there to be 4 deletions: image, i2, direct and adjoint. Can you figure out which one is missing (or duplicate handle), by commenting them out one-by-one. I would also remove the i2 object, as it adds a confounding factor.

paskino commented 4 years ago

I have put that only in the NiftiImageData NiftiImageData3D and all the series. Not in the NiftiResample and AffineTransformation

In the above test, I expect only 3 deletions, as I'm running in a notebook so the interpreter is still alive. I expect the deletion of i2, direct and adjoint.

I added i2 to test if the deletion of the STIR.ImageData was working as expected.

paskino commented 4 years ago

Now, I removed i2 and the output, as run in a script is


STIR.ImageData __del__ with handle <Swig Object of type 'void *' at 0x7ff890727c60>.
STIR.ImageData __del__ with handle <Swig Object of type 'void *' at 0x7ff890727c60>.
STIR.ImageData __del__ with handle <Swig Object of type 'void *' at 0x7ff8a479a870>.

The latter is the deletion of image.

rijobro commented 4 years ago

In your example __del__ won't be called for NiftiImageData or NiftiImageData3D because the conversion from STIR to NiftiImageData takes place on the C++ side.

I suppose the best thing to do if there's a memory leak is test on the C++ level first. If you're feeling motivated, you could stick a C++ version of your first code into test_cSynergistic.cpp and run it with Valgrind's memcheck. If not, I'll do it. Not sure when though.

KrisThielemans commented 4 years ago

could/should just run valgrind on the current ctests. Best results when you'd build SIRF in Debug or RelWithDebug (or whatever it's called). Note however that due to https://github.com/CCPPETMR/SIRF-SuperBuild/issues/315, you'll have to force this.

paskino commented 4 years ago

I added prints to NiftyResample and AffineTransformation, this the output:


 STIR.ImageData __del__ with handle <Swig Object of type 'void *' at 0x7fdbe379f390>.
 STIR.ImageData __del__ with handle <Swig Object of type 'void *' at 0x7fdbe379f390>.
AffineTransformation __del__ handle <Swig Object of type 'void *' at 0x7fdbe36019f0>
NiftyResample __del__ handle <Swig Object of type 'void *' at 0x7fdbe9b88150>
 STIR.ImageData __del__ with handle <Swig Object of type 'void *' at 0x7fdbe6c10990>.

I believe that the fact that the handle for direct and adjoint is the same is suspicious.

paskino commented 4 years ago

Guys, I have no clue on how to do this. I know names, but that's not my territory. @evgueni-ovtchinnikov ?

evgueni-ovtchinnikov commented 4 years ago

me neither

KrisThielemans commented 4 years ago

easy:

sudo apt install valgrind
valgrind test_whatever >& logfile

It'll report memory leaks and double deletes etc, some of them likely in the OS, but you don't care about those!

We could even add this in our test-suite. From this post, it seems we simply add

include(CTest)

and then (optionally) do ctest -T memcheck.

paskino commented 4 years ago

I cannot connect to security.ubuntu.com from my VM. Is this happening on your VirtualBoxes?

KrisThielemans commented 4 years ago

sudo apt update?

paskino commented 4 years ago

I rebuilt SIRF in Debug.

Valgring says:

==15880== LEAK SUMMARY:
==15880==    definitely lost: 760 bytes in 10 blocks
==15880==    indirectly lost: 11,098,740 bytes in 12,521 blocks
==15880==      possibly lost: 517,202 bytes in 434 blocks
==15880==    still reachable: 12,977,509 bytes in 6,693 blocks
==15880==         suppressed: 0 bytes in 0 blocks
==15880== Reachable blocks (those to which a pointer was found) are not shown.
==15880== To see them, rerun with: --leak-check=full --show-leak-kinds=all

This is an extract of the very verbose output from valgrind that contains a million of lines regarding Python.


==15880== 2,823,440 (96 direct, 2,823,344 indirect) bytes in 1 blocks are definitely lost in loss record 1,040 
of 1,043
==15880==    at 0x4C3017F: operator new(unsigned long) (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==15880==    by 0x19010902: stir::VoxelsOnCartesianGrid<float>::clone() const (in /home/sirfuser/devel/install/
python/sirf/_pystir.so)
==15880==    by 0x18DDE8EF: sirf::STIRImageData::clone_impl() const (in /home/sirfuser/devel/install/python/sir
f/_pystir.so)
==15880==    by 0x1C42F276: sirf::NiftyResample<float>::adjoint(std::shared_ptr<sirf::ImageData>, std::shared_ptr<sirf::ImageData const>) (in /home/sirfuser/devel/install/python/sirf/_pyreg.so)
==15880==    by 0x1C40301D: cReg_NiftyResample_adjoint (in /home/sirfuser/devel/install/python/sirf/_pyreg.so)
==15880==    by 0x1C3F2151: _wrap_cReg_NiftyResample_adjoint (in /home/sirfuser/devel/install/python/sirf/_pyreg.so)

==15880== 2,627,624 (96 direct, 2,627,528 indirect) bytes in 1 blocks are definitely lost in loss record 1,036 
of 1,043
==15880==    at 0x4C3017F: operator new(unsigned long) (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==15880==    by 0x19010902: stir::VoxelsOnCartesianGrid<float>::clone() const (in /home/sirfuser/devel/install/
python/sirf/_pystir.so)
==15880==    by 0x18DDE8EF: sirf::STIRImageData::clone_impl() const (in /home/sirfuser/devel/install/python/sir
f/_pystir.so)
==15880==    by 0x18A0A957: cSIRF_clone (in /home/sirfuser/devel/install/python/sirf/_pysirf.so)
==15880==    by 0x18A036A1: _wrap_cSIRF_clone (in /home/sirfuser/devel/install/python/sirf/_pysirf.so)

==15880== 2,823,440 (96 direct, 2,823,344 indirect) bytes in 1 blocks are definitely lost in loss record 1,040 
of 1,043
==15880==    at 0x4C3017F: operator new(unsigned long) (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==15880==    by 0x19010902: stir::VoxelsOnCartesianGrid<float>::clone() const (in /home/sirfuser/devel/install/
python/sirf/_pystir.so)
==15880==    by 0x18DDE8EF: sirf::STIRImageData::clone_impl() const (in /home/sirfuser/devel/install/python/sir
f/_pystir.so)
==15880==    by 0x1C42F276: sirf::NiftyResample<float>::adjoint(std::shared_ptr<sirf::ImageData>, std::shared_p
tr<sirf::ImageData const>) (in /home/sirfuser/devel/install/python/sirf/_pyreg.so)
==15880==    by 0x1C40301D: cReg_NiftyResample_adjoint (in /home/sirfuser/devel/install/python/sirf/_pyreg.so)
==15880==    by 0x1C3F2151: _wrap_cReg_NiftyResample_adjoint (in /home/sirfuser/devel/install/python/sirf/_pyre
g.so)

==15880== 2,823,440 (96 direct, 2,823,344 indirect) bytes in 1 blocks are definitely lost in loss record 1,039 
of 1,043
==15880==    at 0x4C3017F: operator new(unsigned long) (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==15880==    by 0x19010902: stir::VoxelsOnCartesianGrid<float>::clone() const (in /home/sirfuser/devel/install/
python/sirf/_pystir.so)
==15880==    by 0x18DDE8EF: sirf::STIRImageData::clone_impl() const (in /home/sirfuser/devel/install/python/sir
f/_pystir.so)
==15880==    by 0x1C430876: sirf::NiftyResample<float>::forward(std::shared_ptr<sirf::ImageData>, std::shared_p
tr<sirf::ImageData const>) (in /home/sirfuser/devel/install/python/sirf/_pyreg.so)
==15880==    by 0x1C402B4D: cReg_NiftyResample_forward (in /home/sirfuser/devel/install/python/sirf/_pyreg.so)
==15880==    by 0x1C3F22F1: _wrap_cReg_NiftyResample_forward (in /home/sirfuser/devel/install/python/sirf/_pyre
g.so)

==15880== 2,824,284 (96 direct, 2,824,188 indirect) bytes in 1 blocks are definitely lost in loss record 1,041 
of 1,043
==15880==    at 0x4C3017F: operator new(unsigned long) (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==15880==    by 0x19010902: stir::VoxelsOnCartesianGrid<float>::clone() const (in /home/sirfuser/devel/install/
python/sirf/_pystir.so)
==15880==    by 0x18DDE8EF: sirf::STIRImageData::clone_impl() const (in /home/sirfuser/devel/install/python/sir
f/_pystir.so)
==15880==    by 0x18A0A957: cSIRF_clone (in /home/sirfuser/devel/install/python/sirf/_pysirf.so)
==15880==    by 0x18A036A1: _wrap_cSIRF_clone (in /home/sirfuser/devel/install/python/sirf/_pysirf.so)
paskino commented 4 years ago

sudo apt update?

it cannot reach security.ubuntu.com and apt update fails. For the sake of the exercise I just removed it from the apt sources.

KrisThielemans commented 4 years ago

what where you "valgrinding"? It'd be good to do this for the ctests. nice and isolated...

KrisThielemans commented 4 years ago

I cannot immediately see a leak on the C++ side of forward.

I do see that it creates a output_nifti_sptr which then never seems to be actually used (as the reg_resample call uses this->_output_image_forward_nifti_sptr), but that's an inefficiency, not a leak.

I also wonder why it saves the result in this->_output_image_sptr (it's already in this->_output_image_forward_nifti_sptr), but that's just another shared_ptr.

so my initial impression that indeed this is somewhere in the Python interface, but again, I'd feel better if we run valgrind on C++ test.

KrisThielemans commented 4 years ago

@paskino I don't think you're running the Debug version by the way. you should see line-numbers in the valgrind output

paskino commented 4 years ago

what where you "valgrinding"? It'd be good to do this for the ctests. nice and isolated...

I was valgringing valgrind --leak-check=full --show-leak-kinds=all python test_memory.py >& logfile_leak_debug_all

rijobro commented 4 years ago

I do see that it creates a output_nifti_sptr which then never seems to be actually used (as the reg_resample call uses this->_output_image_forward_nifti_sptr), but that's an inefficiency, not a leak.

@KrisThielemans , I create output_nifti_sptr so that I can check that the metadata of image that has been passed to the forward method matches the metadata of the image that was used for the set_up (this->_output_image_forward_nifti_sptr). Then, since the output image is only used for metadata, it then doesn't matter whether output_nifti_sptr or this->_output_image_forward_nifti_sptr gets passed to NiftyReg.

I also wonder why it saves the result in this->_output_image_sptr (it's already in this->_output_image_forward_nifti_sptr), but that's just another shared_ptr.

I had to do this to keep backwards compatibility of get_output (requires the most recent output to be stored in memory).

paskino commented 4 years ago

currently testing on sirf_resample with the following parameters

valgrind --leak-check=full --show-leak-kinds=all sirf_resample -algo niftyreg -ref ~/devel/buildVM/sources/SIRF/data/examples/PET/brain/emission.hv -flo ~/devel/buildVM/sources/SIRF/data/examples/PET/brain/emission.hv -eng_ref STIR -eng_flo STIR -adj >& memory_leak_output_adjoint.log

valgrind --leak-check=full --show-leak-kinds=all sirf_resample -algo niftyreg -ref ~/devel/buildVM/sources/SIRF/data/examples/PET/brain/emission.hv -flo ~/devel/buildVM/sources/SIRF/data/examples/PET/brain/emission.hv -eng_ref STIR -eng_flo STIR >& memory_leak_output_direct.log