rijobro opened this issue 4 years ago
Also, how should I interpret the end of the valgrind output, with roughly 20 MB still reachable?
==10926== LEAK SUMMARY:
==10926==    definitely lost: 1,608 bytes in 12 blocks
==10926==    indirectly lost: 16,460 bytes in 41 blocks
==10926==      possibly lost: 895,780 bytes in 1,615 blocks
==10926==    still reachable: 20,810,150 bytes in 24,290 blocks
==10926==                       of which reachable via heuristic:
==10926==                         stdstring          : 5,374 bytes in 107 blocks
==10926==                         multipleinheritance: 4,448 bytes in 5 blocks
==10926==         suppressed: 0 bytes in 0 blocks
Lastly, it's also possible that this is a CIL issue, as PET_MCIR.py uses both SIRF and CIL. @paskino
The end of the valgrind output for osem_reconstruction.py (an attempt to test a smaller component) looks similar, but is still hard to decipher:
==12421== LEAK SUMMARY:
==12421==    definitely lost: 976 bytes in 5 blocks
==12421==    indirectly lost: 20,172 bytes in 46 blocks
==12421==      possibly lost: 324,318 bytes in 223 blocks
==12421==    still reachable: 13,087,070 bytes in 6,847 blocks
==12421==         suppressed: 0 bytes in 0 blocks
==12421==
==12421== For counts of detected and suppressed errors, rerun with: -v
==12421== Use --track-origins=yes to see where uninitialised values come from
==12421== ERROR SUMMARY: 4634 errors from 186 contexts (suppressed: 0 from 0)
I disabled numba on @paskino's recommendation; valgrind output attached. Less data is "possibly lost", and there are no mentions of llvm this time, but I wonder whether those were just warnings rather than actual areas of leaking memory? Honestly, I've no idea...
Most of these are from Python itself. Maybe there would be fewer of them with a more recent Python (but maybe not). Searching for sirf, I found a bug which I believe is in the OpenMP library:
==14176== 5,200 bytes in 13 blocks are possibly lost in loss record 1,083 of 1,236
==14176== at 0x4C31B25: calloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==14176== by 0x40134A6: allocate_dtv (dl-tls.c:286)
==14176== by 0x40134A6: _dl_allocate_tls (dl-tls.c:530)
==14176== by 0x5235227: allocate_stack (allocatestack.c:627)
==14176== by 0x5235227: pthread_create@@GLIBC_2.2.5 (pthread_create.c:644)
==14176== by 0x329F9F3F: ??? (in /usr/lib/x86_64-linux-gnu/libgomp.so.1.0.0)
==14176== by 0x329F0EB9: GOMP_parallel (in /usr/lib/x86_64-linux-gnu/libgomp.so.1.0.0)
==14176== by 0x3489CD75: stir::BackProjectorByBin::set_up(std::shared_ptr<stir::ProjDataInfo> const&, std::shared_ptr<stir::DiscretisedDensity<3, float> > const&) (BackProjectorByBin.cxx:89)
==14176== by 0x348E5111: stir::BackProjectorByBinUsingProjMatrixByBin::set_up(std::shared_ptr<stir::ProjDataInfo> const&, std::shared_ptr<stir::DiscretisedDensity<3, float> > const&) (BackProjectorByBinUsingProjMatrixByBin.cxx:111)
==14176== by 0x348E61F1: stir::ProjectorByBinPair::set_up(std::shared_ptr<stir::ProjDataInfo> const&, std::shared_ptr<stir::DiscretisedDensity<3, float> > const&) (ProjectorByBinPair.cxx:56)
==14176== by 0x348E6888: stir::ProjectorByBinPairUsingProjMatrixByBin::set_up(std::shared_ptr<stir::ProjDataInfo> const&, std::shared_ptr<stir::DiscretisedDensity<3, float> > const&) (ProjectorByBinPairUsingProjMatrixByBin.cxx:96)
==14176== by 0x34842A52: sirf::PETAcquisitionModel::set_up(std::shared_ptr<sirf::PETAcquisitionData>, std::shared_ptr<sirf::STIRImageData>) (stir_x.cpp:527)
==14176== by 0x34824783: sirf::PETAcquisitionModelUsingMatrix::set_up(std::shared_ptr<sirf::PETAcquisitionData>, std::shared_ptr<sirf::STIRImageData>) (stir_x.h:463)
==14176== by 0x34813BAB: cSTIR_setupAcquisitionModel (cstir.cpp:463)
There is apparently a small leak in stir::VectorWithOffset, although that's rather weird, as it should then produce tons of them. Maybe a corner case: https://github.com/UCL/STIR/issues/467
Line 3718 in your log file:
  File "/home/rich/Documents/Code/SIRF-SuperBuild/Install/python/ccpi/optimisation/operators/Operator.py", line 482, in calculate_norm
    return LinearOperator.calculate_norm(self, **kwargs)
TypeError: unbound method calculate_norm() must be called with LinearOperator instance as first argument (got CompositionOperator instance instead)
This is followed by some "invalid reads", which are worrying; they might have to do with exception handling in STIR.
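The TypeError above looks like a Python 2 artefact rather than a leak. A minimal sketch of what goes wrong, with toy stand-in classes (these are not the real CIL implementations):

```python
# Under Python 2, a method looked up on a class is an "unbound method" that
# type-checks its first argument; the call fails if CompositionOperator does
# not derive from LinearOperator. Under Python 3, class attributes are plain
# functions with no instance-type check, so the same call succeeds.

class LinearOperator(object):
    def calculate_norm(self, **kwargs):
        return 1.0  # placeholder for the real power-method computation

class CompositionOperator(object):  # note: NOT a subclass of LinearOperator
    def calculate_norm(self, **kwargs):
        # Python 2: TypeError: unbound method calculate_norm() must be
        #           called with LinearOperator instance as first argument
        # Python 3: runs fine and returns the base-class result
        return LinearOperator.calculate_norm(self, **kwargs)

print(CompositionOperator().calculate_norm())  # 1.0 under Python 3
```

So either CompositionOperator needs to inherit from LinearOperator, or the call should go through an instance rather than the class.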
There are a few small leaks in SIRF at L5670 and below (@evgueni-ovtchinnikov, have a look at the file, as I didn't paste everything here):
==14176== 24 bytes in 1 blocks are indirectly lost in loss record 106 of 1,236
==14176== at 0x4C3017F: operator new(unsigned long) (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==14176== by 0x3481FD18: __shared_count<sirf::PETAcquisitionModelUsingMatrix*> (shared_ptr_base.h:584)
==14176== by 0x3481FD18: __shared_count<sirf::PETAcquisitionModelUsingMatrix*> (shared_ptr_base.h:595)
==14176== by 0x3481FD18: __shared_ptr<sirf::PETAcquisitionModelUsingMatrix> (shared_ptr_base.h:1079)
==14176== by 0x3481FD18: shared_ptr<sirf::PETAcquisitionModelUsingMatrix> (shared_ptr.h:129)
==14176== by 0x3481FD18: cSTIR_newObject (cstir.cpp:87)
==14176== by 0x347F0879: _wrap_cSTIR_newObject (pystirPYTHON_wrap.cxx:3715)
There is one in SIRF's setVerbosity:
==14176== 32 bytes in 1 blocks are definitely lost in loss record 214 of 1,236
==14176== at 0x4C3017F: operator new(unsigned long) (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==14176== by 0x3480E7B2: cSTIR_setVerbosity (cstir.cpp:71)
==14176== by 0x347EFE37: _wrap_cSTIR_setVerbosity (pystirPYTHON_wrap.cxx:3692)
These recur a few times, of course. They don't seem dramatic, though, so I'm not sure whether fixing them will help, but I did give up about half-way through the log file. Fixing these small leaks would still be good, and would also clean up the log.
I tried simplifying as much as possible to find the leak, so I removed the motion component and simply do a PET reconstruction with PDHG (no corrections with attenuation, randoms, norm, etc.).
This still caused my Mac to run out of memory, so the leak is present in this code too. I will run it through valgrind (on Linux, as it isn't available on macOS).
The bulk of the code is below. The main objects that could be at fault: AcquisitionData, sirf.STIR.ImageData, AcquisitionModelUsingRayTracingMatrix, KullbackLeibler, BlockFunction, BlockOperator, PDHG, NiftiImageData.
Question for the CCPi guys (@paskino, @gfardell, @epapoutsellis): can I simplify the code further? Do I need BlockFunction and BlockOperator if I only have one acquisition model?
sino = pet.AcquisitionData(sino_file)
sino = make_sino_positive(sino)
image = sino.create_uniform_image(1.0, nxny)

print("Setting up acquisition model...")
acq_model = pet.AcquisitionModelUsingRayTracingMatrix()
acq_model.set_up(sino, image)

print("Setting up reconstructor...")
# Configure the PDHG algorithm
kl = KullbackLeibler(b=sino, eta=(sino * 0 + 1e-5))
f = BlockFunction(kl)
K = BlockOperator(acq_model)
normK = K.norm(iterations=10)
# normK = LinearOperator.PowerMethod(K, iterations=10)[0]
# default values (overridden below):
# sigma = 1 / normK
# tau = 1 / normK
sigma = 0.001
tau = 1 / (sigma * normK ** 2)
print("Norm of the BlockOperator ", normK)

# No regularisation, only positivity constraints
G = IndicatorBox(lower=0)

print("Creating reconstructor...")
pdhg = PDHG(f=f, g=G, operator=K, sigma=sigma, tau=tau,
            max_iteration=1000,
            update_objective_interval=1)

for i in range(1, num_iters + 1):
    print("Running iteration " + str(i) + "...")
    pdhg.run(1, verbose=True)
    reg.NiftiImageData(pdhg.get_output()).write(outp_prefix + "_iters" + str(i))
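For reference, here is a rough sketch of the power iteration that K.norm(iterations=10) presumably performs, with NumPy standing in for the CIL operator machinery (A is an arbitrary test matrix, not SIRF data). Note that the choice tau = 1/(sigma * normK**2) keeps the usual PDHG step-size condition sigma * tau * ||K||^2 <= 1 satisfied:

```python
import numpy as np

def power_method_norm(A, iterations=10, seed=0):
    """Estimate ||A|| (largest singular value) by power iteration on A^T A."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(A.shape[1])
    for _ in range(iterations):
        x = A.T @ (A @ x)        # apply K^T K
        x /= np.linalg.norm(x)   # renormalise to avoid overflow
    # Rayleigh quotient x^T (A^T A) x converges to sigma_max(A)^2
    return float(np.sqrt(x @ (A.T @ (A @ x))))

A = np.diag([3.0, 2.0, 1.0])
print(power_method_norm(A, iterations=50))  # converges towards 3.0
```

With only 10 iterations the estimate can undershoot the true norm, which would make tau slightly too large; that affects convergence, not memory, but it is worth keeping in mind when interpreting the reconstruction behaviour.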
There is at least one memory leak in examples/Python/PETMR/PET_MCIR.py.
I ran it through valgrind. The output is attached; I can't make head or tail of it.
I also tried memory profiler and the output is more concise. Their example shows memory increasing during allocation, and then decreasing when objects go out of scope/destructors are called. My output doesn't show any decrease in memory. I wonder if it means that everything is leaking memory (please don't be this!) or that the profiler isn't smart enough to notice objects are getting deleted on the C-level. Output also attached.
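One way to narrow this down is to cross-check with tracemalloc, which snapshots Python-level heap growth per iteration. It cannot see native (C++) allocations, so flat numbers here combined with a growing RSS would point at the SIRF/STIR C++ layer rather than Python. This is only a sketch; run_one_iteration is a hypothetical stand-in for pdhg.run(1):

```python
import tracemalloc

def run_one_iteration(state):
    # toy workload that deliberately retains memory, mimicking a leak
    state.append(bytearray(10_000))

tracemalloc.start()
state, deltas = [], []
baseline = tracemalloc.take_snapshot()
for i in range(3):
    run_one_iteration(state)
    snapshot = tracemalloc.take_snapshot()
    # net Python-level bytes allocated since the baseline and still live
    delta = sum(s.size_diff for s in snapshot.compare_to(baseline, "lineno"))
    deltas.append(delta)
    print("iteration %d: net Python allocations since start: %d bytes" % (i + 1, delta))
```

If the per-iteration deltas stay flat while the process RSS keeps climbing, the leak is almost certainly on the C++ side, which would match the valgrind findings above.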
Any ideas? I suppose one thing I need to do is break the code into small chunks and verify each bit, but this will take a long time...
Log files
valgrind_output.txt memprofiler_output.txt