Why is PlotPedestalAndNoise dying?

benkrikler commented 10 years ago

I'm trying to process an entire run, and after event 119 I get a bad_alloc exception. My modules file is:

[MODULES]
plot=PlotPedestalAndNoise
[plot]
export_sql=1

so the only possible source of this is the PlotPedestalAndNoise module.

Adding the suggestions from #175 confirms this and makes it clear the exception comes from entry 119.

The backtrace I get if I run in gdb is:

0x00000038e4e32925 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
64        return INLINE_SYSCALL (tgkill, 3, pid, selftid, sig);
Missing separate debuginfos, use: debuginfo-install freetype-2.3.11-14.el6_3.1.x86_64 libselinux-2.0.94-5.3.el6_4.1.x86_64 openssl-1.0.1e-16.el6_5.14.x86_64 pcre-7.8-6.el6.x86_64 sqlite-3.6.20-1.el6.x86_64
(gdb) bt
#0  0x00000038e4e32925 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1  0x00000038e4e34105 in abort () at abort.c:92
#2  0x00000038ebabea5d in __gnu_cxx::__verbose_terminate_handler () at ../../../../libstdc++-v3/libsupc++/vterminate.cc:93
#3  0x00000038ebabcbe6 in __cxxabiv1::__terminate (handler=<value optimized out>) at ../../../../libstdc++-v3/libsupc++/eh_terminate.cc:38
#4  0x00000038ebabcc13 in std::terminate () at ../../../../libstdc++-v3/libsupc++/eh_terminate.cc:48
#5  0x00000038ebabcd0e in __cxxabiv1::__cxa_throw (obj=0xf88f90, tinfo=<value optimized out>, dest=<value optimized out>)
    at ../../../../libstdc++-v3/libsupc++/eh_throw.cc:83
#6  0x00000038ebabd0fd in operator new (sz=6583419) at ../../../../libstdc++-v3/libsupc++/new_op.cc:58
#7  0x00000038ebabd1b9 in operator new[] (sz=<value optimized out>) at ../../../../libstdc++-v3/libsupc++/new_opv.cc:32
#8  0x00007ffff696b9f9 in TKey::TKey (this=0x10da6d0, obj=<value optimized out>, name=<value optimized out>, bufsize=2170,
    motherDir=<value optimized out>) at /vols/comet00/users/bek07/Alcap/AlcapDAQ/root/io/io/src/TKey.cxx:256
#9  0x00007ffff693d7ca in TFile::CreateKey (this=<value optimized out>, mother=0xfe29d0, obj=0xfb5c00,
    name=0xf4f3f0 "fPedestalVsNoiseHistogram_muSc", bufsize=2170) at /vols/comet00/users/bek07/Alcap/AlcapDAQ/root/io/io/src/TFile.cxx:949
#10 0x00007ffff6936454 in TDirectoryFile::WriteTObject (this=0xfe29d0, obj=0xfb5c00, name=<value optimized out>,
    option=<value optimized out>, bufsize=0) at /vols/comet00/users/bek07/Alcap/AlcapDAQ/root/io/io/src/TDirectoryFile.cxx:1818
#11 0x00007ffff78883ee in TObject::Write (this=0xfb5c00, name=0x0, option=<value optimized out>, bufsize=0)
    at /vols/comet00/users/bek07/Alcap/AlcapDAQ/root/core/base/src/TObject.cxx:781
#12 0x00007ffff69374b7 in TDirectoryFile::Write (this=0xfe29d0, opt=0, bufsize=0)
    at /vols/comet00/users/bek07/Alcap/AlcapDAQ/root/io/io/src/TDirectoryFile.cxx:1694
#13 0x00007ffff69374b7 in TDirectoryFile::Write (this=0xfbd920, opt=0, bufsize=0)
    at /vols/comet00/users/bek07/Alcap/AlcapDAQ/root/io/io/src/TDirectoryFile.cxx:1694
#14 0x00007ffff693f203 in TFile::Write (this=0xfbd920, opt=0, bufsiz=0)
    at /vols/comet00/users/bek07/Alcap/AlcapDAQ/root/io/io/src/TFile.cxx:2194
#15 0x0000000000424ba6 in main (argc=<value optimized out>, argv=<value optimized out>) at src/main.cpp:157

so I think the actual culprit is inlined somewhere, but I'm not sure where. As far as I can see there are only a few occurrences of new, and I don't think any of them are called before the crash.
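In the trace, __cxa_throw ends in std::terminate, which suggests nothing between operator new and main caught the exception. A minimal sketch of a guard that reports which step ran out of memory instead of aborting follows. RunGuarded and the label strings are hypothetical names, not part of AlcapDAQ or ROOT; the step passed in could be, for example, the file write in main.

```cpp
#include <new>
#include <ostream>
#include <sstream>
#include <string>

// Run one step and report an allocation failure instead of letting it
// propagate to std::terminate. Hypothetical helper, not AlcapDAQ code.
template <typename Step>
bool RunGuarded(Step&& step, const std::string& label, std::ostream& log) {
    try {
        step();
        return true;
    } catch (const std::bad_alloc& e) {
        // Name the failing step so the log shows where memory ran out.
        log << "bad_alloc during '" << label << "': " << e.what() << '\n';
        return false;
    }
}
```

Called as RunGuarded([&] { /* write the output file */ }, "file write", std::cerr), this only changes how the failure is reported; it does not free any memory.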

benkrikler commented 10 years ago

I've spent a long time now digging through valgrind output trying to get to the bottom of this.

I suspect that I'm running on a system with little memory available; indeed, ulimit suggests 3 GB. If I decrease the number of bins in PlotPedestalAndNoise, I'm able to analyse the entire run.
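To get a feel for how the bin count drives memory use, here is a rough sketch of the bin-content footprint of a 2D histogram. It assumes the histograms are ROOT TH2-style objects, which store one underflow and one overflow bin per axis. The function names, the bytes-per-bin value and the number of histograms are illustrative assumptions, not values taken from PlotPedestalAndNoise.

```cpp
#include <cstdint>

// Approximate size in bytes of the bin-content array of one 2D histogram.
// The +2 per axis accounts for the underflow and overflow bins.
// bytes_per_bin is an assumption: 4 for float contents, 8 for double.
std::uint64_t EstimateTH2Bytes(std::uint64_t nbins_x,
                               std::uint64_t nbins_y,
                               std::uint64_t bytes_per_bin) {
    return (nbins_x + 2) * (nbins_y + 2) * bytes_per_bin;
}

// Total for n_hists histograms with identical binning,
// e.g. one histogram per detector channel (hypothetical layout).
std::uint64_t EstimateTotalBytes(std::uint64_t nbins_x,
                                 std::uint64_t nbins_y,
                                 std::uint64_t bytes_per_bin,
                                 std::uint64_t n_hists) {
    return EstimateTH2Bytes(nbins_x, nbins_y, bytes_per_bin) * n_hists;
}
```

Because the footprint scales with the product of the two bin counts, halving the bins on both axes cuts it by roughly a factor of four. This ignores any temporary buffers ROOT allocates while writing.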

For what it's worth, the largest leak I could spot when running with just this module, and coming from our own code (as opposed to ROOT's), was 130 KB. So it really does seem that what we're doing is just very memory intensive.

It would be interesting to profile each module and check it all properly, but I'll leave that for another day and close this for now.

litchfld commented 10 years ago

If you suspect it's a resource shortage but aren't sure, try starting the run at event 100 or so. The failure point should then move forward by about 100 events. Also try starting after the problematic event.
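The reasoning behind this test can be sketched with a toy model. It assumes each processed event consumes a fixed amount of some limited resource, which is a simplification; both functions and their parameters are hypothetical.

```cpp
// Toy model 1: resource exhaustion. Every processed event costs
// cost_per_event units of a fixed budget, so the failure depends on how
// many events were processed, not on which event is being read.
// Returns the absolute index of the first failing event, or -1 if none.
long long FirstFailureFromExhaustion(long long start_event,
                                     long long n_events_total,
                                     long long budget,
                                     long long cost_per_event) {
    long long used = 0;
    for (long long ev = start_event; ev < n_events_total; ++ev) {
        used += cost_per_event;
        if (used > budget) return ev;
    }
    return -1;
}

// Toy model 2: one specific bad event. The failure stays at the same
// absolute index, and disappears if the run starts after it.
long long FirstFailureFromBadEvent(long long start_event,
                                   long long n_events_total,
                                   long long bad_event) {
    if (bad_event >= start_event && bad_event < n_events_total) return bad_event;
    return -1;
}
```

With a budget that fails at event 119 from a start at 0, the exhaustion model fails at 219 when started at 100, while the bad-event model still fails at 119. Comparing the two outcomes against the real job is what separates the hypotheses.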

benkrikler commented 10 years ago

I tried that on a previous occasion and saw that it had no impact, so I concluded then as well that it was a resource issue. I haven't tried it with this module though, so if it bites me again I'll take a look.

benkrikler commented 10 years ago

This module crashed again when running on the batch system for production (issue #166). The problem looks the same as this one: a bad_alloc is thrown with no other warning. I'm not sure how the batch system handles memory, or whether many processes were running simultaneously, but if they share memory then running fewer jobs at once could help.
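One way to find out what limit a batch job actually runs under would be to log it at start-up. The sketch below uses the POSIX getrlimit call with RLIMIT_AS (the address-space limit). Whether that is the limit the batch system enforces is an assumption, and FormatLimit and ReadAddressSpaceLimit are hypothetical helper names.

```cpp
#include <sys/resource.h>
#include <string>

// Render a limit in MiB, or "unlimited" when no cap is set.
std::string FormatLimit(rlim_t limit) {
    if (limit == RLIM_INFINITY) return "unlimited";
    return std::to_string(limit / (1024UL * 1024UL)) + " MiB";
}

// Read the soft and hard address-space limits of the current process.
// Returns false if the query fails.
bool ReadAddressSpaceLimit(rlim_t& soft, rlim_t& hard) {
    struct rlimit rl;
    if (getrlimit(RLIMIT_AS, &rl) != 0) return false;
    soft = rl.rlim_cur;  // limit currently enforced
    hard = rl.rlim_max;  // ceiling the soft limit may be raised to
    return true;
}
```

Printing FormatLimit(soft) at the top of each job would show whether the batch nodes impose a tighter per-process limit than the interactive machine. It says nothing about memory shared between jobs on the same node.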

alcap-org / AlcapDAQ

Why is PlotPedestalAndNoise dying? #176