Is there also a leak when you run without systematics? How many different output trees are you expecting?
On Mon, Mar 6, 2017 at 9:09 AM, MihaMuskinja notifications@github.com wrote:
Hi @kratsg, @johnda102, all,
recently we launched our ntuple production with full experimental systematics, and most of the MC jobs failed due to excess memory usage. An example from a memory monitor can be seen here:
http://aipanda054.cern.ch//media/filebrowser/b4bcaa93-c53b-4c9b-b408-aadb4b4fac08/panda/tarball_PandaJob_3259570805_ANALY_IEPSAS-Kosice/memory_monitor_output.txt
Memory usage goes up to 8 GB and the job is killed, of course. This could be a problem either in xAH or in the CP tools themselves.
I ran Valgrind locally and noticed some memory leaks. The full Valgrind log file is available here: https://cernbox.cern.ch/index.php/s/druXnzXtHed1gVx . Could someone with more C++ knowledge than me please go through it and try to understand what the problem is?
Hi, we have only the nominal tree plus 12 systematics trees. I ran a local job without systematics and will post the Valgrind log file in this comment when it finishes.
EDIT: valgrind without systematics: https://cernbox.cern.ch/index.php/s/dICc1vUu141nGj3
The largest sources of memory leaks seem to be:
https://github.com/UCATLAS/xAODAnaHelpers/blob/master/Root/MuonEfficiencyCorrector.cxx#L233
https://github.com/UCATLAS/xAODAnaHelpers/blob/master/Root/MuonEfficiencyCorrector.cxx#L373
https://github.com/UCATLAS/xAODAnaHelpers/blob/master/Root/MuonEfficiencyCorrector.cxx#L428
https://github.com/UCATLAS/xAODAnaHelpers/blob/master/Root/MuonEfficiencyCorrector.cxx#L275
https://github.com/UCATLAS/xAODAnaHelpers/blob/master/Root/MuonEfficiencyCorrector.cxx#L425
https://github.com/UCATLAS/xAODAnaHelpers/blob/master/Root/ElectronEfficiencyCorrector.cxx#L640
https://github.com/UCATLAS/xAODAnaHelpers/blob/master/Root/ElectronEfficiencyCorrector.cxx#L641
https://github.com/UCATLAS/xAODAnaHelpers/blob/master/Root/ElectronEfficiencyCorrector.cxx#L642
https://github.com/UCATLAS/xAODAnaHelpers/blob/master/Root/ElectronEfficiencyCorrector.cxx#L643
https://github.com/UCATLAS/xAODAnaHelpers/blob/master/Root/ElectronEfficiencyCorrector.cxx#L644
Yes, these pointers are never deleted... I guess this can start taking up memory:
std::vector< std::string >* sysVariationNamesPID = new std::vector< std::string >;
std::vector< std::string >* sysVariationNamesReco = new std::vector< std::string >;
std::vector< std::string >* sysVariationNamesIso = new std::vector< std::string >;
std::vector< std::string >* sysVariationNamesTrig = new std::vector< std::string >;
std::vector< std::string >* sysVariationNamesTrigMCEff = new std::vector< std::string >;
Are we not cleaning something up properly?
I guess leaks in initialize are not so critical, as they do not scale with the number of events.
@johnda102,
yes, you are correct: initialize is not the problem. However, the ElectronEfficiencyCorrector leak occurs in executeSF, which I guess is called quite often. I think we are creating a bunch of strings in each executeSF call and never deleting them.
@MihaMuskinja
std::vector< std::string >* sysVariationNamesPID = new std::vector< std::string >;
std::vector< std::string >* sysVariationNamesReco = new std::vector< std::string >;
std::vector< std::string >* sysVariationNamesIso = new std::vector< std::string >;
std::vector< std::string >* sysVariationNamesTrig = new std::vector< std::string >;
std::vector< std::string >* sysVariationNamesTrigMCEff = new std::vector< std::string >;
those cannot be leaking. They've all been recorded to the TStore, which handles memory deletion for us. If the ElectronEfficiencyCorrector itself is leaking, that might not be something we can fix and would need to be raised with pathelp.
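For reference, the ownership model I'm describing looks roughly like this (a minimal sketch; the key and the filled value are made up for illustration):
// Minimal sketch of the TStore ownership pattern (key name hypothetical).
// Once record() succeeds, the TStore owns the vector and deletes it during
// its end-of-event cleanup, so no explicit delete is needed on our side.
std::vector< std::string >* sysNames = new std::vector< std::string >;
sysNames->push_back( "EL_EFF_ID__1up" ); // example systematic name
RETURN_CHECK( "Example::executeSF()",
              m_store->record( sysNames, "ExampleSystNames" ),
              "Failed to record vector of systematic names" );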
Also, there is a large memory leak from HelpTreeBase::FillMuons, while there is none for HelpTreeBase::FillElectrons.
Hi @kratsg,
actually, they are only recorded to the TStore for the nominal configuration and not for the systematics:
if ( countSyst == 0 ) {
  RETURN_CHECK( "ElectronEfficiencyCorrector::executeSF()",
                m_store->record( sysVariationNamesPID, m_outputSystNamesPID ),
                "Failed to record vector of systematic names PID" );
}
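So, schematically, every call does something like this (not the literal xAH code, just the pattern I believe is happening):
// executeSF() runs once per systematic per event; the vector below is
// allocated unconditionally, but only handed to the TStore (which would
// own and delete it) inside the countSyst == 0 guard shown above.
std::vector< std::string >* sysVariationNamesPID = new std::vector< std::string >;
// ... fill sysVariationNamesPID for the current systematic ...
// For countSyst != 0 the vector is never recorded, so nothing owns it
// and it leaks on every call.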
Hi all,
running xAH without electron calibration systematics greatly reduced the memory leak from executeSF. I really believe one of the problems is that the std::vector< std::string >* sysVariationNames vectors are never deleted for anything other than the nominal systematic.
Log file: https://cernbox.cern.ch/index.php/s/dICc1vUu141nGj3
We should just record all of these to the TStore, regardless of whether we are running systematics or not. Let me look at it.
I think the idea is to record them only once, since they are the same for nominal, sys1, sys2, ..., and that would presumably save some disk space.
Seems like we should only create these containers if countSyst is 0.
I think this is the best way to do it. I'll make the changes now.
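Roughly, the change will look like this (a sketch of the intended pattern, not the final diff):
// Sketch of the intended fix: only allocate the name vectors on the nominal
// pass, so that every allocation ends up owned by the TStore.
std::vector< std::string >* sysVariationNamesPID = nullptr;
if ( countSyst == 0 ) { sysVariationNamesPID = new std::vector< std::string >; }
// ... per-systematic work; only fill the vector when it exists ...
if ( countSyst == 0 ) {
  RETURN_CHECK( "ElectronEfficiencyCorrector::executeSF()",
                m_store->record( sysVariationNamesPID, m_outputSystNamesPID ),
                "Failed to record vector of systematic names PID" );
}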
OK, thanks @kratsg, @johnda102. I will rerun one of the jobs that failed and post the results.
Could we also fix another leak? I think these in https://github.com/UCATLAS/xAODAnaHelpers/blob/master/Root/HelpTreeBase.cxx#L332 are quite obvious:
std::vector< std::string >* tmp_reco_sys = new std::vector< std::string >;
std::vector< std::string >* tmp_iso_sys = new std::vector< std::string >;
...
These are never deleted, and I don't think anything downstream takes care of them for us.
@MihaMuskinja feel free to look over the changes. The last two you've mentioned
std::vector< std::string >* tmp_reco_sys = new std::vector< std::string >;
std::vector< std::string >* tmp_iso_sys = new std::vector< std::string >;
...
don't need to be deleted. There's no reason to allocate memory for either of them, since they're just holding pointers to objects that are getting retrieved.
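In other words, something like this should be enough (a sketch; the store handle and key are hypothetical): declare a null pointer and let retrieve() point it at the vector the TStore already owns.
// Sketch of the cleanup (store handle and key hypothetical): nothing is
// allocated here, so there is nothing to delete; retrieve() fills the pointer.
std::vector< std::string >* tmp_reco_sys = nullptr;
RETURN_CHECK( "HelpTreeBase::FillMuons()",
              m_store->retrieve( tmp_reco_sys, "MuonEfficiencyCorrector_RecoSyst" ),
              "Failed to retrieve vector of reco systematic names" );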
Hi @kratsg, @johnda102,
looks good; the same fix applies in MuonEfficiencyCorrector. Though I think more than two vectors need to be fixed in HelpTreeBase.cxx:
std::vector< std::string >* tmp_reco_sys = new std::vector< std::string >;
std::vector< std::string >* tmp_iso_sys = new std::vector< std::string >;
std::vector< std::string >* tmp_trig_sys = new std::vector< std::string >;
std::vector< std::string >* tmp_ttva_sys = new std::vector< std::string >;
I fixed it in MuonEfficiencyCorrector already. I'll fix the other ones.
Hi @kratsg, @johnda102,
thank you for the prompt responses and for fixing the leaks, but unfortunately the jobs are still failing due to memory issues, e.g. http://bigpanda.cern.ch/task/10901262/