Memory Leaks are causing jobs to fail on the grid [Urgent]

MihaMuskinja commented 7 years ago

Hi @kratsg , @johnda102 , all,

recently we launched our ntuple production with full experimental systematics and most of the MC jobs failed due to excess memory usage. An example from a memory monitor can be seen here:

http://aipanda054.cern.ch//media/filebrowser/b4bcaa93-c53b-4c9b-b408-aadb4b4fac08/panda/tarball_PandaJob_3259570805_ANALY_IEPSAS-Kosice/memory_monitor_output.txt

memory usage goes up to 8GB and the job is killed afterwards of course. Now this could be a problem either in xAH or in CP tools themselves.

I ran Valgrind locally and noticed some memory leaks. The full Valgrind logfile is available here: https://cernbox.cern.ch/index.php/s/druXnzXtHed1gVx . Could please someone with more C++ knowledge than me go through this and maybe understand what is the problem?

ghost commented 7 years ago

Is there also a leak when you run without systematics ? How many different output trees are you expecting ?

On Mon, Mar 6, 2017 at 9:09 AM, MihaMuskinja notifications@github.com wrote:

Hi @kratsg https://github.com/kratsg , @johnda102 https://github.com/johnda102 , all,

recently we launched our ntuple production with full experimental systematics and most of the MC jobs failed due to excess memory usage. An example from a memory monitor can be seen here:

http://aipanda054.cern.ch//media/filebrowser/b4bcaa93- c53b-4c9b-b408-aadb4b4fac08/panda/tarballPandaJob 3259570805_ANALY_IEPSAS-Kosice/memory_monitor_output.txt

memory usage goes up to 8GB and the job is killed afterwards of course. Now this could be a problem either in xAH or in CP tools themselves.

I ran Valgrind locally and noticed some memory leaks. The full Valgrind logfile is available here: https://cernbox.cern.ch/index. php/s/druXnzXtHed1gVx . Could please someone with more C++ knowledge than me go through this and maybe understand what is the problem?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/UCATLAS/xAODAnaHelpers/issues/821, or mute the thread https://github.com/notifications/unsubscribe-auth/AGx2ChFscuJWK5NYuHIWPYrJPGzsRN1Jks5rjBOCgaJpZM4MUMSs .

-- University of Chicago http://hep.uchicago.edu/~johnda

MihaMuskinja commented 7 years ago

Hi, we have only nominal + 12 sys trees. I ran a local job without systematics and will post the Valgrind log file to this comment when finished.

EDIT: valgrind without systematics: https://cernbox.cern.ch/index.php/s/dICc1vUu141nGj3

MihaMuskinja commented 7 years ago

The largest sources of mem leaks seem to be:

https://github.com/UCATLAS/xAODAnaHelpers/blob/master/Root/MuonEfficiencyCorrector.cxx#L233 https://github.com/UCATLAS/xAODAnaHelpers/blob/master/Root/MuonEfficiencyCorrector.cxx#L373 https://github.com/UCATLAS/xAODAnaHelpers/blob/master/Root/MuonEfficiencyCorrector.cxx#L428

https://github.com/UCATLAS/xAODAnaHelpers/blob/master/Root/MuonEfficiencyCorrector.cxx#L275 https://github.com/UCATLAS/xAODAnaHelpers/blob/master/Root/MuonEfficiencyCorrector.cxx#L425

https://github.com/UCATLAS/xAODAnaHelpers/blob/master/Root/ElectronEfficiencyCorrector.cxx#L640 https://github.com/UCATLAS/xAODAnaHelpers/blob/master/Root/ElectronEfficiencyCorrector.cxx#L641 https://github.com/UCATLAS/xAODAnaHelpers/blob/master/Root/ElectronEfficiencyCorrector.cxx#L642 https://github.com/UCATLAS/xAODAnaHelpers/blob/master/Root/ElectronEfficiencyCorrector.cxx#L643 https://github.com/UCATLAS/xAODAnaHelpers/blob/master/Root/ElectronEfficiencyCorrector.cxx#L644

Yes, these pointers are never deleted.. I guess this can start taking up memory..

std::vector< std::string > sysVariationNamesPID = new std::vector< std::string >; std::vector< std::string > sysVariationNamesReco = new std::vector< std::string >; std::vector< std::string > sysVariationNamesIso = new std::vector< std::string >; std::vector< std::string > sysVariationNamesTrig = new std::vector< std::string >; std::vector< std::string >* sysVariationNamesTrigMCEff = new std::vector< std::string >;

are we not cleaning something up properly?

ghost commented 7 years ago

I guess leaks in initialize are not so critical as they do not scale with Nevents.

On Mon, Mar 6, 2017 at 9:33 AM, MihaMuskinja notifications@github.com wrote:

The largest sources of mem leaks seem to be:

https://github.com/UCATLAS/xAODAnaHelpers/blob/master/ Root/MuonEfficiencyCorrector.cxx#L233 https://github.com/UCATLAS/xAODAnaHelpers/blob/master/ Root/MuonEfficiencyCorrector.cxx#L373 https://github.com/UCATLAS/xAODAnaHelpers/blob/master/ Root/MuonEfficiencyCorrector.cxx#L428

https://github.com/UCATLAS/xAODAnaHelpers/blob/master/ Root/MuonEfficiencyCorrector.cxx#L275 https://github.com/UCATLAS/xAODAnaHelpers/blob/master/ Root/MuonEfficiencyCorrector.cxx#L425

https://github.com/UCATLAS/xAODAnaHelpers/blob/master/Root/ ElectronEfficiencyCorrector.cxx#L640 https://github.com/UCATLAS/xAODAnaHelpers/blob/master/Root/ ElectronEfficiencyCorrector.cxx#L641 https://github.com/UCATLAS/xAODAnaHelpers/blob/master/Root/ ElectronEfficiencyCorrector.cxx#L642 https://github.com/UCATLAS/xAODAnaHelpers/blob/master/Root/ ElectronEfficiencyCorrector.cxx#L643 https://github.com/UCATLAS/xAODAnaHelpers/blob/master/Root/ ElectronEfficiencyCorrector.cxx#L644

are we not cleaning something up properly?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/UCATLAS/xAODAnaHelpers/issues/821#issuecomment-284412047, or mute the thread https://github.com/notifications/unsubscribe-auth/AGx2ChAzzSF6tvkU6T_Iuu9lTzo0J7DOks5rjBkqgaJpZM4MUMSs .

-- University of Chicago http://hep.uchicago.edu/~johnda

MihaMuskinja commented 7 years ago

@johnda102 ,

yes you are correct, initialize is not the problem... However, ElectronEfficiencyCorrector leak occurs in executeSF, which I guess happens quite often? I think we are creating a bunch of strings in each executeSF and never deleting them.

kratsg commented 7 years ago

@MihaMuskinja

std::vector< std::string >* sysVariationNamesPID = new std::vector< std::string >;
std::vector< std::string >* sysVariationNamesReco = new std::vector< std::string >;
std::vector< std::string >* sysVariationNamesIso = new std::vector< std::string >;
std::vector< std::string >* sysVariationNamesTrig = new std::vector< std::string >;
std::vector< std::string >* sysVariationNamesTrigMCEff = new std::vector< std::string >;

those cannot be leaking. They've all gotten recorded to the TStore which handles memory deletion for us. If the ElectronEfficiencyCorrector is leaking, that might not be something we can fix and needs to be raised with pathelp.

MihaMuskinja commented 7 years ago

Also, there is a large mem leak from HelpTreeBase::FillMuons while it is not there for HelpTreeBase::FillElectrons.

MihaMuskinja commented 7 years ago

Hi @kratsg ,

actually, they are only recorded to the TStore for the nominal configuration and not for systematics

if ( countSyst == 0 ) { RETURN_CHECK( "ElectronEfficiencyCorrector::executeSF()", m_store->record( sysVariationNamesPID,  m_outputSystNamesPID ), "Failed to record vector of systematic names PID"  ); }

MihaMuskinja commented 7 years ago

Hi all,

running xAH without electron calib. systematics greatly reduced mem leak from executeSF. I really believe that one of the problems is not deleting std::vector< std::string >* sysVariationNames for anything else than nominal sys.

Log file: https://cernbox.cern.ch/index.php/s/dICc1vUu141nGj3

kratsg commented 7 years ago

We should just record all of these to the TStore regardless of systematics or not.. Let me look at it.

MihaMuskinja commented 7 years ago

I think that the idea is to record them only once since they are the same for nominal,sys1,sys2,.. and that would presumably save some disk space.

ghost commented 7 years ago

Seems like we should only create these containers if countSys is 0

On Mon, Mar 6, 2017 at 10:18 AM, Giordon Stark notifications@github.com wrote:

We should just record all of these to the TStore regardless of systematics or not.. Let me look at it.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/UCATLAS/xAODAnaHelpers/issues/821#issuecomment-284425542, or mute the thread https://github.com/notifications/unsubscribe-auth/AGx2CrE3K5TT7qL-GT7AB-RtOeEFMCX1ks5rjCPBgaJpZM4MUMSs .

-- University of Chicago http://hep.uchicago.edu/~johnda

kratsg commented 7 years ago

Seems like we should only create these containers if countSys is 0

I think this is the best way to do it. I'll make changes now.

MihaMuskinja commented 7 years ago

OK, thanks @kratsg , @johnda102 . I will rerun one of the jobs that failed and post the results.

Could we also fix another leak? I think these in https://github.com/UCATLAS/xAODAnaHelpers/blob/master/Root/HelpTreeBase.cxx#L332 are quite obvious:

std::vector< std::string >* tmp_reco_sys = new std::vector< std::string >;
std::vector< std::string >* tmp_iso_sys = new std::vector< std::string >;
...

this is never deleted and now I think there is nothing which would take care of them for us downstream.

kratsg commented 7 years ago

@MihaMuskinja feel free to look over the changes. The last two you've mentioned

std::vector< std::string > tmp_reco_sys = new std::vector< std::string >; std::vector< std::string > tmp_iso_sys = new std::vector< std::string >; ...

don't need to be deleted. There's no reason to allocate memory for either of them, since they're just holding pointers to objects that are getting retrieved.

MihaMuskinja commented 7 years ago

Hi @kratsg ,

looks good, also the same thing in MuonEffCorrector. Although, I think more than two vectors needed to be fixed in HelperTreeBase.cxx:

std::vector< std::string > tmp_reco_sys = new std::vector< std::string >; std::vector< std::string > tmp_iso_sys = new std::vector< std::string >; std::vector< std::string > tmp_trig_sys = new std::vector< std::string >; std::vector< std::string > tmp_ttva_sys = new std::vector< std::string >;

kratsg commented 7 years ago

I fixed it in MuonEffCorrector already. I'll fix the other ones.

MihaMuskinja commented 7 years ago

Hi @kratsg , @johnda102 ,

thank you for prompt responses and for fixing the leaks, but unfortunately, the jobs are still failing due to memory issues, e.g.: http://bigpanda.cern.ch/task/10901262/

UCATLAS / xAODAnaHelpers

Memory Leaks are causing jobs to fail on the grid [Urgent] #821