cms-sw / genproductions

Generator fragments for MC production
https://twiki.cern.ch/twiki/bin/view/CMS/GitRepositoryForGenProduction

Large gridpack fails with non-zero status 137 #3131

Open GiacomoBoldrini opened 2 years ago

GiacomoBoldrini commented 2 years ago

Dear genproduction maintainers,

we've been trying for some months to generate gridpacks for two VBS processes with EFT contributions. The processes are: VBS OSWW in the fully leptonic final state [1] and VBS WV in the semileptonic final state [2]. As support for this thread we prepared some slides with all the tests we did, which can be found here. We are using cmsconnect and modified the central scripts only to download the correct UFO models here and to avoid memory-related compilation issues here.

While submitting the integrate step on Condor, we found that multiple jobs go into the hold state with hold code 13 and subcode 2 (condor_starter or shadow failed to send job), and eventually the computation stops:

INFO: ClusterId 12046135 was held with code 13, subcode 2. Releasing it. 
INFO: ClusterId 12046139 was held with code 13, subcode 2. Releasing it. 
INFO: ClusterId 12046143 was held with code 13, subcode 2. Releasing it. 
INFO: ClusterId 12046151 was held with code 13, subcode 2. Releasing it. 
INFO: ClusterId 12046040 was held with code 13, subcode 2. Releasing it. 
INFO: ClusterId 12034792 was held with code 13, subcode 2. Releasing it. 
INFO: ClusterId 12034924 was held with code 13, subcode 2. Releasing it. 
WARNING: ClusterId 12034926 with HoldReason: Error from glidein_9533_601358824@cms-h002.rcac.purdue.edu: Failed to execute '/data/02/tmp/execute/dir_9531/glide_epkWgb/condor_job_wrapper.sh' with arguments /data/02/tmp/execute/dir_9531/glide_epkWgb/execute/dir_36287/condor_exec.exe 0 2223.239 2223.241: (errno=2: 'No such file or directory') 
....
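
For reference, a sketch of the kind of condor_q query that can be used to inspect the held jobs (JobStatus == 5 means "held"; these are standard HTCondor job-ad attributes, and the ClusterId below is just one taken from the log above):

# List held jobs with their hold codes and reasons
condor_q -constraint 'JobStatus == 5' -af:h ClusterId ProcId HoldReasonCode HoldReasonSubCode HoldReason

# Dump the full classad of one held job to see where it matched (remote host, glidein info, ...)
condor_q -l 12034926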

Running the same cards locally, we see that the local generation also fails: survey.sh ends with non-zero status 137.

Generating gridpack with run name pilotrun
survey  pilotrun --accuracy=0.01 --points=2000 --iterations=8 --gridpack=.true.
INFO: compile directory
INFO: Using LHAPDF v6.2.1 interface for PDFs
compile Source Directory
Using random number seed offset = 21
INFO: Running Survey
Creating Jobs
Working on SubProcesses
INFO:     P1_qq_lvlqqqq
INFO:  Idle: 2080,  Running: 48,  Completed: 0 [ current time: 19h14 ]
rm: cannot remove 'results.dat': No such file or directory
ERROR DETECTED
WARNING: program /scratch/gboldrin/gp3/bin/MadGraph5_aMCatNLO/WmVjj_ewk_dim6/WmVjj_ewk_dim6_gridpack/work/processtmp/SubProcesses/survey.sh 0 89 90 launch ends with non zero status: 137. Stop all computation 

From a quick search on the internet, we found that this could point to a memory issue: exit status 137 corresponds to 128 + 9, i.e. the process was terminated by SIGKILL, which typically happens when the OS (or its OOM killer) stops a job that is requesting too much RAM.
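
As a quick check of the memory hypothesis (a sketch, not something we have automated), the kernel log should show the OOM kill, and a simple loop can record how much resident memory the survey processes are using:

# Check whether the kernel OOM killer terminated the job (may need root; 'journalctl -k' is an alternative)
dmesg -T | grep -iE 'out of memory|killed process'

# While the survey is running, log the heaviest processes every 30 s
while sleep 30; do
    date
    ps -o pid,rss,cmd --sort=-rss | head -n 5
done >> survey_mem.log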

Is this behaviour known? Maybe we are doing something wrong. Do you have any tests we can carry out in order to better understand this issue?

Thank you for your time. Best,

Giacomo


[1]

import model SMEFTsim_U35_MwScheme_UFO-cW_cHWB_cHDD_cHbox_cHW_cHl1_cHl3_cHq1_cHq3_cqq1_cqq11_cqq31_cqq3_cll_cll1_massless
define p = g u c d s b u~ c~ d~ s~ b~
define j = g u c d s b u~ c~ d~ s~ b~
define l+ = e+ mu+ ta+
define l- = e- mu- ta-
define vl = ve vm vt
define vl~ = ve~ vm~ vt~
generate p p > l+ vl l- vl~ j j SMHLOOP=0 QCD=0 NP=1
output WWjjTolnulnu_OS_ewk_dim6

[2]

import model SMEFTsim_U35_MwScheme_UFO-cW_cHWB_cqq1_cqq3_cqq11_cqq31_massless
define p = g u d s c b u~ d~ s~ c~ b~
define j = p
define l+ = e+ mu+ ta+
define l- = e- mu- ta-
define vl = ve vm vt
define vl~ = ve~ vm~ vt~
generate p p > l- vl~ j j j j NP=1 SMHLOOP=0 QCD=0
output WmVjj_ewk_dim6 -nojpeg

sansan9401 commented 2 years ago

Sorry that I didn't check this thread earlier. I saw some commits (https://github.com/UniMiBAnalyses/genproductions/commit/ba7149d54f217050bc17436e623c83f813a969c9, https://github.com/UniMiBAnalyses/genproductions/commit/ef4ee85b50b39eac5a7ced8cae9092573bce59e8) on your master branch. Did they solve the problem?

GiacomoBoldrini commented 2 years ago

Hi @sansan9401, the condor problem seems to come and go: sometimes we are able to run the integrate step perfectly, sometimes it crashes as above. We did not understand where the problem comes from, but it is probably related to which site the jobs land on (does that make sense? I am referring to glidein_9533_601358824@cms-h002.rcac.purdue.edu, where the institution might change from run to run).
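
As a sketch of how this could be checked, condor_history can report where each job actually ran (LastRemoteHost is a standard attribute; the glidein/site attributes are pool-specific, so dumping the full ad shows which ones are available on cmsconnect):

# Where did a given cluster's jobs run, and how did they exit?
condor_history 12034926 -af:h ClusterId ProcId LastRemoteHost ExitCode

# Full classad, to find pool-specific glidein/site attributes
condor_history -l 12034926 | less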

For the local submission we have no solution. However, we found that even when the integrate step is successful, the computation of the second reweight matrix element is killed by the OS due to memory pressure. We reported the issue here (where, by the way, you can see the integrate step run successfully). So the local run of the integrate step is probably subject to the same problems as the reweight step.
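
On the condor side, a sketch of how we could compare the memory a job actually used with what it requested (RequestMemory and MemoryUsage are standard job-ad attributes, in MB; if the usage approaches the request, the request_memory line in the submit description can be raised):

# Requested vs. measured memory for the jobs of one cluster taken from the log above
condor_history 12046135 -af:h ClusterId ProcId RequestMemory MemoryUsage ExitCode

# In the submit description, e.g.:
# request_memory = 4000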