Open · GiacomoBoldrini opened 2 years ago
Sorry, I didn't check this thread earlier. I saw some commits (https://github.com/UniMiBAnalyses/genproductions/commit/ba7149d54f217050bc17436e623c83f813a969c9, https://github.com/UniMiBAnalyses/genproductions/commit/ef4ee85b50b39eac5a7ced8cae9092573bce59e8) on your master branch. Did they solve the problem?
Hi @sansan9401, the condor problem seems to come and go: sometimes the integrate step runs perfectly, sometimes it crashes as above. We have not understood where the problem comes from, but it probably depends on which site the jobs land on (does that make sense? I am referring to glidein_9533_601358824@cms-h002.rcac.purdue.edu, where the institution may change from run to run).
For the local submission we have no solution. However, we found that even when the integrate step succeeds, the computation of the second reweight matrix element is killed by the OS due to memory pressure. We reported the issue here (where, by the way, you can see the integrate step run successfully). So the local run of the integrate step is probably subject to the same problem as the reweight step.
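One way to check this locally is to watch the resident memory of the step while it runs. Below is a minimal sketch (not part of the genproductions scripts; the wrapped command and polling interval are illustrative placeholders):

```python
#!/usr/bin/env python
# Sketch: run a command and sample its VmRSS from /proc while it runs,
# to check whether memory keeps growing until the kernel kills it.
import subprocess
import sys
import time

def run_and_watch(cmd, poll_s=5):
    proc = subprocess.Popen(cmd)
    peak_kb = 0
    while proc.poll() is None:
        try:
            with open("/proc/%d/status" % proc.pid) as f:
                for line in f:
                    if line.startswith("VmRSS:"):
                        peak_kb = max(peak_kb, int(line.split()[1]))
        except IOError:
            pass  # the process may exit between poll() and the read
        time.sleep(poll_s)
    # Note: subprocess reports a killed child as a negative signal number (-9),
    # whereas the shell reports 128 + 9 = 137 for the same event.
    print("return code: %s, peak RSS: %.1f MB" % (proc.returncode, peak_kb / 1024.0))
    return proc.returncode

if __name__ == "__main__":
    # e.g. python watch_rss.py bash survey.sh   (arguments are illustrative)
    sys.exit(run_and_watch(sys.argv[1:]) != 0)
```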
Dear genproduction maintainers,
we have been trying for some months to generate gridpacks for two VBS processes with EFT contributions: VBS OSWW in the fully leptonic final state [1] and VBS WV in the semileptonic final state [2]. As support for this thread we prepared some slides with all the tests we did; they can be found here. We are using cmsconnect, and we modified the central scripts only to download the correct UFO models (here) and to avoid memory-related compilation issues (here).
While submitting integrate steps on condor, we found that multiple jobs go into the hold state with hold code 13 and subcode 2 (condor_starter or shadow failed to send job), and eventually the computation stops.
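For reference, the hold codes can be read off the held jobs with the HTCondor Python bindings; a sketch, assuming the bindings are available on the cmsconnect submit node (the exact `Schedd.query` keywords may differ between bindings versions):

```python
# Sketch: list held jobs together with their hold reason codes.
import htcondor

schedd = htcondor.Schedd()
held = schedd.query(
    constraint="JobStatus == 5",  # 5 = HELD
    projection=["ClusterId", "ProcId", "HoldReasonCode",
                "HoldReasonSubCode", "HoldReason"],
)
for ad in held:
    print("%s.%s  code=%s subcode=%s  %s" % (
        ad["ClusterId"], ad["ProcId"], ad["HoldReasonCode"],
        ad["HoldReasonSubCode"], ad["HoldReason"]))
```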
Running the same cards locally, we see that the generation also fails: survey.sh ends with non-zero status 137.
After a quick search online, we found that this could point to a memory issue, meaning the survey.sh job requires too much RAM and the OS issues a SIGKILL.
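For the record, the decoding we relied on is the usual shell convention that an exit status above 128 means "killed by signal (status − 128)", so 137 corresponds to signal 9, SIGKILL:

```python
import signal

status = 137                  # exit status reported for survey.sh
sig = status - 128            # shell convention: 128 + signal number
assert sig == signal.SIGKILL  # 9 -> the job was killed, consistent with an OOM kill
```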
Is this behaviour known? Maybe we are doing something wrong. Do you have any tests we could run to better understand this issue?
Thank you for your time. Best,
Giacomo
[1]
[2]