Open tommasodiotalevi opened 2 years ago
A new Issue was created by @Tommaso93 Tommaso Diotalevi.
@Dr15Jones, @perrotta, @dpiparo, @makortel, @smuzaffar, @qliphy can you please review it and eventually sign/assign? Thanks.
Assign generators, simulation
New categories assigned: generators,simulation
@mdhildreth,@mkirsano,@alberto-sanchez,@SiewYan,@GurpreetSinghChahal,@Saptaparna,@civanch you have been requested to review this Pull request/Issue and eventually sign? Thanks
In this job there are G4Exception messages. There are two kinds of messages: 1) a zero step in the magnetic field, followed by a push of 10^-7 mm; 2) energy non-conservation in a K0L interaction with a hydrogen target at 50 GeV.
Both are warnings, which may happen in CMSSW_7_1 from time to time but should not crash a run.
The job is killed after the message: "Killing PID 720". It is not clear which class is responsible for it. It is not obvious that it is Geant4, because Geant4 should not handle a "PID".
In release 7_1 the error tracing was not ideal, so we need to identify the source of this message and the reason for it.
One possible exercise: change the initial random seed.
Hi @civanch, if you mean the seed set by McM during validation, the PdmV people actually changed it many times, with the same problems each time.
From the log,
Killing PID 720
...
/pool/condor/dir_48577/HIG-RunIISummer15wmLHEGS-05269_1_threads_test.sh: line 65: 720 Aborted cmsRun -e -j $REPORT_NAME HIG-RunIISummer15wmLHEGS-05269_1_cfg.py
could hint that 720 is the process ID of cmsRun. If that is the case, something external to cmsRun killed it. Between
Begin processing the 1st record. Run 1, Event 1, LumiSection 1 at 01-Apr-2022 22:03:19.957 UTC
...
Begin processing the 618th record. Run 1, Event 618, LumiSection 1 at 02-Apr-2022 05:31:49.201 UTC
the event loop had run for almost 7.5 hours. What is the time limit for the McM validation jobs?
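For reference, the "almost 7.5 hours" figure can be re-derived from the two log lines quoted above. The snippet below is only a sketch: the regular expression and the helper name `parse_record` are illustrative assumptions, and the per-event average assumes that 617 events had completed when the 618th record began.

```python
import re
from datetime import datetime

# The two "Begin processing" lines quoted from the validation log.
LOG_LINES = [
    "Begin processing the 1st record. Run 1, Event 1, LumiSection 1 at 01-Apr-2022 22:03:19.957 UTC",
    "Begin processing the 618th record. Run 1, Event 618, LumiSection 1 at 02-Apr-2022 05:31:49.201 UTC",
]

# Hypothetical pattern: record index, then the date and time before "UTC".
RECORD_RE = re.compile(
    r"Begin processing the (\d+)(?:st|nd|rd|th) record\..* at (\S+ \S+) UTC"
)


def parse_record(line):
    """Return (record index, timestamp) for one 'Begin processing' line."""
    match = RECORD_RE.search(line)
    if match is None:
        raise ValueError(f"not a 'Begin processing' line: {line!r}")
    stamp = datetime.strptime(match.group(2), "%d-%b-%Y %H:%M:%S.%f")
    return int(match.group(1)), stamp


first_index, first_time = parse_record(LOG_LINES[0])
last_index, last_time = parse_record(LOG_LINES[1])

elapsed_seconds = (last_time - first_time).total_seconds()
elapsed_hours = elapsed_seconds / 3600.0
# Events fully processed between the two records (assumption: 618 - 1 = 617).
seconds_per_event = elapsed_seconds / (last_index - first_index)

print(f"elapsed: {elapsed_hours:.2f} h, average: {seconds_per_event:.1f} s/event")
```

With these two timestamps the elapsed time is about 7.47 hours, i.e. roughly 44 seconds per event on average.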
For the ExternalLHEProducer exception in particular, I see that the code in master has a way to prevent that exception for validation cases by setting the VALIDATION_RUN environment variable
https://github.com/cms-sw/cmssw/blob/966f7845ce97586ec1cd97417b1d6561e20d7f6c/GeneratorInterface/LHEInterface/plugins/ExternalLHEProducer.cc#L346-L357
but 7_1_47 does not have this functionality
https://github.com/cms-sw/cmssw/blob/61ec43d3731b9e678c3f54fd25651083a0609b00/GeneratorInterface/LHEInterface/plugins/ExternalLHEProducer.cc#L298-L302
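The difference between the two versions can be illustrated with a small Python sketch of the guard logic. This is not the CMSSW code (which is C++ and is linked above); the function and class names are hypothetical, and treating the mere presence of VALIDATION_RUN (rather than a specific value) as the switch is an assumption.

```python
import os


class EventGenerationFailure(RuntimeError):
    """Stand-in for the fatal exception category seen in the log."""


def check_leftover_lhe_events(has_leftover_events, environ=None):
    """Decide what to do at end of run if LHE events remain unprocessed.

    Hypothetical helper: returns "ok" if nothing is left, "ignored" if
    VALIDATION_RUN is set, and raises otherwise.
    """
    environ = os.environ if environ is None else environ
    if not has_leftover_events:
        return "ok"
    if "VALIDATION_RUN" in environ:
        # Assumed behaviour with the guard: downgrade to a warning and go on.
        print("warning: leftover LHE events ignored (VALIDATION_RUN is set)")
        return "ignored"
    # Behaviour without the guard: always a fatal exception.
    raise EventGenerationFailure(
        "Event loop is over, but there are still lhe events to process."
    )
```

Called with `environ={"VALIDATION_RUN": "1"}`, the helper returns "ignored"; with an empty environment it raises, which corresponds to what a release without the guard would do in every case.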
@makortel , may this code be backported to 7_1_X?
One cannot exclude that the cause of the problem is an infinite loop inside Geant4. Loop controls were introduced later. It is not possible to backport them everywhere, but if the class responsible for the infinite loop is identified, a fix may be added.
Perhaps try running w/o the SIM step to completely confirm the cause of the crash?
may this code be backported to 7_1_X?
Actually it has already been backported to 7_1_X in https://github.com/cms-sw/cmssw/pull/34362, but after 7_1_47, and (AFAICT) there have been no releases in 7_1_X since.
@Tommaso93, the workaround of reducing the statistics per run is likely the best you can do. Any software patch to this release may be a problem.
If the problem is inside Geant4, then the loop(s) that provoke the crash are likely not infinite but very long. In Geant4, many fixes and protections against long loops were introduced after 7_1. Most are included in the Run-2 legacy 10_6 (Geant4 10.4.3).
If we do want to understand what happens, then the first step would be to run a test in which the GEN-SIM step is replaced by a GEN-only step, as @davidlange6 proposed. This would be a minimal check. If Geant4 is responsible, I am not sure we should take further steps.
Dear experts, this issue is a follow-up of this thread on CMS talk: https://cms-talk.web.cern.ch/t/externallheproducer-fatal-exception-lhe-file-contains-more-events-than-requested/8904
During the validation of several requests on McM (e.g. HIG-RunIISummer15wmLHEGS-05269), the following fatal error appears:
----- Begin Fatal Exception -----------------------
An exception of category 'EventGenerationFailure' occurred while
[0] Processing run: 1
[1] Calling endRun for unscheduled module ExternalLHEProducer/'externalLHEProducer'
Exception Message:
Error in ExternalLHEProducer::endRunProduce(). Event loop is over, but there are still lhe events to process.This could happen if lhe file contains more events than requested. This is never expected to happen.
----- End Fatal Exception ------------------------------------------------
From the PdmV side, several attempts have been made to change the SEED during validation, without success. From the GEN side, the request setup is correct. Attached you can find the log of the validation failure: [McM] Validation failed for HIG-RunIISummer15wmLHEGS-05269.pdf. From the log, it seems that G4 crashes after some Pythia8 errors that do not stop the execution of the job.
Currently this issue is blocking the McM validation, hence the MC production of these requests.
Best, Tommaso
PS: A workaround that has worked for the moment is to increase the time/event, tricking the system into running fewer events and therefore avoiding the bug. This of course does not solve the issue but only avoids it, and the failure can still happen at tier sites.
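A rough sketch of why the workaround reduces the number of events, assuming the validation sizes the job as a fixed wall-time budget divided by the declared time/event. The exact formula used by McM is not stated in this thread, so the function name and all the numbers below are hypothetical.

```python
def events_for_budget(budget_seconds, declared_time_per_event):
    """Number of events that fit in a fixed time budget (assumed sizing rule)."""
    if declared_time_per_event <= 0:
        raise ValueError("time/event must be positive")
    return int(budget_seconds // declared_time_per_event)


# Hypothetical 8-hour budget; doubling the declared time/event halves the events.
budget = 8 * 3600
print(events_for_budget(budget, 40.0))  # declared 40 s/event
print(events_for_budget(budget, 80.0))  # declared 80 s/event
```

Under this assumption a larger declared time/event only shortens the test job; the real per-event behaviour is unchanged, which is why the failure can still show up in longer production jobs.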