JeffersonLab / halld_sim

Simulation for the GlueX Experiment in Hall D

mcsmear failing #78

Closed T-Britton closed 5 years ago

T-Britton commented 5 years ago

MCwrapper-bot can rerun failed jobs, up to a maximum of 10 retries (so no infinite loops of failures).

The first hint of something going wrong came with project 624 on 07-19-2019. Everything seemed fine until the last 1.11% (4 jobs), which exceeded the 10 retries; in fact they have never succeeded. This was annoying but not critical. Peter Pauli then submitted project 681 (09-06-2019), which quickly stalled out at ~50%. This was around the same time as the ccdb issue, and that project was suspended. Nacer submitted 10 projects on 09-17-2019, and none of them has reached 100% completion: 3 stalled out at 60-70% (after some forcing) and the rest stalled out at around 93-99%.

Looking at the version sets used, projects using recon-2018_01-ver02_2.xml stall out at around 50-70% (Peter's and Nacer's projects), while projects using recon-2017_01-ver03_11.xml stall out above 90% (Alex's and Nacer's). Another commonality is the analysis version: the 50% stall-outs request analysis-2018_01-ver02.xml and the 90% ones request analysis-2017_01-ver30.xml. However, I had placed a record in the database of which program threw the error, and the culprit is mcsmear, so not the analysis sets.

To give a scale of the problem: since Nacer's project (689) a total of 71351 jobs have been submitted to the OSG, and 53789 of them failed. Of the 71k jobs, 51948 failed with mcsmear throwing the error. I have not gone digging into the logs, but thought this should be brought up with the software group; Mark I. suggested I put this info in an issue.
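For a sense of scale, the three job counts quoted in this comment can be turned into rates. This is only a small arithmetic sketch; the variable and function names are illustrative and are not taken from MCwrapper.

```python
# Job counts quoted in this comment (since project 689).
submitted = 71351
failed = 53789
failed_in_mcsmear = 51948


def percent(part, whole):
    """Return part/whole as a percentage, rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


failure_rate = percent(failed, submitted)
mcsmear_share_of_failures = percent(failed_in_mcsmear, failed)
mcsmear_share_of_all_jobs = percent(failed_in_mcsmear, submitted)

print(f"failed jobs / submitted jobs:     {failure_rate}%")
print(f"mcsmear failures / failed jobs:   {mcsmear_share_of_failures}%")
print(f"mcsmear failures / submitted:     {mcsmear_share_of_all_jobs}%")
```

In other words, roughly three out of four submitted jobs failed, and nearly all of those failures were attributed to mcsmear.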

markito3 commented 5 years ago

Nacer sent a related problem report to the software help list.

nacer-h commented 5 years ago

++++++++++++++++++++ log outputs with: recon-2017_01-ver03_11.xml +++++++++++++++

ERROR: ld.so: object '/$LIB/libkeepalive.so' from LD_PRELOAD cannot be preloaded: ignored. ERROR: ld.so: object '/$LIB/libkeepalive.so' from LD_PRELOAD cannot be preloaded: ignored. ERROR: ld.so: object '/$LIB/libkeepalive.so' from LD_PRELOAD cannot be preloaded: ignored. ERROR: ld.so: object '/$LIB/libkeepalive.so' from LD_PRELOAD cannot be preloaded: ignored. Using runNo: 30965

Maximum number of events: 20000 Opening file genr8_030965_004.ascii for output. Reading: targetp.x targetp.y targetp.z targetMass Found: 0.000000 0.000000 0.000000 0.938272 Reading: t-channelSlope Found: 1.110000 Reading: number of particles need to describe the decay Found: 8 Reading: part# chld1# chld2# prnt# Id nchld mass width chrg flag
Found: 0 -1 -1 -1 14 0 0.938272 0.000000 1 11 Found: 1 2 3 -1 0 2 2.188000 0.083000 0 0 Found: 2 4 5 1 0 2 1.019461 0.004266 0 0 Found: 3 6 7 1 0 2 1.000000 1.000000 0 0 Found: 4 6 7 2 11 0 0.493677 0.000000 1 11 Found: 5 6 7 2 12 0 0.493677 0.000000 -1 11 Found: 6 6 7 3 8 0 0.139570 0.000000 1 11 Found: 7 6 7 3 9 0 0.139570 0.000000 -1 11 Found EOI---- Input File appears Fine. Calculating Lorentz Factor: 9000 ^MCalculating Lorentz Factor: 8000 ^MCalculating Lorentz Factor: 7000 ^MCalculating Lorentz Factor: 6000 ^MCalculating Lorentz Factor: 5000 ^MCalculating Lorentz Factor: 4000 ^MCalculating Lorentz Factor: 3000 ^MCalculating Lorentz Factor: 2000 ^MCalculating Lorentz Factor: 1000 ^MCalculating Lorentz Factor: 100 ^MCalculating Lorentz Factor: 90 ^MCalculating Lorentz Factor: 80 ^MCalculating Lorentz Factor: 70 ^MCalculating Lorentz Factor: 60 ^MCalculating Lorentz Factor: 50 ^MCalculating Lorentz Factor: 40 ^MCalculating Lorentz Factor: 30 ^MCalculating Lorentz Factor: 20 ^MCalculating Lorentz Factor: 10 ^MCalculating Lorentz Factor: 0 ^MEvents generated: 100 Events accepted: 40 ^MEvents generated: 200 Events accepted: 81 ^MEvents generated: 300 Events accepted: 121 ^MEvents generated: 400 Events accepted: 162 ^MEvents generated: 500 Events accepted: 202 ^MEvents generated: 600 Events accepted: 241 ^MEvents generated: 700 Events accepted: 273 ^MEvents generated: 800 Events accepted: 309 ^MEvents generated: 900 Events accepted: 349 ^MEvents generated: 1000 Events accepted: 388 ^MEvents generated: 10000 Events accepted: 4038 ^MEvents generated: 20000 Events accepted: 8038 ^MEvents generated: 30000 Events accepted: 11990 ^MEvents generated: 40000 Events accepted: 15945 ^MEvents generated: 50000 Events accepted: 19963 ^MMax Lorentz Factor:0.070507 Events generated:50102 Events accepted:20000

Warning from GlueXDetectorConstruction::ConstructSDandField - unsupported sensitive volume TAC1 found in geometry definition. G4WT0 > Warning from GlueXDetectorConstruction::ConstructSDandField - unsupported sensitive volume TAC1 found in geometry definition. JANA ERROR>>thread 0 has stalled on run:30965 event:1 JANA ERROR>> Thread 0 (2ba134402700) hasn't responded in 3601 seconds. (run:event=30965:1) Cancelling ... JANA ERROR>> Caught HUP signal for thread 0x2ba134402700 thread exiting... JANA ERROR>> JANA ERROR>> Last thread to lock output file mutex: 0x2ba134402700 JANA ERROR>> Attempting to unlock mutex to avoid deadlock. JANA ERROR>> However, the output file is likely corrupt at JANA ERROR>> this point and the process should be restarted ... JANA ERROR>> Generating stack trace... JANA ERROR>> JANA ERROR>> Automatic relaunching of threads is disabled. If you wish to JANA ERROR>> have the program relaunch a replacement thread when a stalled JANA ERROR>> one is killed, set the JANA:MAX_RELAUNCH_THREADS configuration JANA ERROR>> parameter to a value greater than zero. E.g.: JANA ERROR>> JANA ERROR>> jana -PJANA:MAX_RELAUNCH_THREADS=10 JANA ERROR>> JANA ERROR>> The program will quit now. 0x00002ba10a114552 in XrdCl::XRootDMsgHandler::Process(XrdCl::Message) + 0x572 from /lib64/libXrdCl.so.2 0x00002ba10a0f53be in XrdCl::Stream::HandleIncMsgJob::Run(void) + 0xe from /lib64/libXrdCl.so.2 0x00002ba10a15f4ff in XrdCl::JobManager::RunJobs() + 0xaf from /lib64/libXrdCl.so.2 0x00002ba10a15f759 in from /lib64/libXrdCl.so.2 0x00002ba106b1fdd5 in from /lib64/libpthread.so.0 0x00002ba107855ead in clone + 0x6d from /lib64/libc.so.6 ./MakeMC.sh: line 1237: 530 Aborted (core dumped) mcsmear $MCSMEAR_Flags -PTHREAD_TIMEOUT_FIRST_EVENT=3600 -PTHREAD_TIMEOUT=3000 -o$STANDARD_NAME_geant$GEANTVER_smeared.hddm $STANDARD_NAME_geant$GEANTVER.hddm $XRD_RANDOMS_URL/random_triggers//$RANDBGTAG/run$formatted_runNumber_random.hddm\:1+$fold_skip_num

SIGABRT: abort PC=0x47293b m=0 sigcode=0

goroutine 1 [running, locked to thread]: syscall.RawSyscall(0x3e, 0x1aab3f, 0x6, 0x0, 0xc0002cbef0, 0x48f562, 0x1aab3f) /usr/lib/golang/src/syscall/asm_linux_amd64.s:78 +0x2b fp=0xc0002cbeb8 sp=0xc0002cbeb0 pc=0x47293b syscall.Kill(0x1aab3f, 0x6, 0x4377de, 0xc0002cbf20) /usr/lib/golang/src/syscall/zsyscall_linux_amd64.go:597 +0x4b fp=0xc0002cbf00 sp=0xc0002cbeb8 pc=0x46f2eb github.com/sylabs/singularity/internal/app/starter.Master.func4() internal/app/starter/master_linux.go:158 +0x3e fp=0xc0002cbf38 sp=0xc0002cbf00 pc=0x8d7ace github.com/sylabs/singularity/internal/pkg/util/mainthread.Execute.func1() internal/pkg/util/mainthread/mainthread.go:20 +0x2f fp=0xc0002cbf60 sp=0xc0002cbf38 pc=0x876eaf main.main() cmd/starter/main_linux.go:102 +0x68 fp=0xc0002cbf98 sp=0xc0002cbf60 pc=0x8d8308 runtime.main() /usr/lib/golang/src/runtime/proc.go:201 +0x207 fp=0xc0002cbfe0 sp=0xc0002cbf98 pc=0x42fa87 runtime.goexit() /usr/lib/golang/src/runtime/asm_amd64.s:1333 +0x1 fp=0xc0002cbfe8 sp=0xc0002cbfe0 pc=0x45b5a1

goroutine 5 [syscall]: os/signal.signal_recv(0xaa51a0) /usr/lib/golang/src/runtime/sigqueue.go:139 +0x9c os/signal.loop() /usr/lib/golang/src/os/signal/signal_unix.go:23 +0x22 created by os/signal.init.0 /usr/lib/golang/src/os/signal/signal_unix.go:29 +0x41

goroutine 7 [chan receive]: github.com/sylabs/singularity/internal/pkg/util/mainthread.Execute(0xc0001a2080) internal/pkg/util/mainthread/mainthread.go:23 +0xb4 github.com/sylabs/singularity/internal/app/starter.Master(0xc, 0xa, 0x2c00, 0x1aab6d, 0xc00000cb20) internal/app/starter/master_linux.go:157 +0x44e main.startup() cmd/starter/main_linux.go:73 +0x563 created by main.main cmd/starter/main_linux.go:98 +0x3e

rax 0x0 rbx 0x0 rcx 0xffffffffffffffff rdx 0x0 rdi 0x1aab3f rsi 0x6 rbp 0xc0002cbef0 rsp 0xc0002cbeb0 r8 0x0 r9 0x0 r10 0x0 r11 0x206 r12 0xc r13 0xff r14 0xa99e3c r15 0x0 rip 0x47293b rflags 0x206 cs 0x33 fs 0x0 gs 0x0

Closed HDDM file 0 event written to genr8_030965_004_geant4_smeared.hddm JANA >>Merging event reader thread ... JANA >> 0 events processed (11 events read) Average rate: 0.0Hz

Something went wrong with mcsmear status code: 134 Job finished with exit code 134
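A note on reading the status code: POSIX shells report 128 + N when a command is killed by signal N, so 134 corresponds to signal 6, which is SIGABRT on Linux. That matches the "Aborted (core dumped)" and "SIGABRT: abort" lines in the log. The helper below is a minimal sketch of that decoding; the function name is made up for illustration, and a program could in principle also call exit(134) on its own.

```python
import signal


def describe_exit_status(status):
    """Interpret a shell-style exit status.

    Assumes the shell convention that a status above 128 means the
    command was terminated by signal (status - 128).
    """
    if status > 128:
        signum = status - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number not known on this platform.
            return f"killed by unknown signal {signum}"
        return f"killed by {name} (signal {signum})"
    return f"exited normally with code {status}"


print(describe_exit_status(134))
```

On a Linux host this reports SIGABRT for status 134, which is consistent with JANA aborting the process after the thread stall.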

##############################

WARNING: container does not have /.singularity.d/actions/exec, calling /srv/.osgvo-user-job-wrapper.sh directly Using runNo: 11455 Maximum number of events: 1350 Opening file genr8_011455_003.ascii for output. Reading: targetp.x targetp.y targetp.z targetMass Found: 0.000000 0.000000 0.000000 0.938272 Reading: t-channelSlope Found: 1.160000 Reading: number of particles need to describe the decay Found: 8 Reading: part# chld1# chld2# prnt# Id nchld mass width chrg flag
Found: 0 -1 -1 -1 14 0 0.938272 0.000000 1 11 Found: 1 2 3 -1 0 2 2.188000 0.083000 0 0 Found: 2 4 5 1 0 2 1.019461 0.004266 0 0 Found: 3 6 7 1 0 2 1.000000 1.000000 0 0 Found: 4 6 7 2 11 0 0.493677 0.000000 1 11 Found: 5 6 7 2 12 0 0.493677 0.000000 -1 11 Found: 6 6 7 3 8 0 0.139570 0.000000 1 11 Found: 7 6 7 3 9 0 0.139570 0.000000 -1 11 Found EOI---- Input File appears Fine. Calculating Lorentz Factor: 9000 ^MCalculating Lorentz Factor: 8000 ^MCalculating Lorentz Factor: 7000 ^MCalculating Lorentz Factor: 6000 ^MCalculating Lorentz Factor: 5000 ^MCalculating Lorentz Factor: 4000 ^MCalculating Lorentz Factor: 3000 ^MCalculating Lorentz Factor: 2000 ^MCalculating Lorentz Factor: 1000 ^MCalculating Lorentz Factor: 100 ^MCalculating Lorentz Factor: 90 ^MCalculating Lorentz Factor: 80 ^MCalculating Lorentz Factor: 70 ^MCalculating Lorentz Factor: 60 ^MCalculating Lorentz Factor: 50 ^MCalculating Lorentz Factor: 40 ^MCalculating Lorentz Factor: 30 ^MCalculating Lorentz Factor: 20 ^MCalculating Lorentz Factor: 10 ^MCalculating Lorentz Factor: 0 ^MEvents generated: 100 Events accepted: 42 ^MEvents generated: 200 Events accepted: 77 ^MEvents generated: 300 Events accepted: 125 ^MEvents generated: 400 Events accepted: 164 ^MEvents generated: 500 Events accepted: 200 ^MEvents generated: 600 Events accepted: 240 ^MEvents generated: 700 Events accepted: 274 ^MEvents generated: 800 Events accepted: 312 ^MEvents generated: 900 Events accepted: 356 ^MEvents generated: 1000 Events accepted: 394 ^MMax Lorentz Factor:0.075927 Events generated:3553 Events accepted:1350

Warning from GlueXDetectorConstruction::ConstructSDandField - unsupported sensitive volume TAC1 found in geometry definition. G4WT0 > Warning from GlueXDetectorConstruction::ConstructSDandField - unsupported sensitive volume TAC1 found in geometry definition. JANA ERROR>>thread 0 has stalled on run:11455 event:1 JANA ERROR>> Thread 0 (2ad873d7e700) hasn't responded in 3601 seconds. (run:event=11455:1) Cancelling ... JANA ERROR>> Caught HUP signal for thread 0x2ad873d7e700 thread exiting... JANA ERROR>> JANA ERROR>> Last thread to lock output file mutex: 0x2ad873d7e700 JANA ERROR>> Attempting to unlock mutex to avoid deadlock. JANA ERROR>> However, the output file is likely corrupt at JANA ERROR>> this point and the process should be restarted ... JANA ERROR>> JANA ERROR>> JANA ERROR>> Automatic relaunching of threads is disabled. If you wish to JANA ERROR>> have the program relaunch a replacement thread when a stalled JANA ERROR>> one is killed, set the JANA:MAX_RELAUNCH_THREADS configuration JANA ERROR>> parameter to a value greater than zero. E.g.: JANA ERROR>> JANA ERROR>> jana -PJANA:MAX_RELAUNCH_THREADS=10 JANA ERROR>> JANA ERROR>> The program will quit now. (END)
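When many job logs have to be triaged, the JANA stall message can be pulled out with a regular expression. The sketch below is hedged: the pattern is written against the message exactly as it appears in the logs in this thread, other JANA versions may word it differently, and the names `STALL_RE` and `find_stalls` are illustrative only.

```python
import re

# Pattern modelled on the stall message as it appears in the logs in this thread.
STALL_RE = re.compile(
    r"Thread (?P<thread>\d+) \((?P<tid>[0-9a-f]+)\) hasn't responded in "
    r"(?P<seconds>\d+) seconds\. \(run:event=(?P<run>\d+):(?P<event>\d+)\)"
)


def find_stalls(log_text):
    """Return one dict per stall message found in log_text."""
    stalls = []
    for match in STALL_RE.finditer(log_text):
        stalls.append(
            {
                "thread": int(match.group("thread")),
                "tid": match.group("tid"),
                "seconds": int(match.group("seconds")),
                "run": int(match.group("run")),
                "event": int(match.group("event")),
            }
        )
    return stalls


# Sample line copied from the log above.
sample = (
    "JANA ERROR>> Thread 0 (2ad873d7e700) hasn't responded in 3601 seconds. "
    "(run:event=11455:1) Cancelling ..."
)
stalls = find_stalls(sample)
print(stalls)
```

A stall on event 1 after 3601 seconds, just past the `-PTHREAD_TIMEOUT_FIRST_EVENT=3600` setting on the mcsmear command line, means the job never got past its first event.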

sdobbs commented 5 years ago

I haven't been able to reproduce this on the command line yet. I do wonder if it is a problem with random trigger file distribution?


nacer-h commented 5 years ago

++++++++++++++++++++ log outputs with: recon-2018_01-ver02_2.xml +++++++++++++++

ERROR: ld.so: object '/$LIB/libkeepalive.so' from LD_PRELOAD cannot be preloaded: ignored. ERROR: ld.so: object '/$LIB/libkeepalive.so' from LD_PRELOAD cannot be preloaded: ignored. Using runNo: 42446 Maximum number of events: 20000 Opening file genr8_042446_006.ascii for output. Reading: targetp.x targetp.y targetp.z targetMass Found: 0.000000 0.000000 0.000000 0.938272 Reading: t-channelSlope Found: 1.230000 Reading: number of particles need to describe the decay Found: 8 Reading: part# chld1# chld2# prnt# Id nchld mass width chrg flag
Found: 0 -1 -1 -1 14 0 0.938272 0.000000 1 11 Found: 1 2 3 -1 0 2 2.188000 0.083000 0 0 Found: 2 4 5 1 0 2 1.019461 0.004266 0 0 Found: 3 6 7 1 0 2 1.000000 1.000000 0 0 Found: 4 6 7 2 11 0 0.493677 0.000000 1 11 Found: 5 6 7 2 12 0 0.493677 0.000000 -1 11 Found: 6 6 7 3 8 0 0.139570 0.000000 1 11 Found: 7 6 7 3 9 0 0.139570 0.000000 -1 11 Found EOI---- Input File appears Fine. Calculating Lorentz Factor: 9000 ^MCalculating Lorentz Factor: 8000 ^MCalculating Lorentz Factor: 7000 ^MCalculating Lorentz Factor: 6000 ^MCalculating Lorentz Factor: 5000 ^MCalculating Lorentz Factor: 4000 ^MCalculating Lorentz Factor: 3000 ^MCalculating Lorentz Factor: 2000 ^MCalculating Lorentz Factor: 1000 ^MCalculating Lorentz Factor: 100 ^MCalculating Lorentz Factor: 90 ^MCalculating Lorentz Factor: 80 ^MCalculating Lorentz Factor: 70 ^MCalculating Lorentz Factor: 60 ^MCalculating Lorentz Factor: 50 ^MCalculating Lorentz Factor: 40 ^MCalculating Lorentz Factor: 30 ^MCalculating Lorentz Factor: 20 ^MCalculating Lorentz Factor: 10 ^MCalculating Lorentz Factor: 0 ^MEvents generated: 100 Events accepted: 45 ^MEvents generated: 200 Events accepted: 99 ^MEvents generated: 300 Events accepted: 138 ^MEvents generated: 400 Events accepted: 182 ^MEvents generated: 500 Events accepted: 217 ^MEvents generated: 600 Events accepted: 261 ^MEvents generated: 700 Events accepted: 295 ^MEvents generated: 800 Events accepted: 338 ^MEvents generated: 900 Events accepted: 372 ^MEvents generated: 1000 Events accepted: 406 ^MEvents generated: 10000 Events accepted: 3977 ^MEvents generated: 20000 Events accepted: 8048 ^MEvents generated: 30000 Events accepted: 11959 ^MEvents generated: 40000 Events accepted: 15957 ^MEvents generated: 50000 Events accepted: 19960 ^MMax Lorentz Factor:0.070956 Events generated:50129 Events accepted:20000

Warning from GlueXDetectorConstruction::ConstructSDandField - unsupported sensitive volume TAC1 found in geometry definition. G4WT0 > Warning from GlueXDetectorConstruction::ConstructSDandField - unsupported sensitive volume TAC1 found in geometry definition. JANA ERROR>>thread 0 has stalled on run:42446 event:1 JANA ERROR>> Thread 0 (2b1f46a28700) hasn't responded in 3601 seconds. (run:event=42446:1) Cancelling ... JANA ERROR>> Caught HUP signal for thread 0x2b1f46a28700 thread exiting... JANA ERROR>> JANA ERROR>> Last thread to lock output file mutex: 0x2b1f46a28700 JANA ERROR>> Attempting to unlock mutex to avoid deadlock. JANA ERROR>> However, the output file is likely corrupt at JANA ERROR>> this point and the process should be restarted ... JANA ERROR>> Generating stack trace... JANA ERROR>> JANA ERROR>> Automatic relaunching of threads is disabled. If you wish to JANA ERROR>> have the program relaunch a replacement thread when a stalled JANA ERROR>> one is killed, set the JANA:MAX_RELAUNCH_THREADS configuration JANA ERROR>> parameter to a value greater than zero. E.g.: JANA ERROR>> JANA ERROR>> jana -PJANA:MAX_RELAUNCH_THREADS=10 JANA ERROR>> JANA ERROR>> The program will quit now. 
: ####################### Job started: Thu Oct 10 11:35:10 UTC 2019 Simulating the Experiment: GlueX ccdbsqlite path: batch_default sqlite:////group/halld/www/halldweb/html/dist/ccdb.sqlite rcdbsqlite path: batch_default sqlite:////group/halld/www/halldweb/html/dist/rcdb.sqlite Producing file number: 6 Containing: 20000/50000 events Running location: ./ Output location: ./ Environment file: /srv/recon-2018_01-ver02_2.xml Analysis Environment file: /srv/analysis-2018_01-ver02.xml Context: variation=mc Reconstruction calibtime: notime Run Number: 42446 Electron beam current to use: .15006700000000000000 uA Electron beam energy to use: 11.6232 GeV Radiator Thickness to use: 58e-6 m Collimator Diameter: 50 m Photon Energy between 3.0 and 11.60 GeV Polarization Angle: 135.0 degrees Coherent Peak position: 8.85000000000000000000

Run generation step? 1 Will be cleaned? 1 Flux Hist to use: ccdb : unset Polarization to use: 0.4 : unset Using genr8 with config: /srv/696_yphi2pi_18.input

Run geant step? 1 Will be cleaned? 1 Using geant4 Custom Gcontrol? 0 Background to use: Random Random trigger background to use: recon-2018_01-ver02 BGRATE will be set to: rcdb GHz (if applicable) Run mcsmear? 1 Will be cleaned? 1

Run reconstruction? 1 Will be cleaned? 0 With additional plugins: file:/srv/696_jana.config

=======SOFTWARE USED======= MCwrapper version v2.3.0 MCwrapper location /group/halld/Software/builds/Linux_CentOS7-x86_64-gcc4.8.5-cntr/gluex_MCwrapper/gluex_MCwrapper-v2.0.5 LDPRELOAD: /usr/lib64/libXrdPosixPreload.so Streaming via xrootd? 1 Event Count: 558281 BC /usr/bin/bc python /bin/python /group/halld/Software/builds/Linux_CentOS7-x86_64-gcc4.8.5-cntr/halld_sim/halld_sim-4.5.0^r1801_2/Linux_CentOS7-x86_64-gcc4.8.5-cntr/bin/genr8 /group/halld/Software/builds/Linux_CentOS7-x86_64-gcc4.8.5-cntr/hdgeant4/hdgeant4-2.4.0^r1801_2/bin/Linux-g++/hdgeant4 /group/halld/Software/builds/Linux_CentOS7-x86_64-gcc4.8.5-cntr/halld_sim/halld_sim-4.5.0^r1801_2/Linux_CentOS7-x86_64-gcc4.8.5-cntr/bin/mcsmear /group/halld/Software/builds/Linux_CentOS7-x86_64-gcc4.8.5-cntr/halld_recon/halld_recon-recon-2018_01-ver02/Linux_CentOS7-x86_64-gcc4.8.5-cntr/bin/hd_root

Finding the right file to fold in during MCsmear step gathering jana config file input file found configuring genr8 696_yphi2pi_18.input RUNNING GENR8 Setting random number seed to: 1570707733

BeamProperties: Parsing config file genr8_042446_006_beam.conf

BeamProperties: Using flux from CCDB run 42446

BeamProperties: Using fixed polarization = 0.4 Wrote event 1000^MWrote event 2000^MWrote event 3000^MWrote event 4000^MWrote event 5000^MWrote event 6000^MWrote event 7000^MWrote event 8000^MWrote event 9000^MWrote event 10000^MWrote event 11000^MWrote event 12000^MWrote event 13000 ^MWrote event 14000^MWrote event 15000^MWrote event 16000^MWrote event 17000^MWrote event 18000^MWrote event 19000^MWrote event 20000^MWrote 20000 events to genr8_042446_006.hddm RUNNING GEANT4 (11.0 events read) 0.0Hz (avg.: 0.0Hz) ^MJANA >> JANA >>Telling all threads to quit ... JANA >>Merging thread 0 (0x2b1f46a28700) ...

Closed HDDM file 0 event written to genr8_042446_006_geant4_smeared.hddm JANA >>Merging event reader thread ... JANA >> 0 events processed (11 events read) Average rate: 0.0Hz

Something went wrong with mcsmear status code: 134 Job finished with exit code 134

sdobbs commented 5 years ago

Hm, I'm not able to recreate this on the command line...


nacer-h commented 5 years ago

Hi Sean, did you also include random triggers in your test?

nacer-h commented 5 years ago

Here are the three full log outputs for one run, attached: out_yphifo_16_20190917080244am_11436_3.log, OSG_yphifo_16_20190917080244am_11436_3.log, error_yphifo_16_20190917080244am_11436_3.log.

T-Britton commented 5 years ago

After investigation we narrowed it down to xrootd streaming of some of the random trigger files. After putting in an override, the stalled jobs began running normally. The stall percentage seems to be related to the number of random trigger files that did not have owner write permission. Wess seems to recall reading something about that in the past. One project (Alex A's) had 4 jobs that would NOT finish successfully; lo and behold, that directory contained 4 files without owner write permission. All files have been updated to include this permission, and final checks are underway to confirm that this does indeed solve the issue. Closing the issue for now; will reopen if this turns out not to be the case.
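A check along these lines could flag such files before jobs are submitted. This is a minimal sketch, not the procedure actually used here: it assumes the random trigger files are visible as ordinary files under a local directory, and the function name is hypothetical.

```python
import os
import stat


def files_missing_owner_write(directory):
    """Return sorted paths of regular files under directory without the owner-write bit."""
    missing = []
    for root, _dirs, files in os.walk(directory):
        for name in files:
            path = os.path.join(root, name)
            mode = os.stat(path).st_mode
            # Only the permission bits are inspected, so the result does not
            # depend on which user runs the check.
            if stat.S_ISREG(mode) and not (mode & stat.S_IWUSR):
                missing.append(path)
    return sorted(missing)
```

Calling `files_missing_owner_write(path_to_random_trigger_directory)` returns the offending paths (the directory argument is a placeholder), and `chmod u+w` on each of them restores the owner write bit.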