Open jonzarling opened 5 years ago
Could you please post the entire log file for a job which crashes? It could be an error which is not properly handled.
Also, are you running at JLab or somewhere else?
I am running on an IU cluster. I've been trying to reproduce on the jlab farm, but I'm running into external headaches with configuring things there.
Here is an example of both a succeeded job and a failed one (the failures are pseudo-random; rerunning typically works without issue):
So now this crash is happening when running off of an SQLite file? Otherwise, I would suggest you never run using the master MySQL offsite - the available bandwidth is too small.
Side note: It is not good that part of your context is "calibtime=timegoeshere". I would leave this undefined if you don't have a variable - I don't know what CCDB would do with this sort of value.
Anyway, it's a little tough to find since you're running multithreaded, but the thread dies because of this exception:
Exception: FCAL channel is not in the translation table Crate = 14 Slot = 18 Channel = 15
Why this does not happen consistently is a question that should be looked into.
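On the side note about "calibtime=timegoeshere": one way to avoid passing a placeholder is to build the context string so that calibtime only appears when a real value exists. This is a minimal sketch, not something from the job scripts in this thread. The helper name build_calib_context, the "variation=" key, and the use of the JANA_CALIB_CONTEXT environment variable are assumptions about how the job is configured.

```shell
# Hypothetical helper: build a calibration context string, adding calibtime
# only when a value was actually supplied.
build_calib_context() {
    variation=$1
    calibtime=${2:-}
    context="variation=$variation"
    # Leave calibtime out entirely rather than passing a placeholder value.
    if [ -n "$calibtime" ]; then
        context="$context calibtime=$calibtime"
    fi
    printf '%s\n' "$context"
}

# Example (the variation name "mc" is an assumption about the job setup):
#   JANA_CALIB_CONTEXT=$(build_calib_context mc) && export JANA_CALIB_CONTEXT
```

With no second argument the function prints only the variation part, so an unset time never reaches CCDB as a literal placeholder.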
On Mon, Jan 28, 2019 at 12:23 PM Jonathan Zarling notifications@github.com wrote:
photon_gun_failedjob.txt https://github.com/JeffersonLab/halld_recon/files/2804105/photon_gun_failedjob.txt photon_gun_jobsucceeded.txt https://github.com/JeffersonLab/halld_recon/files/2804106/photon_gun_jobsucceeded.txt
I've been seeing and ignoring the error
Exception: FCAL channel is not in the translation table Crate = 14 Slot = 18 Channel = 15
as it seems to happen 100% of the time for me. You can see it in both the succeeded and failed job logs, so I don't think that's at the heart of the issue.
Ah, I see that my configuration was actually using an SQLite copy of CCDB, which should confirm that the problem is with the MySQL version of RCDB. Again, I never encounter this error if I use a local SQLite version of RCDB.
@jonzarling, is this still a problem?
Some jobs I ran in the past week using the version_4.12.0.xml (tagged last December) appear to still be encountering this issue.
I am experiencing the same problem when I run one of the plugins monitoring_hists, danarest, or ReactionFilter over smeared MC files. The jobs fail with this error message:
JANA >>============================
JANA >> DL1MCTrigger: (brun) -- line:281 /u/group/halld/Software/builds/Linux_CentOS7.7-x86_64-gcc4.8.5/jana/jana_0.7.9p1^ccdb167/Linux_CentOS7.7-x86_64-gcc4.8.5/include/JANA/JFactory.h
JANA >> DL1MCTrigger
JANA >> DTrigger: (evnt) -- line:299 /u/group/halld/Software/builds/Linux_CentOS7.7-x86_64-gcc4.8.5/jana/jana_0.7.9p1^ccdb167/Linux_CentOS7.7-x86_64-gcc4.8.5/include/JANA/JFactory.h
JANA >> DTrigger
JANA >> JEventLoop:OneEvent (evnt) -- line:695 src/JANA/JEventLoop.cc
JANA >>----------------------------
It is not reproducible, but about half of the jobs fail the first try. After multiple retries, all jobs complete successfully.
I am using the latest software stack and these input files: /cache/halld/gluex_simulations/REQUESTED_MC/SDME_G4_FDC_20210131124244pm/hddm/*smeared.hddm
I can also confirm that the problem does not appear when reading RCDB from an sqlite file.
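Since the jobs above all complete after enough retries, a wrapper along these lines could automate the resubmission until the underlying cause is fixed. It is only a workaround sketch: the function name retry is hypothetical, and it simply reruns any command that exits with a non-zero status.

```shell
# Hypothetical helper: run a command up to max_attempts times and stop at the
# first success. Returns the exit status of the last attempt on failure.
retry() {
    max_attempts=$1
    shift
    attempt=1
    status=1
    while [ "$attempt" -le "$max_attempts" ]; do
        status=0
        "$@" || status=$?
        if [ "$status" -eq 0 ]; then
            return 0
        fi
        echo "attempt $attempt/$max_attempts failed with exit code $status" >&2
        attempt=$((attempt + 1))
    done
    return "$status"
}

# Example (the input file name is a placeholder):
#   retry 3 hd_root -PPLUGINS=monitoring_hists input_smeared.hddm
```

This hides the failure rather than explaining it, so it is worth keeping the stderr messages from the failed attempts for later debugging.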
In a recent monitoring launch, I found a file that always fails with exactly the same error message above. It ONLY happens when I run with more than 1 thread and when I read CCDB from an sqlite file. I also use the halld_recon from the current master and RCDB from the mysql server.
How to reproduce?
export JANA_CALIB_URL=sqlite:////group/halld/www/halldweb/html/dist/ccdb.sqlite
hd_root -PNTHREADS=2 -PPLUGINS=monitoring_hists /cache/halld/RunPeriod-2021-08/rawdata/Run081389/hd_rawdata_081389_000.evio
Crashes after 67 events
Out of curiosity, do you still see these crashes if you copy the SQLite file to a local filesystem (e.g. /scratch)? I tried this and didn't see the crashes, but it would be good to confirm.
I think there are improvements that could be made to the DL1MCTrigger factory, but I think all of these crashes are consistent with the connection to some database timing out...
That's right, I can confirm that the crash does not appear when reading the sqlite file from scratch.
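For anyone who wants to repeat the local-copy test, the staging step could look roughly like this. It is a sketch under assumptions: stage_sqlite is a hypothetical helper, the destination directory must be an absolute path, and /scratch/$USER in the usage comment is a guess at a suitable local directory.

```shell
# Hypothetical helper: copy an SQLite file to a local directory and print a
# sqlite:/// URL pointing at the copy.
stage_sqlite() {
    src=$1
    dest_dir=$2
    mkdir -p "$dest_dir" || return 1
    dest="$dest_dir/$(basename "$src")"
    cp "$src" "$dest" || return 1
    # sqlite:/// followed by an absolute path, the same form used in the
    # reproduction recipe above.
    printf 'sqlite:///%s\n' "$dest"
}

# Example (the scratch location is an assumption; adjust for your site):
#   JANA_CALIB_URL=$(stage_sqlite /group/halld/www/halldweb/html/dist/ccdb.sqlite "/scratch/$USER") && export JANA_CALIB_URL
#   hd_root -PNTHREADS=2 -PPLUGINS=monitoring_hists /cache/halld/RunPeriod-2021-08/rawdata/Run081389/hd_rawdata_081389_000.evio
```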
This error is back! I am reconstructing 2017 data with the latest version_5.14.2.xml. I am reading the calibrations from a CCDB SQLite file that was copied to the local disk, but a significant fraction of the jobs fail (~10-30%). This was not the case with version_4.24.0.xml.
Here are the full stdout and stderr messages again:
JANA >>============================
JANA >>============================
JANA >> DL1MCTrigger: (brun) -- line:281 /u/group/halld/Software/builds/Linux_CentOS7.7-x86_64-gcc4.8.5/jana/jana_0.8.2^ccdb1610/Linux_CentOS7.7-x86_64-gcc4.8.5/include/JANA/JFactory.h
JANA >> DL1MCTrigger: (brun) -- line:281 /u/group/halld/Software/builds/Linux_CentOS7.7-x86_64-gcc4.8.5/jana/jana_0.8.2^ccdb1610/Linux_CentOS7.7-x86_64-gcc4.8.5/include/JANA/JFactory.h
JANA >> DL1MCTrigger
JANA >> DL1MCTrigger
JANA >> DTrigger: (evnt) -- line:299 /u/group/halld/Software/builds/Linux_CentOS7.7-x86_64-gcc4.8.5/jana/jana_0.8.2^ccdb1610/Linux_CentOS7.7-x86_64-gcc4.8.5/include/JANA/JFactory.h
JANA >> DTrigger: (evnt) -- line:299 /u/group/halld/Software/builds/Linux_CentOS7.7-x86_64-gcc4.8.5/jana/jana_0.8.2^ccdb1610/Linux_CentOS7.7-x86_64-gcc4.8.5/include/JANA/JFactory.h
JANA >> DTrigger
JANA >> DTrigger
JANA >> JEventLoop:OneEvent (evnt) -- line:695 src/JANA/JEventLoop.cc
JANA >>----------------------------
JANA >> JEventLoop:OneEvent (evnt) -- line:695 src/JANA/JEventLoop.cc
JANA >>----------------------------
JANA >>Telling all threads to quit ...
src/JANA/JEventLoop.cc:698 EXCEPTION : std::exception
src/JANA/JEventLoop.cc:698 EXCEPTION : std::exception
src/JANA/JApplication.cc:1386 EXCEPTION caught for thread 140467870082816 : std::exception
src/JANA/JApplication.cc:1386 EXCEPTION caught for thread 140468700567296 : std::exception
JANA ERROR>>
JANA ERROR>> Automatic relaunching of threads is disabled. If you wish to
JANA ERROR>> have the program relaunch a replacement thread when a stalled
JANA ERROR>> one is killed, set the JANA:MAX_RELAUNCH_THREADS configuration
JANA ERROR>> parameter to a value greater than zero. E.g.:
JANA ERROR>>
JANA ERROR>> jana -PJANA:MAX_RELAUNCH_THREADS=10
JANA ERROR>>
JANA ERROR>> The program will quit now.
Exit code: 70
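To estimate what fraction of jobs hit this particular failure, the job logs could be scanned for the DL1MCTrigger brun line that appears in the traceback above. This is a rough sketch: the function name has_l1mc_brun_crash and the logs/*.out layout in the usage comment are assumptions, and the match only looks for that one signature.

```shell
# Hypothetical helper: succeed if a log file contains the DL1MCTrigger brun
# traceback line seen in the failed jobs.
has_l1mc_brun_crash() {
    grep -F -q 'DL1MCTrigger: (brun)' "$1"
}

# Example (the log directory and file pattern are placeholders):
#   for log in logs/*.out; do
#       if has_l1mc_brun_crash "$log"; then echo "$log"; fi
#   done
```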
It appears that the brun portion of DL1MCTrigger is experiencing infrequent crashes when running over some MC data files using a MySQL connection for RCDB (and CCDB). It works fine with local SQLite copies. Perhaps it sometimes misreads character input?
See conversation over at https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/gluex-software/DMd1h5hfoYk/kaYPTyawGAAJ for a little more detail.