JeffersonLab / halld_recon

Reconstruction for the GlueX Detector

DL1MCTrigger Crashes With mysql rcdb/ccdb #81

Open jonzarling opened 5 years ago

jonzarling commented 5 years ago

It appears that the brun portion of DL1MCTrigger is experiencing infrequent crashes when running over some MC data files using a mysql connection for rcdb (and ccdb). It works fine with local sqlite copies. It looks like it sometimes misreads character input, or something similar?

See conversation over at https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/gluex-software/DMd1h5hfoYk/kaYPTyawGAAJ for a little more detail.

sdobbs commented 5 years ago

Could you please post the entire log file for a job which crashes? It could be an error which is not properly handled.

Also, are you running at JLab or somewhere else?

jonzarling commented 5 years ago

I am running on an IU cluster. I've been trying to reproduce on the jlab farm, but I'm running into external headaches with configuring things there.

Here is an example of both a succeeded job and a failed one (the failures are pseudo-random; rerunning typically works without issue):

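photon_gun_failedjob.txt (https://github.com/JeffersonLab/halld_recon/files/2804105/photon_gun_failedjob.txt)
photon_gun_jobsucceeded.txt (https://github.com/JeffersonLab/halld_recon/files/2804106/photon_gun_jobsucceeded.txt)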

sdobbs commented 5 years ago

So now this crash is happening when running off of an SQLite file? Otherwise, I would suggest you never run using the master MySQL offsite - the available bandwidth is too small.

Side note: It is not good that part of your context is "calibtime=timegoeshere". I would leave this undefined if you don't have a variable - I don't know what CCDB would do with this sort of value.
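A minimal sketch of the "leave it undefined" advice: build the context string so that the calibtime key only appears when a real value exists. The helper name `build_calib_context`, the `CALIBTIME` variable, and the use of `JANA_CALIB_CONTEXT` with a `variation=mc` value are assumptions for illustration, not taken from the job configuration in this thread.

```shell
# Build a calibration context string.
# The calibtime key is emitted only when a non-empty value is passed,
# so a placeholder like "calibtime=timegoeshere" never reaches CCDB.
build_calib_context() {
    variation="$1"
    calibtime="${2:-}"   # optional second argument; empty means "omit the key"
    if [ -n "$calibtime" ]; then
        printf 'variation=%s calibtime=%s\n' "$variation" "$calibtime"
    else
        printf 'variation=%s\n' "$variation"
    fi
}

# Hypothetical usage (variable names are assumptions):
# export JANA_CALIB_CONTEXT="$(build_calib_context mc "$CALIBTIME")"
```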

Anyway, it's a little tough to find since you're running multithreaded, but the thread dies because of this exception:

Exception: FCAL channel is not in the translation table Crate = 14 Slot = 18 Channel = 15

Why this does not happen consistently is a question that should be looked into.

jonzarling commented 5 years ago

I've been seeing and ignoring the error

Exception: FCAL channel is not in the translation table Crate = 14 Slot = 18 Channel = 15

as it seems to happen 100% of the time for me. You can see it in both the succeeded and failed job logs, so I don't think that's at the heart of the issue.

Ah, so I see in my configuration I was actually using a sqlite copy of ccdb, so that should confirm that the problem is with mysql versions of rcdb. Again, I don't ever encounter this error if I use a local sqlite version of rcdb.

markito3 commented 4 years ago

@jonzarling, is this still a problem?

jonzarling commented 4 years ago

Some jobs I ran in the past week using the version_4.12.0.xml (tagged last December) appear to still be encountering this issue.

aaust commented 3 years ago

I am experiencing the same problem when I run one of the plugins monitoring_hists, danarest, or ReactionFilter over smeared MC files. The jobs fail with this error message:

JANA >>============================
JANA >> DL1MCTrigger: (brun)         --   line:281  /u/group/halld/Software/builds/Linux_CentOS7.7-x86_64-gcc4.8.5/jana/jana_0.7.9p1^ccdb167/Linux_CentOS7.7-x86_64-gcc4.8.5/include/JANA/JFactory.h
JANA >> DL1MCTrigger
JANA >> DTrigger: (evnt)             --   line:299  /u/group/halld/Software/builds/Linux_CentOS7.7-x86_64-gcc4.8.5/jana/jana_0.7.9p1^ccdb167/Linux_CentOS7.7-x86_64-gcc4.8.5/include/JANA/JFactory.h
JANA >> DTrigger
JANA >> JEventLoop:OneEvent  (evnt)  --   line:695  src/JANA/JEventLoop.cc
JANA >>----------------------------

It is not reproducible, but about half of the jobs fail the first try. After multiple retries, all jobs complete successfully.
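Since a rerun usually succeeds, a simple retry wrapper can serve as a stopgap until the root cause is found. This is only a sketch: the function name `retry` and the attempt count are made up for illustration, and it hides the crash rather than fixing it.

```shell
# Run a command up to MAX times, stopping at the first success.
# Usage: retry MAX command [args...]
retry() {
    max="$1"
    shift
    attempt=1
    status=1
    while [ "$attempt" -le "$max" ]; do
        "$@"
        status=$?
        # Stop as soon as the command exits cleanly.
        [ "$status" -eq 0 ] && return 0
        echo "attempt $attempt of $max failed with exit code $status" >&2
        attempt=$((attempt + 1))
    done
    # Every attempt failed: hand back the last exit code.
    return "$status"
}

# Hypothetical usage (input file name is a placeholder):
# retry 3 hd_root -PPLUGINS=monitoring_hists input_smeared.hddm
```

If a retry is used in production jobs, it is worth keeping the log of each failed attempt so the failure rate stays visible.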

I am using the latest software stack and these input files: /cache/halld/gluex_simulations/REQUESTED_MC/SDME_G4_FDC_20210131124244pm/hddm/*smeared.hddm

aaust commented 3 years ago

I can also confirm that the problem does not appear when reading RCDB from an sqlite file.

aaust commented 3 years ago

In a recent monitoring launch, I found a file that always fails with exactly the same error message as above. It ONLY happens when I run with more than 1 thread and when I read CCDB from an sqlite file. I also use halld_recon from the current master and RCDB from the mysql server.

How to reproduce?

export JANA_CALIB_URL=sqlite:////group/halld/www/halldweb/html/dist/ccdb.sqlite

hd_root -PNTHREADS=2 -PPLUGINS=monitoring_hists /cache/halld/RunPeriod-2021-08/rawdata/Run081389/hd_rawdata_081389_000.evio

Crashes after 67 events

sdobbs commented 3 years ago

Out of curiosity, do you still see these crashes if you copy the SQLite file to a local filesystem (e.g. /scratch)? I tried this and didn't see the crashes, but it would be good to confirm.

I think there are improvements that could be made to the DL1MCTrigger factory, but I think all of these crashes are consistent with the connection to some database timing out...

aaust commented 3 years ago

That's right, I can confirm that the crash does not appear when reading the sqlite file from scratch.
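For anyone wanting to apply the same workaround in a job script, here is a rough sketch of staging the SQLite file on node-local disk before starting hd_root. The helper name `stage_sqlite` and the `/scratch/$USER` destination are assumptions; only the source path and the `JANA_CALIB_URL` variable come from the reproduction steps above.

```shell
# Copy a SQLite file into a local directory and print an sqlite:/// URL for the copy.
# Usage: stage_sqlite SOURCE_FILE DEST_DIR   (DEST_DIR should be an absolute path)
stage_sqlite() {
    src="$1"
    dest_dir="$2"
    mkdir -p "$dest_dir" || return 1
    dest="$dest_dir/$(basename "$src")"
    cp "$src" "$dest" || return 1
    # With an absolute path this yields four slashes, e.g. sqlite:////scratch/...
    printf 'sqlite:///%s\n' "$dest"
}

# Hypothetical usage (destination directory is an assumption):
# export JANA_CALIB_URL="$(stage_sqlite /group/halld/www/halldweb/html/dist/ccdb.sqlite "/scratch/$USER")"
```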

aaust commented 8 months ago

This error is back! I am reconstructing 2017 data with the latest version_5.14.2.xml. I am reading the calibrations from a CCDB sqlite file that was copied to the local disk, but a significant fraction of the jobs fail (~10-30%). This was not the case with version_4.24.0.xml.

Here are the full stdout and stderr messages again:

JANA >>============================
JANA >>============================
JANA >> DL1MCTrigger: (brun)         --   line:281  /u/group/halld/Software/builds/Linux_CentOS7.7-x86_64-gcc4.8.5/jana/jana_0.8.2^ccdb1610/Linux_CentOS7.7-x86_64-gcc4.8.5/include/JANA/JFactory.h
JANA >> DL1MCTrigger: (brun)         --   line:281  /u/group/halld/Software/builds/Linux_CentOS7.7-x86_64-gcc4.8.5/jana/jana_0.8.2^ccdb1610/Linux_CentOS7.7-x86_64-gcc4.8.5/include/JANA/JFactory.h
JANA >> DL1MCTrigger                 
JANA >> DL1MCTrigger                 
JANA >> DTrigger: (evnt)             --   line:299  /u/group/halld/Software/builds/Linux_CentOS7.7-x86_64-gcc4.8.5/jana/jana_0.8.2^ccdb1610/Linux_CentOS7.7-x86_64-gcc4.8.5/include/JANA/JFactory.h
JANA >> DTrigger: (evnt)             --   line:299  /u/group/halld/Software/builds/Linux_CentOS7.7-x86_64-gcc4.8.5/jana/jana_0.8.2^ccdb1610/Linux_CentOS7.7-x86_64-gcc4.8.5/include/JANA/JFactory.h
JANA >> DTrigger                     
JANA >> DTrigger                     
JANA >> JEventLoop:OneEvent  (evnt)  --   line:695  src/JANA/JEventLoop.cc
JANA >>----------------------------
JANA >> JEventLoop:OneEvent  (evnt)  --   line:695  src/JANA/JEventLoop.cc
JANA >>----------------------------
JANA >>Telling all threads to quit ...
src/JANA/JEventLoop.cc:698 ESCsrc/JANA/JEventLoop.cc:698 ESC[1m[1m EXCEPTION : std::exceptionESC[0m
 EXCEPTION : std::exceptionESC[0m
src/JANA/JApplication.cc:1386 ESC[1msrc/JANA/JApplication.cc: EXCEPTION caught for thread 140467870082816 : std::exceptionESC[0m1386 ESC[1m
 EXCEPTION caught for thread 140468700567296 : std::exceptionESC[0m
JANA ERROR>> 
JANA ERROR>> Automatic relaunching of threads is disabled. If you wish to
JANA ERROR>> have the program relaunch a replacement thread when a stalled
JANA ERROR>> one is killed, set the JANA:MAX_RELAUNCH_THREADS configuration
JANA ERROR>> parameter to a value greater than zero. E.g.:
JANA ERROR>> 
JANA ERROR>>     jana -PJANA:MAX_RELAUNCH_THREADS=10
JANA ERROR>> 
JANA ERROR>> The program will quit now.
Exit code: 70
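
To measure the failure rate across a batch of jobs, one option is to scan each job log for the stack-trace line shown above. This is a sketch only: the function name `has_brun_crash` and the log file layout are assumptions, and the match string is taken directly from the output pasted in this thread.

```shell
# Return 0 if the given log contains the DL1MCTrigger brun stack-trace line.
# Usage: has_brun_crash LOGFILE
has_brun_crash() {
    # -F matches the text literally, so the parentheses need no escaping.
    grep -qF 'DL1MCTrigger: (brun)' "$1"
}

# Hypothetical usage: list the logs in a directory that show this crash.
# for log in logs/*.out; do
#     has_brun_crash "$log" && echo "$log"
# done
```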