JeffersonLab / halld_recon

Reconstruction for the GlueX Detector

Analysis stops at ~500k events for Run 40129 #385

Closed keigo-miz closed 4 years ago

keigo-miz commented 4 years ago

Run 40129 is a cosmic-ray (field-OFF) run.

When I analyze this run, hd_root stalls after processing ~530k events: the event rate simply drops to 0.0 Hz without any error message.

The output of the command is shown below.

[ifarm1901:~]$ hd_root /cache/halld/RunPeriod-2018-01/rawdata/Run040129/hd_rawdata_040129_000.evio 
JANA >>OUTPUT_FILENAME: hd_root.root
Opened ROOT file "hd_root.root" ...
JANA >>Launching threads .
JANA >>Opening source "/cache/halld/RunPeriod-2018-01/rawdata/Run040129/hd_rawdata_040129_000.evio" of type: EVIOpp  - Reads EVIO formatted data from file or ET system
loading VERSION 3
JANA >>Control event: Prestart - Thu Nov 16 19:35:22 2017
JANA >>Control event: Go - Thu Nov 16 19:35:38 2017
  1.0 events processed  (2.0 events read)  2.0Hz  (avg.: 0.0Hz)     

JANA >>
JANA >> --- Configuration Parameters --
JANA >> THREAD_TIMEOUT = 30 seconds
JANA >> -------------------------------
  537.4k events processed  (537.4k events read)  0.0Hz  (avg.: 1.8kHz) 

sdobbs commented 4 years ago

My guess is that there's something weird with the DAQ in this run. I ran it under a debugger, and it seems like we've hit a thread deadlock (see typical backtraces below). This is unusual, so we should look more closely at the events around this region.

Also, does anyone know why only every fourth file is saved on tape?

Thread 19 (Thread 0x7fffcdffb700 (LWP 289259)):
#0  0x00007ffff160cda2 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x0000000000f81834 in gthread_cond_timedwait (abs_timeout=0x7fffcdff43f0, mutex=<optimized out>, cond=0x7fffd00020f0) at /usr/include/c++/4.8.2/x86_64-redhat-linux/bits/gthr-default.h:871
#2  wait_until_impl<std::chrono::duration<long, std::ratio<1l, 1000000000l> > > (atime=..., __lock=..., this=0x7fffd00020f0) at /usr/include/c++/4.8.2/condition_variable:160
#3  wait_until<std::chrono::duration<long, std::ratio<1l, 1000000000l> > > (atime=..., lock=..., this=0x7fffd00020f0) at /usr/include/c++/4.8.2/condition_variable:100
#4  wait_for<long, std::ratio<1l, 1000l> > (rtime=..., lock=..., this=0x7fffd00020f0) at /usr/include/c++/4.8.2/condition_variable:132
#5  DEVIOWorkerThread::Run (this=0x7fffd0002030) at libraries/DAQ/DEVIOWorkerThread.cc:98
#6  0x00007ffff11ab070 in ?? () from /lib64/libstdc++.so.6
#7  0x00007ffff1608e65 in start_thread () from /lib64/libpthread.so.0
#8  0x00007ffff090e88d in clone () from /lib64/libc.so.6

Thread 18 (Thread 0x7fffcd0b8700 (LWP 289258)):
#0  0x00007ffff160c9f5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007ffff11a782c in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /lib64/libstdc++.so.6
#2  0x0000000000f6d14b in async_filebuf::underflow (this=0x7fffd0004840) at libraries/DAQ/async_filebuf.cc:145
#3  0x0000000000f6c9f8 in async_filebuf::seekpos (this=0x7fffd0004840, pos=..., which=std::_S_in) at libraries/DAQ/async_filebuf.cc:208
#4  0x0000000000f6c75b in async_filebuf::seekoff (this=0x7fffd0004840, off=19999996848, way=<optimized out>, which=std::_S_in) at libraries/DAQ/async_filebuf.cc:169
#5  0x00007ffff1176058 in std::istream::seekg(long, std::_Ios_Seekdir) () from /lib64/libstdc++.so.6
#6  0x0000000000f2b737 in HDEVIO::readNoFileBuff (this=0x7fffd0004ae0, user_buff=0x7fff841be620, user_buff_len=347372, allow_swap=allow_swap@entry=false) at libraries/DAQ/HDEVIO.cc:530
#7  0x0000000000f5f1eb in JEventSource_EVIOpp::Dispatcher (this=0x7fffd0000f70) at libraries/DAQ/JEventSource_EVIOpp.cc:372
#8  0x00007ffff11ab070 in ?? () from /lib64/libstdc++.so.6
#9  0x00007ffff1608e65 in start_thread () from /lib64/libpthread.so.0
#10 0x00007ffff090e88d in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x7fffd6925700 (LWP 289223)):
#0  0x00007ffff160cda2 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x0000000000f5ea4c in gthread_cond_timedwait (abs_timeout=0x7fffd691e340, mutex=0x7fffd0001038, cond=0x7fffd0001060) at /usr/include/c++/4.8.2/x86_64-redhat-linux/bits/gthr-default.h:871
#2  wait_until_impl<std::chrono::duration<long, std::ratio<1l, 1000000000l> > > (atime=..., __lock=<optimized out>, this=0x7fffd0001060) at /usr/include/c++/4.8.2/condition_variable:160
#3  wait_until<std::chrono::duration<long, std::ratio<1l, 1000000000l> > > (atime=..., lock=<optimized out>, this=0x7fffd0001060) at /usr/include/c++/4.8.2/condition_variable:100
#4  wait_for<long, std::ratio<1l, 1000l> > (rtime=..., lock=<optimized out>, this=0x7fffd0001060) at /usr/include/c++/4.8.2/condition_variable:132
#5  JEventSource_EVIOpp::GetEvent (this=0x7fffd0000f70, event=...) at libraries/DAQ/JEventSource_EVIOpp.cc:558
#6  0x00000000010db8e7 in jana::JEventSource::GetEvent (this=0x7fffd0000f70, event=...) at src/JANA/JEventSource.cc:54
#7  0x00000000010bfaba in jana::JApplication::ReadEvent (this=0x7fffffffa970, event=...) at src/JANA/JApplication.cc:824
#8  0x00000000010bf810 in jana::JApplication::EventBufferThread (this=this@entry=0x7fffffffa970) at src/JANA/JApplication.cc:753
#9  0x00000000010bfa99 in LaunchEventBufferThread (arg=0x7fffffffa970) at src/JANA/JApplication.cc:666
#10 0x00007ffff1608e65 in start_thread () from /lib64/libpthread.so.0
#11 0x00007ffff090e88d in clone () from /lib64/libc.so.6
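
In the backtraces above, Threads 19 and 2 are in timed waits (pthread_cond_timedwait via wait_for), while Thread 18 is in an untimed std::condition_variable::wait inside async_filebuf::underflow. To illustrate why that difference can matter, here is a minimal C++ sketch. It is not taken from halld_recon or JANA: SharedBuffer, wait_for_data, and publish_data are hypothetical names, and the sketch makes no claim about what actually went wrong in this run. It only shows that a timed wait with a predicate returns false when no notification arrives, so the caller can report a stall instead of blocking silently.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical stand-in for state shared between a reader thread and a consumer.
struct SharedBuffer {
    std::mutex mtx;
    std::condition_variable cv;
    bool data_ready = false;
};

// Wait until data is ready or the timeout expires.
// Returns true if data arrived, false if the wait timed out.
bool wait_for_data(SharedBuffer& buf, std::chrono::milliseconds timeout) {
    assert(timeout.count() >= 0);
    std::unique_lock<std::mutex> lock(buf.mtx);
    // The predicate guards against spurious wakeups and against a
    // notification that was sent before this thread started waiting.
    return buf.cv.wait_for(lock, timeout, [&buf] { return buf.data_ready; });
}

// Mark data as ready and wake any waiting threads.
void publish_data(SharedBuffer& buf) {
    {
        std::lock_guard<std::mutex> lock(buf.mtx);
        buf.data_ready = true;
    }
    buf.cv.notify_all();
}
```

With this pattern, a caller that gets false back from wait_for_data could log the stall or dump diagnostic state, whereas an untimed wait on a condition that is never signalled would never return.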

aaust commented 4 years ago

Looks like there were some tests with the DAQ during this time:

https://logbooks.jlab.org/entry/3492661

sdobbs commented 4 years ago

Is there another cosmic run that Keigo can use instead?

keigo-miz commented 4 years ago

Hi Sean and Alex, thank you for your comments.

I think I can use Runs 40236 and 40344 for CDC alignment if Run 40129 is problematic. I'm going to check these runs.

sdobbs commented 4 years ago

Hi @keigo-miz, did these other runs work for your purposes? Can we close this issue?

keigo-miz commented 4 years ago

Hi, both runs process without stalling. I'll close this issue. Thanks.