JeffersonLab / halld_recon

Reconstruction for the GlueX Detector
hd_root hangs at the end of evio file with is_valid_run_end = false #418

Closed nsjarvis closed 2 years ago

nsjarvis commented 4 years ago

hd_root hangs at the end of (cosmics) run 72468 file 11 and RCDB shows that it ended with is_valid_run_end = false.

sdobbs commented 4 years ago

I can reproduce this, but unfortunately not when running it in a debugger. I do notice that a thread seems to die early in the processing, which is usually a sign of some EVIO processing problem. Maybe problems with the cleanup are causing the hang.

faustus123 commented 4 years ago

I am seeing this as well. I first saw it while running hdskims on run 71851 file 122, and I was able to reproduce the error with Naomi's file for run 72468 file 11. In both cases I was able to attach a debugger and peek at the stacks. The problem seems to be a deadlock in async_filebuf.cc, with threads stuck at lines 109 and 145. Both are waiting on a notify event on the readloop_lock member of the async_filebuf class, but no other thread is running that can send it. As in the original report, the hang occurs at the end of processing the file.
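For reference, here is a minimal sketch of the general pattern that avoids this kind of hang. It is not the async_filebuf code: `BlockQueue`, `consume_all`, and all member names are hypothetical, and it assumes a simple one-reader, one-consumer hand-off. The idea is that every wait uses a predicate that also checks a "closed" flag, and the reader sets that flag and notifies before it exits, so nobody is left waiting on a notify that can no longer arrive.

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>

// Hypothetical stand-in for a reader/consumer hand-off (not async_filebuf).
class BlockQueue {
public:
    // Reader side: hand over one block and wake a waiting consumer.
    void push(int block) {
        {
            std::lock_guard<std::mutex> guard(mutex_);
            blocks_.push(block);
        }
        ready_.notify_one();
    }

    // Reader side: signal end of input (EOF or a truncated file).
    void close() {
        {
            std::lock_guard<std::mutex> guard(mutex_);
            closed_ = true;
        }
        ready_.notify_all();  // wake every waiter so none is left behind
    }

    // Consumer side: returns false once the queue is closed and drained.
    bool pop(int& block) {
        std::unique_lock<std::mutex> lock(mutex_);
        // The predicate is re-checked under the lock, so a notify sent
        // before this call cannot be lost.
        ready_.wait(lock, [this] { return !blocks_.empty() || closed_; });
        if (blocks_.empty()) return false;
        block = blocks_.front();
        blocks_.pop();
        return true;
    }

private:
    std::mutex mutex_;
    std::condition_variable ready_;
    std::queue<int> blocks_;
    bool closed_ = false;
};

// Pushes n_blocks blocks from a reader thread, then closes the queue.
// Returns how many blocks the consumer received before seeing the close.
int consume_all(int n_blocks) {
    BlockQueue queue;
    std::thread reader([&queue, n_blocks] {
        for (int i = 0; i < n_blocks; ++i) queue.push(i);
        queue.close();
    });
    int consumed = 0;
    int block = 0;
    while (queue.pop(block)) ++consumed;
    reader.join();
    return consumed;
}
```

With this shape, `consume_all(3)` returns after three blocks instead of blocking on a fourth `pop`, because the final `close()` satisfies the wait predicate even when the queue is empty.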

The async_filebuf class was written by Richard so I will assign this to him to have a look at it. I'm attaching the backtraces of all 3 threads in one of the hung processes for reference.

tmp.txt

faustus123 commented 4 years ago

BTW: Richard, while you're in there fixing this...

There is another problem: the program hangs at start-up if a nonexistent file is given to the HDEVIO constructor. This results in a similar deadlock, but with both threads at line 145 of async_filebuf.cc. The underlying problem, though, is in the HDEVIO constructor: it only checks whether opening "/dev/null" succeeded, not whether the actual file was opened. Thus, it always sets the is_open flag and does not indicate to the caller that an error occurred.
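To illustrate the kind of check I mean, here is a rough sketch. This is not the real HDEVIO class: `EvioInput` and `err_mess` are hypothetical names, and it assumes a plain `std::ifstream` rather than the async_filebuf-backed stream. The only point is that the is_open flag is taken from the stream opened on the requested file, so a missing file is reported to the caller.

```cpp
#include <fstream>
#include <string>

// Hypothetical wrapper used only to illustrate the check (not HDEVIO).
class EvioInput {
public:
    explicit EvioInput(const std::string& filename)
        : stream_(filename, std::ios::binary) {
        // Derive the flag from the stream that was opened on the
        // requested file, not from some unrelated placeholder stream.
        is_open = stream_.is_open();
        if (!is_open) {
            err_mess = "Unable to open file: " + filename;
        }
    }

    bool is_open = false;   // true only if the requested file was opened
    std::string err_mess;   // empty unless the open failed

private:
    std::ifstream stream_;
};
```

A caller can then test `is_open` right after construction and bail out with `err_mess` before any reader thread is started.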

faustus123 commented 4 years ago

I am now seeing some stalls on NERSC jobs for partial file processing. For Run 71717, file 261 parts 0-3 ran OK, but parts 4-8 all timed out several times. I launched an interactive job so I could check the stack trace and saw threads stuck on the same line numbers (109 and 145) of async_filebuf.cc.

rjones30 commented 4 years ago

David,

How do I reproduce this? Do I need to give it a non-existent input file? Some particular malignant evio input file? Some set of input run numbers?

-Richard Jones

On Fri, Jul 31, 2020 at 9:50 AM David Lawrence wrote:

I am now seeing some stalls on NERSC jobs for partial file processing. For Run 71717, file 261 parts 0-3 ran OK, but parts 4-8 all timed out several times. I launched an interactive job so I could check the stack trace and saw threads stuck on the same line numbers (109 and 145) of async_filebuf.cc.

nsjarvis commented 4 years ago

In the original post I gave the details of the evio file where I found the problem. I didn't realise it was malignant :D

nsjarvis commented 4 years ago

It's still in cache:

/cache/halld/RunPeriod-2019-11/rawdata/Run072468/hd_rawdata_072468_011.evio

nsjarvis commented 2 years ago

I checked this today with version set 5.5.0. It processed the file and exited properly after printing 'Error reading EVIO block header (at EOF - truncated?)'