Closed - nsjarvis closed this issue 2 years ago
I can reproduce this, but unfortunately not when running under a debugger. I do notice that a thread seems to die early on in the processing, which is usually the sign of an EVIO processing problem; perhaps problems during cleanup are causing this to hang.
I am seeing this as well. I actually saw it while running hdskims on run 71851 file 122, and I was able to reproduce the error with Naomi's file for run 72468 file 11. In both cases I attached a debugger and inspected the stacks. The problem appears to be a deadlock in async_filebuf.cc between lines 109 and 145. Both threads are waiting for a notify on the readloop_lock member of the async_filebuf class, but no other thread is still running that could send it. I also see this at the end of processing the file.
The async_filebuf class was written by Richard so I will assign this to him to have a look at it. I'm attaching the backtraces of all 3 threads in one of the hung processes for reference.
BTW: Richard, while you're in there fixing this...
There is another problem with hangs at program start if a nonexistent file is given to the HDEVIO constructor. This results in a similar deadlock, but with both threads at line 145 of async_filebuf.cc. The root cause, though, is in the HDEVIO constructor: it only checks whether opening "/dev/null" succeeded, not whether opening the actual file did. Thus it always sets the is_open flag and never indicates to the caller that an error occurred.
I am now seeing some stalls on NERSC jobs for partial file processing. For Run 71717, file 261 parts 0-3 ran OK, but parts 4-8 all timed out several times. I launched an interactive job so I could check the stack trace and saw threads stuck on the same line numbers (109 and 145) of async_filebuf.cc.
David,
How do I reproduce this? Do I need to give it a non-existent input file? Some particular malignant evio input file? Some set of input run numbers?
-Richard Jones
In the original post I gave the details of the evio file where I found the problem. I didn't realise it was malignant :D
It's still in cache:
/cache/halld/RunPeriod-2019-11/rawdata/Run072468/hd_rawdata_072468_011.evio
I checked this today with version set 5.5.0. It processed the file and exited properly after printing 'Error reading EVIO block header (at EOF - truncated?)'.
hd_root hangs at the end of (cosmics) run 72468 file 11 and RCDB shows that it ended with is_valid_run_end = false.