SyneRBI / SIRF

Main repository for the CCP SynerBI software
http://www.ccpsynerbi.ac.uk

occasional Bad file descriptor in cGadgetron #641

Open KrisThielemans opened 4 years ago

KrisThielemans commented 4 years ago

This job https://travis-ci.org/github/SyneRBI/SIRF-SuperBuild/jobs/679959452#L16334 from https://github.com/SyneRBI/SIRF-SuperBuild/pull/377 (which is a DEVEL build) fails, while the others are fine. The error is in the MR test:

ERROR: test3.test_main
...
error: ??? "'write: Bad file descriptor' exception caught at line 545 of /Users/travis/build/SyneRBI/SIRF-SuperBuild/sources/SIRF/src/xGadgetron/cGadgetron/cgadgetron.cpp; the reconstruction engine output may provide more information"
-------------------- >> begin captured stdout << ---------------------
File: /Users/travis/build/SyneRBI/SIRF-SuperBuild/INSTALL/python/sirf/Gadgetron.py
Line: 1384
check_status found the following message sent from the engine:

I'll rerun the job, as I guess this won't happen again, but it is worrying nevertheless.

@evgueni-ovtchinnikov any ideas?

evgueni-ovtchinnikov commented 4 years ago

I have seen this error message in Travis logs many times, but the reported error has never happened locally, so it is impossible to investigate, I am afraid.

johannesmayer commented 4 years ago

I get this one, locally, from time to time. However, I cannot reproduce it.

KrisThielemans commented 4 years ago

Hmm, this is going to be tough then. Any ideas for writing some debugging checks and doing a special test run of, say, 1000 tests to see when it fails?
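
A minimal sketch of what such a special test run could look like, assuming the failing test can be wrapped in a callable. The names `Failure` and `run_repeatedly` are hypothetical and not part of SIRF; the idea is only to keep going after a failure and record which iterations failed and with what message.

```cpp
#include <cassert>
#include <cstddef>
#include <exception>
#include <functional>
#include <stdexcept>
#include <string>
#include <vector>

// Record of one failed run: which iteration failed and what was thrown.
// (Hypothetical helper type, not part of SIRF.)
struct Failure {
    std::size_t iteration;
    std::string message;
};

// Run `test` `num_runs` times and collect every exception,
// instead of stopping at the first failure.
std::vector<Failure> run_repeatedly(const std::function<void()>& test,
                                    std::size_t num_runs)
{
    std::vector<Failure> failures;
    for (std::size_t i = 0; i < num_runs; ++i) {
        try {
            test();
        }
        catch (const std::exception& e) {
            failures.push_back({i, e.what()});
        }
        catch (...) {
            failures.push_back({i, "unknown exception"});
        }
    }
    return failures;
}
```

Calling `run_repeatedly(test, 1000)` and printing the recorded iterations and messages would show how often the error occurs and whether the failures cluster, which may help narrow down the cause.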

rijobro commented 4 years ago

@evgueni-ovtchinnikov I haven't looked through the source code, but if this pertains to file writing, could you put it in a for loop (similar to what you already do when trying to connect to the Gadgetron server)?

#include <stdexcept>

bool success = false;
const unsigned num_attempts = 5;
// Retry the failing operation a few times before giving up.
for (unsigned i = 0; i < num_attempts; ++i) {
    try {
        success = do_the_thing_that_causes_the_error();
    }
    catch (...) {
        // Ignore the exception and try again.
    }
    if (success) break;
}
if (!success)
    throw std::runtime_error("bad file descriptor");

evgueni-ovtchinnikov commented 4 years ago

@johannesmayer: if you get this error when running your mrtest.cpp, then one possible culprit is your MRAcquisitionData::read, where you create ISMRMRD::Dataset and call its methods readHeader, getNumberOfAcquisitions and readAcquisition without Mutex locking/unlocking.

I have very little idea what Mutex does - something to do with multithreading - but I noticed Gadgetron was using it, so I just followed suit, see e.g. AcquisitionsFile::get_acquisition.

@rijobro: what you suggest looks like papering over the crack, I am afraid. I would try to investigate a bit more before resorting to your fallback.

evgueni-ovtchinnikov commented 4 years ago

added missing mutex locks/unlocks, HTH

rijobro commented 4 years ago

> I have very little idea what Mutex does - something to do with multithreading - but I noticed Gadgetron was using it, so I just followed suit, see e.g. AcquisitionsFile::get_acquisition.

A mutex is used to stop multiple threads from accessing the same files/variables simultaneously, which would otherwise lead to data races, etc.

So it could well be that adding the missing mutex locks solves the problem. Thanks.
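
For illustration, here is a minimal sketch of the idea using the standard library's `std::mutex` and `std::lock_guard`. `SharedCounter` and `count_from_threads` are hypothetical stand-ins for a shared resource such as an open dataset; this is not the locking mechanism used in SIRF or Gadgetron.

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for a resource shared between threads.
class SharedCounter {
public:
    void increment()
    {
        // Only one thread at a time can hold the lock; it is released
        // automatically when `lock` goes out of scope.
        std::lock_guard<std::mutex> lock(mutex_);
        ++value_;
    }
    long value() const
    {
        std::lock_guard<std::mutex> lock(mutex_);
        return value_;
    }
private:
    mutable std::mutex mutex_;
    long value_ = 0;
};

// Increment one shared counter from several threads and return the total.
long count_from_threads(unsigned num_threads, unsigned increments_per_thread)
{
    SharedCounter counter;
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t)
        workers.emplace_back([&counter, increments_per_thread]() {
            for (unsigned i = 0; i < increments_per_thread; ++i)
                counter.increment();
        });
    for (auto& w : workers)
        w.join();
    return counter.value();
}
```

Without the lock in `increment()`, the threads could interleave their reads and writes of `value_`, so the final count could come out lower than expected. The same reasoning applies to any shared handle that several threads use at once.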

rijobro commented 4 years ago

Bug still persisting (PR from today): https://travis-ci.org/github/SyneRBI/SIRF/jobs/703951360#L28836