debaetsd opened 3 years ago
I'm looking into "3 - oiiotool (Timeout)". That is an MT deadlock that is most likely responsible for the other timed-out tests.
A few tips that may help in working our way through these tests:
First, when CI fails, our scripts are set up to create an "artifact" downloadable from the GitHub page that gives the status summary of the CI run. The artifact is a zip file containing the output of any individual tests that failed. Downloading this zip file and unpacking it locally is a good way to inspect exactly what the failed test output looked like (especially if you are investigating a test that seems to only fail during CI but you can't reproduce it locally). The actual test output will be in (say) `build/<platform>/testsuite/<testname>`, and the reference output we are trying to match will be in the `ref/` subdirectory underneath that.
Second, many of these "failures" (not the timeouts, but perhaps many of the others) are running correctly, but simply failed to exactly match the test output. This can happen for a number of reasons that are benign (for example, decoding some lossy file formats may get occasional single bit differences depending on the platform and the version of the decoding software). In this case, the "fix" is merely updating the reference output so that there is a match. But of course the trick is that you don't want to simply replace the existing reference output with the new Windows one, because then of course it may not match on Linux and those tests will start failing.
The solution to reference output differing from platform to platform is that our test scripts are written so that a test passes if the output matches any of the files found in the `ref/` subdirectory of the test. Thus, for most tests you will see an `out.txt` or the like, but sometimes you will see additional files like `out-osx.txt` or `out-libfoo3.4.txt`. So for many of these "failing" tests where the output seems plausibly correct, but is merely slightly different on Windows, it's fine to fix by simply adding a `testsuite/<testname>/ref/out-win.txt`.
When the outputs are images, we're using OIIO's `idiff` with flags that allow just a teensy bit of low-bit error in a few pixels. Sometimes a failure is because it's slightly exceeding that. If so, you may be able to keep the existing reference output but just adjust the test to allow more slightly differing pixels. You can see how this is done here, where we have raised the image matching threshold quite a bit because libraw's debayering seems to change a lot from version to version (making different, but usually perceptually indistinguishable, outputs). Be gentle when raising these thresholds; you wouldn't want to inadvertently make it impossible to notice a real problem in the future because the images are allowed to differ by too much. In the cases where a small amount of diddling with the diff thresholds is not enough to allow the CI and reference outputs to match, you can just check in a new output test image variant (as with the text files), giving it a name that distinguishes it in a self-documenting way. For example, you can see in `testsuite/oiiotool-text/ref` that several of the output images have multiple reference variants, due to the appearance of the text in the image varying a bit from one version of freetype to another.
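For reference, the tolerance-based comparison that `idiff` performs is also exposed through OIIO's public C++ API, so here is a rough sketch of what "allow a small amount of low-bit error in a few pixels" means in practice. This is not the test harness itself; the file names and threshold values are made up, and it assumes the OIIO 2.x `ImageBufAlgo::compare()` signature.

```cpp
#include <OpenImageIO/imagebuf.h>
#include <OpenImageIO/imagebufalgo.h>
#include <iostream>

int main()
{
    OIIO::ImageBuf result("out.exr");         // freshly generated test output (example name)
    OIIO::ImageBuf reference("ref/out.exr");  // checked-in reference image (example name)

    // failthresh/warnthresh are per-channel error tolerances; a handful of
    // low-bit differences fall below them and the images still "match".
    auto cmp = OIIO::ImageBufAlgo::compare(result, reference,
                                           /*failthresh=*/0.004f,
                                           /*warnthresh=*/0.004f);
    std::cout << cmp.nfail << " pixels exceeded the failure threshold\n";
    return cmp.nfail ? 1 : 0;
}
```

Raising the thresholds in a test corresponds to loosening exactly these tolerances, which is why it should be done sparingly.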
A few more small improvements: https://github.com/OpenImageIO/oiio/pull/2798
Some remarks about the timeouts: I pinned down where the deadlock is, but at this moment I do not really know why it happens.
I have seen it happen in various OIIO operations (idiff, iconvert, oiiotool, ...), but always when interacting with OpenEXR files. I have never seen it trigger when running the tests serially (CTEST_PARALLEL_LEVEL=1), but it is fairly common in parallel execution. I'm pretty sure this is just scheduling related (as inter-process communication/syncing/sharing is not a thing AFAIK).
The deadlock happens during the destructor of the IlmThread::ThreadPool. The DefaultThreadPoolProvider is syncing some semaphores and somehow this goes haywire. As the thread pool is a static (singleton), this destruction happens after the main function returns (not sure if that is related; more of an observation).
Commenting out the content of `pvt::set_exr_threads` inside `openexr.imageio\exrinput.cpp` seems to fix the deadlocks (though that workaround might have other fallout).
So either this is a bug in the OpenEXR Windows threading logic or in the way we interact with it.
To be continued in 2021 ;)
So it never happens with CTEST_PARALLEL_LEVEL=1 (when the number of threads would be chosen to be equal to the number of cores), but only when multiple tests schedule at the same time. It seems that there are only 2 cores per GHA instance, and we are using CTEST_PARALLEL_LEVEL=4 by default, and each of those will see that it's running on a machine with 2 cores. So basically, 8 threads are probably running on a 2 core virtual machine -- that's a pretty "over-threaded" situation and may be revealing subtle bugs (perhaps in OpenEXR's thread pool?) that would never show up on a real user machine under ordinary conditions. I don't know.
As a temporary workaround, I would recommend that just for the Windows tests, maybe turn off parallel testing, e.g. in ci.yml in the `env:` section of the Windows tests only, set CTEST_PARALLEL_LEVEL=1, and then at least we won't get these timeouts while we are getting all the other tests working, and then later we can come back to this. (I wonder if it would also work reliably for CTEST_PARALLEL_LEVEL=2?)
Another experiment to try is to set CTEST_PARALLEL_LEVEL even higher and see if that triggers problems on Linux or Mac.
Another experiment: If the problem is specifically in OpenEXR's ThreadPool destructor (which is called only at our application shut-down?), I wonder if we can circumvent it by pre-emptively calling set_exr_threads (to... 1?) as part of our shutdown procedures?
As an ultimate solution, if we never see the problem other than on Windows, we could just avoid (on that platform only?) ever calling set_exr_threads. That would prevent some fine control over exr threading that is nice to have, but I suppose that's better than deadlocks.
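To make that last idea concrete, here is a purely schematic sketch (not the actual OIIO code; the helper name is hypothetical) of what "never calling set_exr_threads on Windows" would amount to: wherever OIIO would forward its `exr_threads` setting to OpenEXR, skip the call on that platform so the global IlmThread pool is never resized there.

```cpp
#include <OpenEXR/ImfThreading.h>

// Hypothetical helper illustrating the "skip it on Windows" workaround.
void apply_exr_threads_setting(int exr_threads)
{
#ifndef _WIN32
    // Resizing the global pool (and its later destruction at exit) is where
    // the deadlock shows up, so on Windows we would simply leave OpenEXR's
    // default threading untouched.
    Imf::setGlobalThreadCount(exr_threads);
#endif
}
```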
I tried changing CTEST_PARALLEL_LEVEL to 2 (instead of 4) for the Windows tests, and instead of 6 tests failing, only 1 test fails (though, not a consistent one, so definitely still a problem).
I'm going to submit a small PR that sets CTEST_PARALLEL_LEVEL=1 for the Windows CI tests, that will get us unstuck for now to fix up the rest of the Windows CI and get it passing, and then we can return later to investigate a more permanent solution to this problem.
I'm going to merge my other Windows CI fix as well, just to reduce the number of failures a bit more.
I tried CTEST_PARALLEL_LEVEL=1 and still get occasional timeout failures (not always the same ones). So I think the multiple parallel tests exacerbate the problem but are not the cause. I also tried bumping the version of OpenEXR used for the Windows CI tests from 2.4.1 to 2.5.4; that also did not help. With my quick fix ideas not panning out, I'll leave the deadlock investigation for you to pick up when you get a chance. Don't spin your wheels for too long on this -- I think it's also ok to avoid the problem for now by simply disabling the adjustment of exr_threads (just for Windows) until we get it all sorted out.
I figured out the deadlock.
It all boils down to the use of C-style `exit` inside the tools' error handling. If this happens on a worker thread, it leaves the OpenEXR thread pool confused and dazed. Though that is somewhat to be expected, given that the man page for exit literally states that it is not thread-safe.
The OpenEXR threadpool needs either to have different synchronization or a "are my threads still alive" check (just like the OIIO thread pool :) ). Either way, it is something I will bring up on the OpenEXR project (as I can trigger it using 4 lines of non-OIIO code) though it remains a bit of "do not do that" territory imho.
On how to continue getting the tests stable, we can roughly go about it three ways:
A workaround like `set_exr_threads(1)` during shutdown would not really work. It would need to be sprinkled everywhere in the code base, and I think it would not solve anything (shrinking the thread count will execute the exact same code as the d'tor).
Obviously we can also combine some of the options, or do something completely different. I would propose a combo of 1+3 (to get the CI up and running) with 2 planned for later. @lgritz, what do you prefer?
Wow, fantastic detective work!
(1) is something we're working toward anyway, but note that some tests specifically are designed to make sure we do the right thing for certain errors, so this may not solve all the problems.
(2) I like this, but I need to look more at how we can do that. Let me take a look at the oiiotool code in particular and see if there's a simple way to avoid the exit() that happens when an error is encountered (a rough sketch of the general idea follows this list).
(3) I'm not thrilled about this one. Why wouldn't it happen to users if they encounter errors? Making it nice only for CI tests doesn't seem like a robust solution. And disabling exr threading for Windows is also not ideal, we get a lot of improved perf because of the exr threading.
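As a purely illustrative sketch of option (2) -- not the actual oiiotool code, and all names here are hypothetical -- the general idea is to record the failure and let main() return normally, so that exit-time destructors (including OpenEXR's static ThreadPool) run from the main thread rather than being triggered by exit() on a worker thread:

```cpp
#include <atomic>
#include <cstdlib>

static std::atomic<int> g_return_code{ EXIT_SUCCESS };

// Hypothetical error handler: remember the failure instead of calling exit().
void report_error_and_stop()
{
    g_return_code = EXIT_FAILURE;
    // ...signal the tool to wind down its work in an orderly fashion...
}

int main()
{
    // ...run the tool; workers call report_error_and_stop() on errors...
    return g_return_code;   // normal return; static destructors run safely
}
```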
I do think we should report this to OpenEXR, maybe there is a way to fix on that side of the fence (though we still have OIIO built against old OpenEXR by many users, and test against it, so exr-side changes are a long term fix).
That seems to mostly leave choice (2) as the only comprehensive solution, as far as I can see.
Are you able to get a stack trace and figure out exactly which exit calls are the ones that are having the trouble? I'm a bit confused, since no exit calls in the OIIO code would be getting called by one of the OpenEXR worker threads -- we never submit jobs to the OpenEXR pool explicitly, so the only way it's used is running code internal to OpenEXR. Is it OpenEXR code that is calling exit()?
I have seen it trigger from `Oiiotool::error`, and that was not on an OpenEXR thread, just not the main thread (which is indeed confusing naming on my behalf...).
```cpp
// Minimal repro, with the includes it needs: a non-main thread calls exit()
// while the main thread resizes OpenEXR's global thread pool. At process
// shutdown the pool's destructor then deadlocks.
#include <cstdlib>
#include <thread>
#include <OpenEXR/ImfThreading.h>

int main(int argc, char** argv)
{
    std::thread t([]() { exit(-1); });
    Imf::setGlobalThreadCount(5);
    t.join();
}
```
This triggers the exact same issue, yet when the `exit` call is moved onto the main thread, it just never triggers... That's how I figured it must be worker-thread related.
On a related fun (?) note: I'm not able to trigger my repro case anymore when using the latest OpenEXR master. Trying at the moment to see what happens if I run OIIO with the latest OpenEXR master.
One thing I don't understand is... why aren't all the Windows users complaining about this?
Not sure; I guess because it is rather rare to trigger? The chances that any of the OpenEXR worker threads have not run yet when calling exit are rather small? I'll see if I can do some additional tests to clarify it.
I also started an issue with OpenEXR as the repro case is rather trivial https://github.com/AcademySoftwareFoundation/openexr/issues/890
I also found out why `131 - unit_imagebuf` fails.
The test was failing because it could not delete a TIFF file that was opened for reading. On Windows you cannot just do that (you need to specify the correct flags when opening the file). While using another format (PNG for example) lets the test pass, the issue for TIFF files still remains.
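For context, here is a minimal sketch (not the libtiff patch itself) of the Windows behavior being referred to: a file opened without `FILE_SHARE_DELETE` in its share mode cannot be deleted while the handle is still open, so "the correct flags" means including that bit when the reader opens the file.

```cpp
#include <windows.h>

// Open a file for reading in a way that still allows another part of the
// process (or another process) to delete or rename it while it is open.
HANDLE open_for_read_allowing_delete(const wchar_t* path)
{
    return CreateFileW(path,
                       GENERIC_READ,
                       FILE_SHARE_READ | FILE_SHARE_WRITE | FILE_SHARE_DELETE,
                       nullptr,                 // default security attributes
                       OPEN_EXISTING,
                       FILE_ATTRIBUTE_NORMAL,
                       nullptr);                // no template file
}
```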
I patched the libtiff file I/O here: https://github.com/debaetsd/oiio/commit/ee50a3aebd06e182602b8a757af7eb95022207cc, though ideally we should get a patch into libtiff itself.
I think we should suggest this means of opening the file to libtiff, but not put all this Windows-specific code in our tiffinput.cpp. (Hint that it's an ugly approach: we may need to do it separately for a bunch of different formats.)
Our use of remove is problematic -- because we previously read the file with an ImageBuf, it's possible that it's backed by ImageCache which is holding an open ImageInput for that file. It's unusual and ill-advised to try to remove a file that your own process is using. The right solution is to fix the test, by making sure to tell the cache to close the file before removing it. Let me test a quick fix and if it works I'll submit a PR.
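As a rough sketch of the kind of test fix meant here (assuming the raw-pointer `ImageCache::create()`/`invalidate()` API of that era; the helper name and file name are made up): tell the shared cache to drop its handle on the file before asking the OS to remove it.

```cpp
#include <OpenImageIO/imagecache.h>
#include <OpenImageIO/ustring.h>
#include <cstdio>
#include <string>

// Hypothetical helper: make sure the ImageCache is no longer holding an open
// ImageInput for the file, then remove it from disk.
void remove_cached_file(const std::string& filename)
{
    OIIO::ImageCache* ic = OIIO::ImageCache::create(/*shared=*/true);
    ic->invalidate(OIIO::ustring(filename));   // close/forget cached handles for this file
    std::remove(filename.c_str());             // now the delete can succeed on Windows too
}
```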
Oh yeah, it is definitely something that should be fixed in libtiff; I was more experimenting with it in that commit.
Forcing the files to be closed in that test is rather trivial (just move the ImageBuf objects into local C++ scopes and rely on the d'tor to clean up). That was actually my first fix, but then I figured that the point of that test could have been to check whether it is detected that the file was changed (on disk). Though perhaps removal is a different use case than a simple update.
ImageBuf is not going to correctly (whatever that means) handle a file that changed during the IB's continuing use of it. The best we can hope for is that if the IB goes away, and then a second IB references the (now changed) file, the second one should get the new pixels, not the old pixels that were in the ImageCache from the first time around. I believe that's what that section of the test is trying to verify.
I've made some progress on the timeout situation and a few other things. Will try to get to a PR tomorrow.
Still seeing those deadlocks confuses me more than I'm willing to admit. Having another look at it...
I've been knocking off failures (unrelated to the deadlocks) one by one as I got free time over the last few weeks: #2821 #2826 #2835 #2836 #2838
I'm pleased to report that I just submitted #2840, which makes the Windows CI pass!
Caveats:
We still have timeouts, and CI passes only because of the changes I made in #2821 that re-run a test that hits the timeout multiple times. The deadlocks are intermittent enough that repeating 3 or 4 times is likely to pass. This is hardly a real solution, so there is definitely something to solve with the deadlock. Maybe the only solution is fixing OpenEXR, but it will be disappointing if that's the case and we can't find a workaround on our side, since there will certainly be cases of people who need to build against older OpenEXR.
Our Windows CI only installs a minimal set of dependencies, and so a lot of OIIO functionality is disabled and not tested by these CI runs on Windows. We should strive to enable more dependencies so that we are eventually approaching having the Windows CI test overall OIIO functionality as thoroughly as we do on Linux. Most of this is just doing the work to get the dependencies set up, but we need to be careful to not have the building of dependencies (plus building OIIO itself and running all the tests) exceed the 60 minute limit to the CI job. It would be great to have installable binaries so we aren't building all dependencies from scratch every time.
Even with the new OpenEXR 2.5.5 that incorporates the fix to their threadpool, we're still seeing these deadlocks inside OpenEXR during certain OIIO tests. We usually "pass" the Windows CI overall, but only because every test in our testsuite is imposing a 2 minute timeout and repeating up to 5 times before giving up with a failure.
I'm not quite sure what to try next. Without a Windows machine, it's hard for me to "catch it in the act" and understand where it's getting jammed up.
@debaetsd Can you try this again with either OpenEXR master or the 2.5.5 release (both of which have the "fixed, we thought" threadpool) and see if a stack at the point where it's wedged reveals any clues about where and how it's getting stuck?
I'll try to give it a go later this week
Finally got around to testing it some more (a newborn baby is not good for productivity :) ). The call stack is pretty much unchanged (using OpenEXR master):
```
Main Thread
IlmThread-2_5.dll!IlmThread_2_5::Semaphore::wait
ntdll.dll!00007ff8ee1ac6a4()
KernelBase.dll!00007ff8ebe2a34e()
IlmThread-2_5.dll!IlmThread_2_5::Semaphore::wait() Line 109
IlmThread-2_5.dll!IlmThread_2_5::`anonymous namespace'::DefaultThreadPoolProvider::finish() Line 424
IlmThread-2_5.dll!IlmThread_2_5::ThreadPool::Data::~Data() Line 589
IlmThread-2_5.dll!IlmThread_2_5::ThreadPool::~ThreadPool() Line 785
ucrtbase.dll!00007ff8ebc542d6()
ucrtbase.dll!00007ff8ebc541fb()
ucrtbase.dll!00007ff8ebc541b4()
IlmThread-2_5.dll!dllmain_crt_process_detach(const bool is_terminating) Line 182
IlmThread-2_5.dll!dllmain_dispatch(HINSTANCE__ * const instance, const unsigned long reason, void * const reserved) Line 293
ntdll.dll!00007ff8ee147e1d()
ntdll.dll!00007ff8ee173e31()
ntdll.dll!00007ff8ee173ccd()
kernel32.dll!00007ff8ede3e0ab()
ucrtbase.dll!00007ff8ebc605bc()
ucrtbase.dll!00007ff8ebc6045f()
idiff.exe!__scrt_common_main_seh() Line 295
kernel32.dll!00007ff8ede37034()
ntdll.dll!00007ff8ee15d241()
```
Note that only the main thread is still running at this point. I also saved the dump, so if needed I can share it (though without VS installed, it might not be that useful).
I also did some digging into OpenEXR itself and I think I found a solution (both local runs and the GHA CI seem to confirm it). More details in the OpenEXR deadlock topic: https://github.com/AcademySoftwareFoundation/openexr/issues/890
@debaetsd can you tell how much of this is fixed by #3582?
I haven't looked into this for a couple of months, but I'll see if I can get it back up and running again.
**Describe the bug**
This issue is to keep track of failing tests when run inside the Windows CI. This list is found by running the CI with #2796.
The current list of failing tests is:
The full CI log (including more detailed errors on why some tests are failing) is attached to this issue: oiio-log.txt
This issue is mainly to keep track of which tests are passing and/or being worked on. Please reply here if you are working on a specific test and/or have a solution.
**To Reproduce**
Run the specific test on Windows and see it fail. When running the CI locally, make sure to launch it from the "Git for Windows" bash shell (to mimic the GitHub Actions shell).
**Expected behavior**
All tests pass.