Open msimberg opened 4 days ago
@msimberg thanks for the bug report... this problem has been there for a long time, and I am not quite sure how best to resolve it. It has not been a priority since it only happens in very short programs, as you mentioned. The reason I am detaching the thread is that in this very same case, the process sometimes hangs forever trying to join that thread. Detaching the thread and accepting the occasional, though rare, crash for very short programs was a compromise. There is a race condition in these very short programs in which the main thread is exiting before the worker thread even has a chance to start. If you have a good suggestion to fix it, that would be great... it's possible that I could set up a signal between the two threads, so the main thread won't continue until the worker thread indicates that it is ready to go...?
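A minimal sketch of the "signal between the two threads" idea, assuming a hypothetical `handshake_reader` class (this is not APEX's `proc_data_reader`, and all names here are illustrative). The constructor blocks on a `std::future` until the worker reports that it has started. Note that the sketch still joins in the destructor: a startup handshake on its own only guarantees that the worker has started, not that it has stopped using the object.

```cpp
#include <cassert>
#include <future>
#include <thread>
#include <utility>

// Hypothetical stand-in for a background reader; not APEX's actual class.
class handshake_reader {
public:
    handshake_reader() {
        std::promise<void> ready;
        std::future<void> ready_future = ready.get_future();
        worker_ = std::thread([this, p = std::move(ready)]() mutable {
            started_ = true;  // the worker is running and has touched its state
            p.set_value();    // signal the constructing thread that we are ready
            // ... periodic work would go here ...
        });
        ready_future.wait();  // do not return until the worker has signalled
        assert(started_);     // visible here because set_value() synchronizes with wait()
    }

    ~handshake_reader() {
        // Joining is still what makes destruction safe in this sketch.
        if (worker_.joinable()) {
            worker_.join();
        }
    }

    bool started() const { return started_; }

private:
    bool started_ = false;
    std::thread worker_;
};
```

With this shape, a program whose `main` returns immediately still cannot destroy the object before the worker has run at least up to its ready signal.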
Thanks @khuck for the response! I see... Do you have some idea of where the thread is when it hangs? I tried to comment out the `detach`, but was not able to reproduce a hang (that doesn't mean it can't happen... I might just not have the right environment/configuration options etc.).
> it's possible that I could set up a signal between the two threads, so the main thread won't continue until the worker thread indicates that it is ready to go
Naively I'd say the only place where it'd be safe for the main thread to continue is after `std::thread::join` returns, but are you saying there might be an earlier point in `read_proc` where it's safe to destroy the `this`/`proc_data_reader`? The `reading` and `done` member variables are at least accessed in the while-loop in `read_proc`, so I guess the earliest it could signal the main thread to continue is after the while-loop (at least without refactoring).
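To illustrate why `join` is the natural safe point, here is a sketch of a worker whose loop reads a `done` flag and whose owner joins it in the destructor. The class `joining_reader` and its members are hypothetical and only echo the names used in this discussion; this is not the code in `proc_read.h`. The flag is set under a mutex and the worker is woken through a condition variable, so the join should not have to wait for a full sleep interval, and it also returns promptly when the destructor runs before the worker has even started.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical reader; names are illustrative, not APEX's implementation.
class joining_reader {
public:
    joining_reader() : worker_([this] { run(); }) {}

    ~joining_reader() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            done_ = true;  // ask the worker loop to stop
        }
        cv_.notify_all();  // wake the worker if it is waiting between samples
        worker_.join();    // after this returns, no other thread touches *this
        assert(!worker_.joinable());
    }

    int samples() {
        std::lock_guard<std::mutex> lock(mutex_);
        return samples_;
    }

private:
    void run() {
        std::unique_lock<std::mutex> lock(mutex_);
        while (!done_) {
            ++samples_;  // placeholder for one round of periodic work
            // Wait for the next period, but return early once done_ is set.
            cv_.wait_for(lock, std::chrono::milliseconds(10),
                         [this] { return done_; });
        }
    }

    std::mutex mutex_;
    std::condition_variable cv_;
    bool done_ = false;
    int samples_ = 0;
    std::thread worker_;  // declared last so the other members exist before it starts
};
```

If the worker has not started by the time the destructor runs, it sees `done_ == true` on its first check and exits immediately, so constructing and destroying the object back to back is fine in this sketch.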
In pika we have some tests that link to `libapex.so` but don't actually use apex, and that often trigger a segmentation fault or similar late after `main` has exited. The commonality in these tests is that `main` typically runs very quickly. This can be reproduced with a simple program that links to apex, by rerunning it many times. I did this testing with e.g. `g++ -g -O0 -lapex main.cpp` using GCC 10, but I think it should translate to any compiler.

The issue seems to be that when `main` exits quickly, apex also reaches cleanup quite quickly (https://github.com/UO-OACISS/apex/blob/6edfb929860bf8e46577353c908042e09a12b8d8/src/apex/apex_preload.cpp#L113), but the `proc_data_reader` thread spawned in https://github.com/UO-OACISS/apex/blob/6edfb929860bf8e46577353c908042e09a12b8d8/src/apex/proc_read.h#L66 may run after (or while) the `proc_data_reader` is destroyed (https://github.com/UO-OACISS/apex/blob/6edfb929860bf8e46577353c908042e09a12b8d8/src/apex/apex.cpp#L151).

If I understand the code correctly, this is because the thread is detached here: https://github.com/UO-OACISS/apex/blob/6edfb929860bf8e46577353c908042e09a12b8d8/src/apex/proc_read.h#L67. Is there any particular reason for detaching, especially given that it would be joined here: https://github.com/UO-OACISS/apex/blob/6edfb929860bf8e46577353c908042e09a12b8d8/src/apex/proc_read.h#L75? At least from a quick test, commenting out the `detach` avoids the issue, but I don't know if it has other effects...

Note: while the above test fails after running it enough times, I found it easy to reproduce the issue by adding a few sleeps:
Running this through valgrind, it reports with near certainty:
This was tested on apex develop (6edfb929860bf8e46577353c908042e09a12b8d8).
I see a few other `std::thread::detach`es in the codebase that may have the same issue, but I did not investigate those.