bsc-performance-tools / extrae

Instrumentation framework to generate execution traces of the most used parallel runtimes.
https://tools.bsc.es/extrae
GNU Lesser General Public License v2.1

Infinite loop in signal handler #19

Closed: devreal closed this issue 4 years ago

devreal commented 6 years ago

I am having trouble terminating Extrae (v3.5.4) in case of an error, e.g., MPI error or segmentation fault. I am seeing an endless stream of messages like the following:

Extrae: Attention! Signal 11 (Segmentation fault) caught. Notifying to flush buffers whenever possible.
Extrae: Attention! Signal 11 (Segmentation fault) caught. Notifying to flush buffers whenever possible.
Extrae: Attention! Signal 11 (Segmentation fault) caught. Notifying to flush buffers whenever possible.

Repeated attempts to quit the application using Ctrl+C don't help. My only option was to kill the interactive session...

Looking at the signal handler SigHandler_FlushAndTerminate, the above line is printed for every signal caught after the first one. Signal handlers are quite limited in what they may safely do (I/O is not part of it), but attempting to flush the buffers before exiting is still worthwhile. However, if the application is in an invalid state, more signals might occur, leading to this infinite loop. The signal handler should instead give up (potentially after N caught signals) and exit the application without flushing the buffers. Otherwise the application keeps running until it hits either the walltime limit or the filesystem quota (the log file grows fairly quickly...).

Even for SIGINT, the signal handler should not wait for the flush to finish. The user may want to interrupt flushing the buffers, e.g., because they deem the measurements unusable based on the output of the application. In all cases, an infinite loop should be avoided.
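To make the request concrete, here is a minimal sketch of a handler that gives up after a bounded number of signals. It is not Extrae's code: the names `bounded_handler`, `install_bounded_handler`, `should_give_up`, and `MAX_SIGNALS_BEFORE_EXIT` are hypothetical, and the limit of 2 is an arbitrary choice for illustration. The handler only sets a flag and calls `write` and `_exit`, both of which POSIX lists as async-signal-safe. The actual flush is assumed to happen elsewhere, outside signal context.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical limit: number of signals after which the handler gives up. */
#define MAX_SIGNALS_BEFORE_EXIT 2

/* Polled by the main program, which would flush outside signal context. */
static volatile sig_atomic_t flush_requested = 0;
static volatile sig_atomic_t signals_seen = 0;

/* Returns 1 once the number of signals seen has reached the limit. */
static int should_give_up(int seen, int limit)
{
    return seen >= limit;
}

static void bounded_handler(int signo)
{
    static const char msg[] = "signal caught, requesting flush\n";
    ssize_t ignored;

    signals_seen++;
    if (should_give_up(signals_seen, MAX_SIGNALS_BEFORE_EXIT)) {
        /* Give up without flushing; _exit is async-signal-safe. */
        _exit(128 + signo);
    }
    flush_requested = 1;
    /* write is async-signal-safe, unlike printf or fprintf. */
    ignored = write(STDERR_FILENO, msg, sizeof msg - 1);
    (void)ignored;
}

/* Installs the handler for one signal; returns 0 on success, -1 on error. */
static int install_bounded_handler(int signo)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = bounded_handler;
    sigemptyset(&sa.sa_mask);
    return sigaction(signo, &sa, NULL);
}
```

With this shape, a second signal ends the process instead of printing the message again. For a fault such as SIGSEGV, returning from the handler typically re-executes the faulting instruction, so a limit of 1 may be the more sensible choice there.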

devreal commented 4 years ago

I just ran into the exact same issue and remembered that I filed this issue about a year ago, which was closed through https://github.com/bsc-performance-tools/extrae/commit/a16ccf27093632bfabae183a78c9c8f7fc308189. Looking at the patch, I wonder where the undocumented magic number 10 comes from? Is there any reason to assume that if I/O fails 9 times it will succeed on a tenth attempt? Is this some sort of easter-egg where I really have to convince Extrae (11 times) to abort? IMO Extrae should abort immediately if a) I tell it to (by sending SIGINT), without trying to outsmart me; or b) any critical signal occurs, after which I/O is not a safe operation anymore. In fact, any data written for an aborted job is a waste of cycles and storage space.

For some reason I cannot reopen this ticket so I'm just leaving this comment here.

emercadal commented 4 years ago

Extrae captures signals and tries to dump all instrumented data to files and create a tracefile without losing any of the captured program's activity. If you don't want Extrae to "outsmart you", you can disable signal handling in the XML as explained in section 4.12 of the manual.

Anyway, if you could send us a snippet of code presenting this issue we would appreciate it.

devreal commented 4 years ago

@emercadal Thanks for the reply. I wasn't aware of the XML configuration option so I will make that the default in my config files.

As I said earlier, performing I/O in signal handlers is unsafe. I understand the goal of rescuing whatever data Extrae has collected so far, but there is no reason to assume that once that fails (or the user purposefully interrupts) it will succeed on the second, third, ..., tenth try. Intercepting signals and attempting to dump the buffer is fine, but please try that only once and exit if I send another signal. After all, it is my or the MPI implementation's stated desire to interrupt whatever is going on right now.
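One way to get "try once, then exit on the next signal" without any counter is the POSIX `SA_RESETHAND` flag, which restores the default disposition as soon as the handler is entered. The sketch below illustrates that technique and is not taken from Extrae. The names `one_shot_handler`, `install_one_shot_handler`, `is_default_disposition`, and `dump_requested` are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

/* Set on the first signal; the dump itself would run outside the handler. */
static volatile sig_atomic_t dump_requested = 0;

static void one_shot_handler(int signo)
{
    (void)signo;
    dump_requested = 1;
}

/* Installs a handler that runs at most once for the given signal.
 * SA_RESETHAND resets the disposition to SIG_DFL on handler entry, so a
 * second signal takes the default action (termination for SIGINT, SIGTERM,
 * SIGSEGV). Returns 0 on success, -1 on error. */
static int install_one_shot_handler(int signo)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = one_shot_handler;
    sa.sa_flags = SA_RESETHAND;
    sigemptyset(&sa.sa_mask);
    return sigaction(signo, &sa, NULL);
}

/* Returns 1 if the signal currently has its default disposition,
 * 0 if a handler is installed, -1 on error. */
static int is_default_disposition(int signo)
{
    struct sigaction cur;

    if (sigaction(signo, NULL, &cur) != 0) {
        return -1;
    }
    return cur.sa_handler == SIG_DFL;
}
```

After the first delivery, `is_default_disposition` reports 1, so a second Ctrl+C or a repeated fault terminates the process through the kernel's default action rather than re-entering the tool's handler.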

There is no code snippet I can provide, as this is purely a problem inside Extrae. The issue should be universal and independent of the application used or the system configuration.