Gracefully handling termination in instrumented binaries

aantron / bisect_ppx

Code coverage for OCaml and ReScript

http://aantron.github.io/bisect_ppx/demo/

MIT License

302 stars 60 forks source link

Gracefully handling termination in instrumented binaries #384

Closed arvidj closed 2 years ago

arvidj commented 3 years ago

In our system testing framework, we launch processes instrumented with bisect. Normally, these processes terminate normally, writing coverage data as expected. However, some tests may leave unterminated processes, at which point the framework terminates all dangling processes through SIGTERM. Then, one of two things may happen:

if the unterminated process has not yet started terminating (and if it has no SIGTERM handler of its own eventually calling exit), the process will terminate without writing coverage data
if the unterminated process has started terminating, triggering bisect's termination hook, we're seeing corrupted coverage traces (either completely empty, or containing corrupt data).

In the first case we're getting an incomplete coverage measure (as the unterminated process has indeen been tested even though not cleanly terminated). In second, we have the same situation but in addition, bisect-ppx-report throws exceptions.

I don't see any simple solution to this problem (except making sure processes are terminated more gracefully in our systems testing framework), so consider this more of a discussion issue. It seems relatively simple to add a signal handler dealing with the first case, but as you note in https://github.com/aantron/bisect_ppx/issues/122:

My guess would be that Bisect can't make decisions about threads or signal handlers inside its own runtime. That's too likely to interfere with applications.

For the second case, perhaps the at_exit hook can be made more atomic?

aantron commented 3 years ago

Is part of your testing framework linked with the code under test? If so, you could link in a SIGTERM handler that calls dump_counters_exn followed by reset_counters:

https://github.com/aantron/bisect_ppx/blob/c632ed972a80a4dd32c0ae8a993303160c1c4f26/src/runtime/native/runtime.mli#L66-L70 https://github.com/aantron/bisect_ppx/blob/c632ed972a80a4dd32c0ae8a993303160c1c4f26/src/runtime/native/runtime.mli#L72-L73

AFAIK we can't make the at_exit hook considerably more atomic in the face of signals. We could mask SIGTERM, but as in #122, I worry about that interfering with the user's program excessively.

For when the framework is not linked into the code under test, we could add a flag to the PPX that allows the generated code to install signal handlers or mask signals. For example,

(instrumentation
 (backend bisect_ppx --sigterm --mask))

...might cause the runtime to install a SIGTERM handler that will write coverage data, and also mask signals during the coverage writing hook.

However, I think it would be best if the testing framework could take care of that. It seems to me that even for blackbox testing, you could link something into your binaries under test that will install the right signal handlers for your testing environment.

bisect-ppx-report should be made more resistant to corrupted coverage files. I'll look into this shortly, and I opened #385 to track it. I think the combination of a signal handler and a robust bisect-ppx-report should solve 2, since the signal handler will interrupt the at_exit hook and write the coverage statistics to a different file. The file being written by the at_exit hook will become corrupt, but an improved bisect-ppx-report will know to ignore it.

arvidj commented 2 years ago

In one of our tests touched by this issue, I've worked around the problem by handling signals explicitly in the instrumented process: https://gitlab.com/tezos/tezos/-/merge_requests/3792/

I think this is the way to go for tests that launch their own processes where the code of the child processes is "part of the test itself". For tests that launch separate binaries, I think one should:

take care to let those binaries terminate gracefully
ensure that those binaries already have signal handling in place that triggers at_exit hook on termination

I agree with you assessment that it would be messy to have bisect mix with the instrumented process's signal handling by default. Nonetheless, I'm afraid we'll have to add such a thing due to potentially large number of binaries we have in our project that do not handle signals at all (or improperly) (the second point above). We might not be able to go through and correct them on a case-by-case basis. Instead, it'd be good to have a blanket solution in the form of the --mask and --sigterm options you evoke above. I've thought a bit on how to implement those options in bisect here. I'd be happy for your thoughts before I attempt an implementation.

Another problem is that we have a fair amount of tests and a fair amount of different binaries. Going through them all and making sure coverage data is written for each launched process is infeasible to do by hand. I touch upon a solution in the issue I link above. What if instrumented binaries create an empty coverage file at initialization, which is filled at termination? In this way, we would have a clear marker (an empty file) indicating that something "should've been" written.

aantron commented 2 years ago

Thanks! I've read your proposal. The file lock idea seems like it would have a slightly complicated interaction with the runtime's ability to write .coverage files on demand.

We might be able to make this simpler by just having both at_exit and the signal handler blindly write .coverage files. The file names are random, so there should be no collision between the two writing "threads." Before at_exit starts writing a file, it can store its name in a ref. The signal handler can use the ref to delete at_exit's file, to prevent visit double counting by the reporter.

Going through them all and making sure coverage data is written for each launched process is infeasible to do by hand. I touch upon a solution in the issue I link above. What if instrumented binaries create an empty coverage file at initialization, which is filled at termination? In this way, we would have a clear marker (an empty file) indicating that something "should've been" written.

Could having the Bisect runtime handle SIGTERM make coverage writing reliable enough to make this unnecessary?

If not, we can probably have the runtime create some empty files that can be counted, and compared with the .coverage files — a variation of what you are proposing.

aantron commented 2 years ago

The other thing to note is that --mask is intended to turn off signal handling completely during at_exit, in which case it should be unnecessary to have any kind of locking between the signal handler and at_exit.

aantron commented 2 years ago

(--mask should probably just be part of --sigterm, since --sigterm would then only be safe with signals masked during at_exit)

arvidj commented 2 years ago

We might be able to make this simpler by just having both at_exit and the signal handler blindly write .coverage files. The file names are random, so there should be no collision between the two writing "threads." Before at_exit starts writing a file, it can store its name in a ref. The signal handler can use the ref to delete at_exit's file, to prevent visit double counting by the reporter. We might be able to make this simpler by just having both at_exit and the signal handler blindly write .coverage files. The file names are random, so there should be no collision between the two writing "threads." Before at_exit starts writing a file, it can store its name in a ref. The signal handler can use the ref to delete at_exit's file, to prevent visit double counting by the reporter.

I think three different scenarios are possible:

the signal handler executes and terminates before at_exit (in this case at_exit is never called), or vice versa. (1a and 1b below)
at_exit starts executing inside the signal handler (scenarios 2a and 2b below).
the signal is received inside the at_exit

Scenario 1a:

`at_exit`      :   <--->
signal handler :          never runs

Scenario 1b:

`at_exit`      :          never runs
signal handler :   <--->

Scenario 2a:

`at_exit`      :      <------>
                       ^- writes the ref
signal handler :   <-------------->
                                 ^ removes the ref
Scenario 2b:

`at_exit`      :            <-------------->
                                    ^- writes the ref
signal handler :   <-------------->
                                 ^- finds nothing to remove
Scenario 3a:

`at_exit`      :   <-------------->
                    ^ masks signal
signal received:       | 
signal handler :       (masked)

Scenario 3b:

`at_exit`      :   <-------------->
                      ^ masks signal, writes the ref
signal received:    | 
signal handler :    <-------->

Scenario 1 should pose no issues.

In scenario 2, at_exit will write it's coverage to some file and store a ref to the file. If it has finished doing before the termination of the signal handler, then the signal handler can remove that file (2a above). However, if at_exit is slower than the signal handler (2b above), then will the signal handler know that there is a file to remove?

If the signal is recieved after the start of at_exit, then the signal handler should be inhibited by the masking (3a above). But, is the following scenario is possible (3b above): 1) at_exit starts 2) signal received 3) at_exit masks?

aantron commented 2 years ago

AFAIK scenarios 2a and 2b are impossible — at_exit cannot start running during the signal handler (unless somehow called or directly triggered by the signal handler itself). The signal handler would have to finish in order for the "main process" to resume and trigger at_exit in any other way. The signal handler "takes over" the main thread, so once it starts running, in the case of the --sigterm handler, it is guaranteed to write the .coverage file and exit the process without triggering any "other" at_exit.

Scenario 3b is impossible (or at least not dangerous) if at_exit first masks SIGTERM before doing any of its other work. Then, if SIGTERM races in before the filename is chosen or the file is created, the SIGTERM will take over the thread and the thread will never return to at_exit anyway. If SIGTERM is too late, it is already masked, and at_exit will finish.

When choosing a filename, it's probably important to first write it to the ref, and then try to open/create the file. The SIGTERM should then attempt to remove the file named in the ref, and proceed silently if the file turns out not to exist.

aantron commented 2 years ago

Actually, we probably don't even need the ref at all, if we do

if at_exit first masks SIGTERM before doing any of its other work. Then, if SIGTERM races in before the filename is chosen or the file is created, the SIGTERM will take over the thread and the thread will never return to at_exit anyway. If SIGTERM is too late, it is already masked, and at_exit will finish.

Since SIGTERM will either come first, or will be masked during the entire process of choosing a filename and writing the file.

arvidj commented 2 years ago

Yes, I agree with your assessment. I'll do some experiments with the approach you suggest.

If not, we can probably have the runtime create some empty files that can be counted, and compared with the .coverage files — a variation of what you are proposing.

I tried this, and it turns out to be complicated. As we typically fork in our instrumented binaries, there is no set "starting point" of each instrumented process. When a binary forks then we want to have two coverage files (and that is what bisect will give us unless any of the processes are terminated abruptly). However, it's hard to make both the parent and child process leave some kind of trace that allows us to verify that the correct number of coverage files were written.

aantron commented 2 years ago

Probably addressed (pending full-scale testing :P) by @arvidj in #390, so closing for now. Thanks!