Make sure bisect-ppx-report handles corrupted coverage files gracefully

vch9 commented 3 years ago

Regarding corrupted coverage files do you have a hint on the difference between the following errors:

Info: found coverage files in '_coverage_output/'
bisect-ppx-report: internal error, uncaught exception:
                   Bisect_common.Invalid_file("_coverage_output/277891434.coverage", "unexpected end of file while reading magic number")

and

Info: found coverage files in '_coverage_output/'
bisect-ppx-report: internal error, uncaught exception:
                   Bisect_common.Invalid_file("_coverage_output/697300040.coverage", "exception reading data: End_of_file")

vch9 commented 3 years ago

Well I just had another exception raised. If you're not already working on this @aantron and need help (review, pr etc) I'll be happy to help as we deeply need this feature

aantron commented 3 years ago

Regarding corrupted coverage files do you have a hint on the difference between the following errors:

They are the same error (reading a truncated file, so End_of_file), but occurring during different parts of the read. The first is occurring during reading of a magic number in the file header, and the second during reading of the main file "payload." Neither should be fatal like this or reported in this way.

I haven't started working on this in particular. What I've done so far is considerably refactored the reporter to make it easier to work on, because I dreaded working on it due to some legacy holdover from features deleted in the last couple of years. I've just removed all of those. You can see the progress in src/report in master.

Would you like to work on this directly? If not, my plan is to first clarify some of the I/O functions that are most responsible for all the I/O errors, and then fix this issue. If you want to work on this immediately, though, I will pause refactoring the reporter.

vch9 commented 3 years ago

I haven't started working on this in particular. What I've done so far is considerably refactored the reporter to make it easier to work on, because I dreaded working on it due to some legacy holdover from features deleted in the last couple of years. I've just removed all of those. You can see the progress in src/report in master.

I saw this; this is great :)

Would you like to work on this directly? If not, my plan is to first clarify some of the I/O functions that are most responsible for all the I/O errors, and then fix this issue. If you want to work on this immediately, though, I will pause refactoring the reporter.

My priority would be to fix the bad writing of coverage files to avoid the loss of information If you're already working or have plans to do this this should be faster. However, I could test when the issue is fixed to see how well that kind of exception is handled.

aantron commented 3 years ago

My priority would be to fix the bad writing of coverage files to avoid the loss of information If you're already working or have plans to do this this should be faster. However, I could test when the issue is fixed to see how well that kind of exception is handled.

I'm not working on that at all yet — I want to clean up the reporter a bit more first, before looking in detail at the writing of .coverage files. I also wanted to give time to get some more information/opinions in #384 before I do something that might turn out not to be necessary :)

aantron commented 3 years ago

The errors should be somewhat clearer now, and I've added some documentation about them in https://github.com/aantron/bisect_ppx/commit/d8917a399a39b843525d7b936ab938295b4a1264. I've simplified the code, so that it would be easier to implement graceful error handling. However, before adding it, it's better to work on making the writing as correct as possible instead, as per https://github.com/aantron/bisect_ppx/issues/385#issuecomment-954818828. Then, it will be clearer what the reader in the reporter has to actually adapt to. Any word on what from https://github.com/aantron/bisect_ppx/issues/384#issuecomment-947557816 makes sense in your testing environment?

vch9 commented 2 years ago

Any word on what from #384 (comment) makes sense in your testing environment?

It is most certainly related to this issue, we launch several processes and kill them if something went wrong.

aantron commented 2 years ago

Any word on what from #384 (comment) makes sense in your testing environment?

It is most certainly related to this issue, we launch several processes and kill them if something went wrong.

I mean what potential solution makes sense in your environment. Can you have the processes handle the signals better, for example?

vch9 commented 2 years ago

Any word on what from #384 (comment) makes sense in your testing environment?

It is most certainly related to this issue, we launch several processes and kill them if something went wrong.

I mean what potential solution makes sense in your environment. Can you have the processes handle the signals better, for example?

Yes this is what I'm trying to do on my side. However, I still believe bisect-ppx-report could emit a warning instead of failing if let's say only one coverage file is corrupted (in hundreds of them idk).

aantron commented 2 years ago

I've added a test that truncates a .coverage file to various lengths and shows the behavior of the reporter.

I still believe bisect-ppx-report could emit a warning instead of failing if let's say only one coverage file is corrupted (in hundreds of them idk).

We can probably add this as an option. I don't think it should be the default, because in a smaller project in which runs are more predictable:

corrupt coverage files should be extremely rare;
when they do occur, they suggest some kind of severe error that should not be ignored;
a missing coverage file causing only a warning (say, in CI) can silently skew final coverage statistics considerably in a smaller project.

arvidj commented 2 years ago

Any word on what from #384 (comment) makes sense in your testing environment?

Broadly, we have three types of tests:

unit tests that do not forking nor spawn any processes
integration tests that spawn processes by forking
system tests that execute separate binaries

For the first type, I see no reason why we should have any corrupted file, nor have we seen that.

For the second type of test, of which we have very few, we've experimented with adding signal handlers, attempting to make sure that everything terminates gracefully w.r.t. bisect. However, it's been tricky to find a solution that works well, especially when running in gitlab CI for some reason. In the latest version, which seems to work OK, we have a signal handler that catches sigterms from the parent process, and then make sure that bisect's at_exit hook is finally called. It does seem like I need to mask sigterm in bisect's at_exit hook though. You can read more about this here.

Since we need to both install signal handlers and modify bisect's runtime for each test, I think I'll try to implement the --sigterm option you propose in #384 : this would give us a handier solution if similar problems pop up in the future.

Regarding the third set of tests, as long as the testing frameworks terminates the binaries gracefully, and as long as the terminated binary has a sigterm handler in place that does nothing crazy, it seems like no modifications are needed.

Our goal is to never have any corrupted traces written and to make sure each launched processes writes a trace. And to fail fast when this is not the case. We figure this will give users more confidence in the coverage reports. We're just hoping this will not make our CI less robust and that we won't have debug tests corrupting traces all the time.

aantron commented 2 years ago

Thanks for that! I'm going to close this issue as likely unnecessary, since it seems it can be addressed by trying not to write corrupted coverage files. If that turns out to still not be practical, this issue can be revisited.

aantron / bisect_ppx

Make sure bisect-ppx-report handles corrupted coverage files gracefully #385