Open keyboardDrummer opened 4 months ago
Problem started with this PR: https://github.com/boogie-org/boogie/pull/899 although that never got merged.
The first merged PR that suffers from the problem is the following. Going back up to ten PRs before that does not show the timeout problem. https://github.com/boogie-org/boogie/pull/901 Sat, 06 Jul 2024 15:38:38 GMT Image: ubuntu-22.04 Version: 20240630.1.0
Before that was this PR: https://github.com/boogie-org/boogie/pull/900 Fri, 05 Jul 2024 04:14:03 Image: ubuntu-22.04 Version: 20240630.1.0
Before that was: https://github.com/boogie-org/boogie/pull/898 Wed, 03 Jul 2024 14:32:37 Image: ubuntu-22.04 Version: 20240624.1.0
However, there was also an unmerged PR that suffers from the problem:
https://github.com/boogie-org/boogie/actions/runs/9796660534/job/27051593251
Thu, 04 Jul 2024 15:17:49
Operating System
Ubuntu
22.04.4
LTS
Runner Image
Image: ubuntu-22.04
Version: 20240630.1.0
Given that the unmerged PR is the first to run into timeouts, it seems that the issue is unrelated to any changes in Boogie. The PR before the unmerged problematic one is https://github.com/boogie-org/boogie/pull/898, but that itself has no issue, and neither does another PR after it, so that PR does not seem to be at fault.
The issue seems to come from the image version upgrade from 20240624.1.0 to 20240630.1.0, since all runs with the problem are in 20240630.1.0 and none are on 20240624.1.0.
One thing that's a shame is that the timeout comes from lit, not from the Boogie process. It would be good to configure timeouts at the Boogie level that are shorter than the lit timeout, so that when the timeout occurs, the Boogie process can give some feedback as to where it occurred.
Desperate move but maybe upgrading dotnet will help: https://github.com/boogie-org/boogie/issues/909
Update: nope, timeouts still occur there
The issue seems to come from the image version upgrade from 20240624.1.0 to 20240630.1.0, since all runs with the problem are in 20240630.1.0 and none are on 20240624.1.0.
The diff between those Ubuntu image versions is here, but it unfortunately gives me no ideas as to what might be going wrong. I can't imagine how any of those changes would have impacted anything.
One interesting data point is that the normal runtime for the tests that are hanging is extremely short. Verification time for each is a few milliseconds, and the Z3 resource count is usually under 1000, and always under 3000.
My best hypothesis is that this is a deadlock. Maybe the changes to the underlying system are somehow enough to make the non-determinism of concurrency more likely to go wrong?
It could be worth seeing what happens if we try reverting concurrency-related PRs and seeing whether that makes any difference (not that we'd want to revert them permanently, because I think they all fix important things, but it might help us understand what's going on). The ones I see in the recent past include #841, #854, #888, and maybe #898?
My best hypothesis is that this is a deadlock.
I agree. I increased the timeout by a huge amount and the timeouts still occurred.
Different tests seems to fail non-deterministically: