boogie-org / boogie

Boogie
http://research.microsoft.com/en-us/projects/boogie/
MIT License
504 stars 109 forks source link

Tests time out non-deterministically #909

Open keyboardDrummer opened 1 month ago

keyboardDrummer commented 1 month ago

Different tests seems to fail non-deterministically:

keyboardDrummer commented 1 month ago

Problem started with this PR: https://github.com/boogie-org/boogie/pull/899 although that never got merged.

keyboardDrummer commented 1 month ago

The first merged PR that suffers from the problem is the following. Going back up to ten PRs before that does not show the timeout problem. https://github.com/boogie-org/boogie/pull/901 Sat, 06 Jul 2024 15:38:38 GMT Image: ubuntu-22.04 Version: 20240630.1.0

Before that was this PR: https://github.com/boogie-org/boogie/pull/900 Fri, 05 Jul 2024 04:14:03 Image: ubuntu-22.04 Version: 20240630.1.0

Before that was: https://github.com/boogie-org/boogie/pull/898 Wed, 03 Jul 2024 14:32:37 Image: ubuntu-22.04 Version: 20240624.1.0

However, there was also an unmerged PR that suffers from the problem:

https://github.com/boogie-org/boogie/actions/runs/9796660534/job/27051593251
Thu, 04 Jul 2024 15:17:49
Operating System
  Ubuntu
  22.04.4
  LTS
Runner Image
  Image: ubuntu-22.04
  Version: 20240630.1.0

Given that the unmerged PR is the first to run into timeouts, it seems that the issue is unrelated to any changes in Boogie. The PR before the unmerged problematic one is https://github.com/boogie-org/boogie/pull/898, but that itself has no issue, and neither does another PR after it, so that PR does not seem to be at fault.

The issue seems to come from the image version upgrade from 20240624.1.0 to 20240630.1.0, since all runs with the problem are in 20240630.1.0 and none are on 20240624.1.0.

keyboardDrummer commented 1 month ago

One thing that's a shame is that the timeout comes from lit, not from the Boogie process. It would be good to configure timeouts at the Boogie level that are shorter than the lit timeout, so that when the timeout occurs, the Boogie process can give some feedback as to where it occurred.

keyboardDrummer commented 1 month ago

Desperate move but maybe upgrading dotnet will help: https://github.com/boogie-org/boogie/issues/909

Update: nope, timeouts still occur there

atomb commented 1 month ago

The issue seems to come from the image version upgrade from 20240624.1.0 to 20240630.1.0, since all runs with the problem are in 20240630.1.0 and none are on 20240624.1.0.

The diff between those Ubuntu image versions is here, but it unfortunately gives me no ideas as to what might be going wrong. I can't imagine how any of those changes would have impacted anything.

One interesting data point is that the normal runtime for the tests that are hanging is extremely short. Verification time for each is a few milliseconds, and the Z3 resource count is usually under 1000, and always under 3000.

My best hypothesis is that this is a deadlock. Maybe the changes to the underlying system are somehow enough to make the non-determinism of concurrency more likely to go wrong?

It could be worth seeing what happens if we try reverting concurrency-related PRs and seeing whether that makes any difference (not that we'd want to revert them permanently, because I think they all fix important things, but it might help us understand what's going on). The ones I see in the recent past include #841, #854, #888, and maybe #898?

keyboardDrummer commented 1 month ago

My best hypothesis is that this is a deadlock.

I agree. I increased the timeout by a huge amount and the timeouts still occurred.