Spurious Joshua timeout in shared environment

apple / foundationdb

FoundationDB - the open source, distributed, transactional key-value store

https://apple.github.io/foundationdb/

Apache License 2.0


Spurious Joshua timeout in shared environment #4116

Open jzhou77 opened 3 years ago

jzhou77 commented 3 years ago

We observed that when running in a shared environment (e.g., Docker containers or AWS spot instances), Joshua correctness runs often fail with timeout errors. These errors become very noisy when CPU resources on the test machine are heavily contended, so it would be very useful if we could filter them out.

One idea I have is for TestHarness to check the progress of simulation runs and rerun the simulation when it hits a timeout error.
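To make the rerun idea concrete, here is a minimal Python sketch (TestHarness itself is C#, so this is only an illustration). The function name `run_with_retry` and its parameters are hypothetical, not part of TestHarness or Joshua; the sketch assumes the simulation is launched as a child process and that a wall-clock timeout is what we want to retry on.

```python
import subprocess
import sys


def run_with_retry(cmd, timeout_s, max_attempts=2):
    """Run cmd, rerunning it only when it times out.

    Returns (status, attempts), where status is "passed", "failed",
    or "timeout". Hypothetical helper, not an existing TestHarness API.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            # subprocess.run kills the child and raises on timeout.
            result = subprocess.run(cmd, timeout=timeout_s)
        except subprocess.TimeoutExpired:
            # Possibly spurious (e.g. CPU contention): try again.
            continue
        # A real exit code is never retried, so genuine failures stay visible.
        status = "passed" if result.returncode == 0 else "failed"
        return (status, attempt)
    # Every attempt timed out: report the timeout as real.
    return ("timeout", max_attempts)


if __name__ == "__main__":
    print(run_with_retry([sys.executable, "-c", "pass"], timeout_s=30))
```

Only timeouts are retried here; a non-zero exit is reported immediately, so the retry cannot hide an actual correctness failure.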

jzhou77 commented 3 years ago

We can handle the trivial cases, where 1) no trace line is produced for a while, and 2) a rerun passes.
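The two conditions could be checked roughly as follows. This is a sketch under assumptions: it treats the trace file's modification time as a proxy for "a trace line was produced", and the names `trace_is_stalled` and `is_probably_spurious` are hypothetical, not existing TestHarness functions.

```python
import os
import time


def trace_is_stalled(trace_path, stall_seconds, now=None):
    """True if the trace file has not been modified for stall_seconds.

    Assumption: the file's mtime advances whenever a trace line is written.
    """
    now = time.time() if now is None else now
    return now - os.path.getmtime(trace_path) >= stall_seconds


def is_probably_spurious(timed_out, stalled, rerun_passed):
    """Treat a timeout as spurious only if both conditions hold."""
    return timed_out and stalled and rerun_passed
```

Requiring both conditions keeps the filter conservative: a timeout whose rerun also fails is still reported.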

TestHarness can also look at the /proc file system.
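One plausible use of /proc (an assumption, since the comment does not say what to read) is to sample the aggregate `cpu` line of `/proc/stat` twice and estimate how busy the host was in between. The helper names below are hypothetical, and the field order assumed is the standard Linux one: user, nice, system, idle, iowait, irq, softirq, steal.

```python
def parse_cpu_times(stat_text):
    """Return the first eight counters from the aggregate 'cpu' line.

    Guest time is skipped because the kernel already counts it in user time.
    """
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            return [int(value) for value in fields[1:9]]
    raise ValueError("no aggregate 'cpu' line found")


def busy_fraction(before, after):
    """Fraction of CPU time between two samples that was not idle or iowait."""
    deltas = [b - a for a, b in zip(before, after)]
    total = sum(deltas)
    if total <= 0:
        return 0.0
    idle = deltas[3] + deltas[4]  # idle + iowait
    return 1.0 - idle / total
```

On Linux, the two samples would come from reading `/proc/stat` a short interval apart. A high busy fraction, or a growing steal counter, around the time of a timeout would suggest contention rather than a real hang.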

sfc-gh-abeamon commented 3 years ago

This is part of the retry logic I was remembering. I haven't looked carefully to see if it does anything similar to what we want:

https://github.com/apple/foundationdb/blob/15b2f77de6477592e22779fece7af2744e20fec2/contrib/TestHarness/Program.cs.cmake#L920