emichael / dslabs

Distributed Systems Labs and Framework
https://ellismichael.com/dslabs/

Run test by simulating #41

Open sgdxbc opened 1 year ago

sgdxbc commented 1 year ago

This is an early-stage, ongoing effort to support deterministic, simulated execution of run tests. I would like to hear comments before moving on, e.g., whether this feature is wanted and would be merged. Planned tasks and progress are listed at the end.

Motivation. In our experience teaching CS5223 at NUS, one critical issue we faced is the poor reproducibility of complicated run tests. Specifically:

Introducing deterministic simulation may help a lot. With the same random seed, a solution will always reproduce the same result (as long as the solution itself is deterministic), independent of machine performance and of whether logging is enabled. For grading, we can simply select several random seeds for each submission and, for each failed solution, inform the student of the failing seed(s), so they can reliably confirm the failure.

How is this different from single-threaded execution? The current simulation implementation uses a single thread and shares a lot in common with single-threaded mode; in fact, simulation is currently implemented as an "enhanced" single-threaded mode. The key differences include:

The interface. Simulation can be enabled by passing the --simulated flag to the run-tests.py script. All run tests will then be executed in simulation mode, with additional logging:

$ python3 run-tests.py --lab 1 --part 2 --test 2 --simulated -g FINER
--------------------------------------------------
TEST 2.2: Single client basic operations [RUN] (20pts)

[FINER  ] [2022-11-07 22:04:55] [dslabs.framework.testing.runner.RunState] Virtual time 0.042328000000000005ms
[FINER  ] [2022-11-07 22:04:55] [dslabs.framework.Node] MessageReceive(client1 -> server, Request(...
[FINER  ] [2022-11-07 22:04:55] [dslabs.framework.testing.runner.RunState] Virtual time 0.066084ms
[FINER  ] [2022-11-07 22:04:55] [dslabs.framework.Node] MessageReceive(server -> client1, Reply(...
[FINER  ] [2022-11-07 22:04:55] [dslabs.framework.testing.runner.RunState] Virtual time 0.09002500000000001ms
[FINER  ] [2022-11-07 22:04:55] [dslabs.framework.Node] MessageReceive(client1 -> server, Request(...
# lines omitted
[FINER  ] [2022-11-07 22:04:55] [dslabs.framework.Node] MessageReceive(client1 -> server, Request(...
[FINER  ] [2022-11-07 22:04:55] [dslabs.framework.testing.runner.RunState] Virtual time 0.951671ms
[FINER  ] [2022-11-07 22:04:55] [dslabs.framework.Node] MessageReceive(server -> client1, Reply(...
[FINER  ] [2022-11-07 22:04:55] [dslabs.framework.testing.runner.RunState] Virtual time 99.996195ms
[FINER  ] [2022-11-07 22:04:55] [dslabs.framework.Node] TimerReceive(-> client1, ClientTimer(...
...PASS (0.133s)
==================================================

Tests passed: 1/1
Points: 20/20 (100.00%)
Total time: 0.143s

ALL PASS
==================================================

If run multiple times, the virtual timestamps of all messages and timers stay the same. However, if the seed is customized with another command-line argument, --seed <n>, the timestamps change, and concurrent messages may be received in a different order.

Internally, dslabs.framework.testing.runner.SimulatedNetwork is added as a drop-in replacement for the plain Network. Minimal adaptation is applied to RunState, and there is no change to the labs' test cases.

Semantics and compatibility. In simulation, message delivery time is normally distributed with mean 38us and standard deviation 20us. The mean is based on a benchmark of lab 1, part 3's long-running workload: on my machine, both simulation and real execution finish ~40K requests per client with this mean. The standard deviation is tentatively chosen and may be refined with further evaluation. For timers, a 20us standard deviation is also added to model small clock skew and to prevent potential trouble if multiple events happen at exactly the same virtual time.
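To make the model concrete, here is a minimal sketch of how such a seeded latency model could be sampled; the class and method names are hypothetical, not the actual SimulatedNetwork code:

```java
import java.util.Random;

// Hypothetical sketch of the latency model described above: normally
// distributed message latency (mean 38us, stddev 20us) plus a small
// normally distributed skew for timers, all driven by one seed.
class LatencyModel {
    private static final double MEAN_DELIVERY_MS = 0.038;
    private static final double STDDEV_MS = 0.020;
    private final Random rand;

    LatencyModel(long seed) {
        this.rand = new Random(seed); // same seed => same latency sequence
    }

    // Virtual delivery delay for one message, in milliseconds (clamped at 0).
    double messageDelayMs() {
        return Math.max(0.0, MEAN_DELIVERY_MS + STDDEV_MS * rand.nextGaussian());
    }

    // Skew added to a timer's nominal length so that events rarely land on
    // exactly the same virtual time.
    double timerSkewMs() {
        return STDDEV_MS * rand.nextGaussian();
    }
}
```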

In test cases, RunState.run is called when we want to block until an ending condition is satisfied, and RunState.start is called when we want the run to proceed in the background while the current thread operates on it concurrently, e.g., sending commands and checking results. The latter case means the run test is not "isolated" from the outside world: we can never know whether the earliest event gathered from the nodes will actually be the next event, or whether the current thread will concurrently create even earlier events while the simulation proceeds.

To solve this, the simulation takes the following approach

The simulation normally takes 500ms of wall-clock time, and client requests are sent every 100ms in both wall-clock time and virtual time. On a slow machine, the simulation running in the background thread may not finish processing each 100ms virtual-time interval within 100ms of wall-clock time, but client requests are still submitted every 100ms of wall-clock time by the current thread. This results in virtual send times that are earlier than expected and breaks the determinism guarantee.

I hope that in the future a thorough refactor of the test cases can decouple the simulation fully from wall-clock time, addressing the issue above and allowing every test to advance virtual time automatically. Aside from this, the simulation should be fully compatible with the current system.

Evaluation and plans. Currently the implementation is tested against my solutions to labs 1 and 3. Later it will be applied to students' solutions from last semester to check for regressions. I think it will be ready to merge when all of the following are done:

Some potential long-term tasks:

emichael commented 1 year ago

Hi Guangda! Thanks for the PR! This is very cool. I'm absolutely willing to merge something like this in the future.

So, a while back I did a bit of my own experimenting with deterministic replay. I tried cloning all events before delivery and adding them to a per-node log. I found there were two basic problems. First, it slowed down execution and required lots of memory. Second, the logs were so long that none of the machinery around search tests was all that useful. That approach might be worth reconsidering one day, but I like this because it's much lighter weight.

I'll leave some comments in the code and mark this as a draft for now. When you're ready to try to merge it, mark it as ready for review.

emichael commented 1 year ago

Having read through the code, I think this could probably be cleaner if we refactored RunState a bit. The network interface is based around polling for events for a specific node, but it would be easier to just poll for the next event when you're running in single-threaded mode.
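To illustrate the suggestion, here is a minimal sketch of what a "poll the globally next event" interface might look like, using a priority queue ordered by virtual time; the names are hypothetical and this is not the PR's actual code:

```java
import java.util.PriorityQueue;

// Hypothetical sketch: instead of polling for events per node, keep one
// queue of pending events ordered by virtual delivery time and always
// take the earliest one in single-threaded/simulated mode.
class SimulatedEventQueue {
    record PendingEvent(double virtualTimeMs, String destination, Object payload)
            implements Comparable<PendingEvent> {
        @Override
        public int compareTo(PendingEvent other) {
            return Double.compare(virtualTimeMs, other.virtualTimeMs);
        }
    }

    private final PriorityQueue<PendingEvent> queue = new PriorityQueue<>();

    void schedule(PendingEvent event) {
        queue.add(event);
    }

    // Returns the globally earliest pending event, or null if none remain.
    PendingEvent pollNext() {
        return queue.poll();
    }
}
```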

I'm very interested to learn if this helps students. I don't have a good intuition on whether or not it will. They certainly want to know if their 1-out-of-100 failure will mean that they lose points. (I would argue that the whole point is to design systems which always work.) It's ultimately up to each instructor, but I wouldn't give students the seeds they will be tested against ahead of time. But then, in terms of uncertainty around grading outcomes, without the test seeds simulated time leaves them in the same situation they were already in.

Certainly this will help with reproducibility, but I'm also a little concerned that a simulated network might not capture the kinds of failures that happen in multi-threaded runs. One of the things I'm concerned about is that single-threaded runs will be slower and might not have nodes make enough progress to get into bad states.

I'm not sure how to solve the external thread problem. One thing we could do is simply disable virtual time for those tests. That's not satisfactory. The "right" solution is to model external actions (e.g., the thread in the lab 2 repeated crashes test) as events which get put onto the same priority queue. This would require major refactoring of the framework and many/most tests.

sgdxbc commented 1 year ago

Thanks for the detailed review and comments! I am glad to get such valuable feedback on this quick-and-dirty hack ^^

For the intuition about helping students, let me explain the rationale a little more. I totally agree that the lab assignments should require students to implement systems that always work, and that is also the only comment we have ever made on related questions. It's just that, deep down, we know it is extremely hard to verify whether a solution will always work against certain run tests. I myself had a hard time with the constant repartition run test in the Paxos lab: (1) I wrote a script to repeatedly run the test overnight to see whether it would fail again, although I was not sure what I would do the next morning if some runs actually failed, and (2) it did not fail during that repetition, but it failed again (by chance) when I later ran all the lab 3 tests in a row (a phenomenon also confirmed by students, and we still don't know the reason). It was such a nightmare, and it is still the first thing I recall about the labs.

If we have deterministic simulated run tests, then for (1), the next morning I can check the seed that caused the failure and reliably reproduce it, and (2) will never happen because execution is deterministic.

And yes, we should not disclose the grading seeds, and students still cannot verify that a solution will always work under the strictest definition of "always work" (which I think is impossible to achieve anyway). But they can know for sure that their solutions work for certain seeds, and they can decide for themselves how many seeds to test before submitting, i.e., define their own "always work". In my opinion this turns the black-box notion of "always work" into something of a gray box.

Also, even though simulation cannot eliminate uncertainty completely, it can still improve the workflow. For example, occasionally students appeal their scores because they cannot reproduce a failure on their own machines. In that situation, the best we can do is basically have the student run the tests live on their machine several times and award the points (and even this approach became less feasible because of the pandemic), but we never know whether the solution would also fail on the student's machine; the chance is just smaller. With failing seeds, this can be completely resolved.

In conclusion, as you mentioned, a run test can always be treated as a special kind of random DFS search test with a single probe and a super-long trace. The labs benefit a lot from controllable search tests, and they will surely benefit from more controllable run tests as well.

Some other points:

sgdxbc commented 1 year ago

The deterministic simulated mode for run tests is finally done and ready for use. (I will call it "replayable mode" for short.)

Since it has been some time since this pull request (PR) was opened, I would like to briefly summarize my proposal and share any new thoughts I have had on the matter.

Motivation. First of all, this is a PR focused on debuggability, not on improving how or what this framework tests. It is intended to help students who are embarking on the journey of spending a whole week on the repeated partition test, and hopefully it makes their lives easier, primarily by addressing the following two problems:

It also has some additional uses, such as grading (after regression testing, discussed below), but most importantly it is a helper tool that lets students debug their solutions less painfully, especially on heavyweight run tests. I believe this tool will be helpful.

The following table compares the three run modes:

| Mode | Can explore all failures | Deterministic/replayable |
|---|---|---|
| Multithreaded | Probably | No |
| Single-threaded | No | Yes, if randomness is fixed |
| Replayable | Yes, with a correct time model | Yes |

Example. (Test summary omitted.)

$ ./run-tests.py --lab 0 --test 1 -g FINER
--------------------------------------------------
TEST 1: Single client ping test [RUN] (0pts)

[FINER  ] [2022-12-24 16:45:43] [dslabs.framework.Node] MessageReceive(client1 -> pingserver, PingRequest(ping=PingApplication.Ping(value=Hello, World!)))
[FINER  ] [2022-12-24 16:45:43] [dslabs.framework.Node] TimerReceive(-> client1, PingTimer(ping=PingApplication.Ping(value=Hello, World!)))
[FINER  ] [2022-12-24 16:45:43] [dslabs.framework.Node] MessageReceive(client1 -> pingserver, PingRequest(ping=PingApplication.Ping(value=Hello, World!)))
[FINER  ] [2022-12-24 16:45:43] [dslabs.framework.Node] MessageReceive(pingserver -> client1, PongReply(pong=PingApplication.Pong(value=Hello, World!)))
[FINER  ] [2022-12-24 16:45:43] [dslabs.framework.Node] MessageReceive(pingserver -> client1, PongReply(pong=PingApplication.Pong(value=Hello, World!)))
...PASS (0.135s)

$ ./run-tests.py --lab 0 --test 1 --simulated -g FINER
--------------------------------------------------
TEST 1: Single client ping test [RUN] (0pts)

[FINER  ] [2022-12-24 16:46:00] [dslabs.framework.testing.runner.SimulatedImpl] 0.706665 ms in simulation ...
[FINER  ] [2022-12-24 16:46:00] [dslabs.framework.Node] MessageReceive(client1 -> pingserver, PingRequest(ping=PingApplication.Ping(value=Hello, World!)))
[FINER  ] [2022-12-24 16:46:00] [dslabs.framework.testing.runner.SimulatedImpl] 1.659471 ms in simulation ...
[FINER  ] [2022-12-24 16:46:00] [dslabs.framework.Node] MessageReceive(pingserver -> client1, PongReply(pong=PingApplication.Pong(value=Hello, World!)))
...PASS (0.13s)

$ ./run-tests.py --lab 0 --test 1 --simulated -g FINEST
--------------------------------------------------
TEST 1: Single client ping test [RUN] (0pts)

[FINEST ] [2022-12-24 16:46:03] [dslabs.framework.Node] MessageSend(client1 -> pingserver, PingRequest(ping=PingApplication.Ping(value=Hello, World!)))
[FINEST ] [2022-12-24 16:46:03] [dslabs.framework.testing.runner.SimulatedImpl]  ... will happen at 0.706665ms = 0.000000ms (now) + 0.644760ms (processed) + 0.061905ms (message latency)
[FINEST ] [2022-12-24 16:46:03] [dslabs.framework.Node] TimerSet(-> client1, PingTimer(ping=PingApplication.Ping(value=Hello, World!)))
[FINEST ] [2022-12-24 16:46:03] [dslabs.framework.testing.runner.SimulatedImpl]  ... will happen at 11.174820ms = 0.000000ms (now) + 1.174955ms (processed) + 9.999866ms (timer length)
[FINER  ] [2022-12-24 16:46:03] [dslabs.framework.testing.runner.SimulatedImpl] 0.706665 ms in simulation ...
[FINER  ] [2022-12-24 16:46:03] [dslabs.framework.Node] MessageReceive(client1 -> pingserver, PingRequest(ping=PingApplication.Ping(value=Hello, World!)))
[FINEST ] [2022-12-24 16:46:03] [dslabs.framework.Node] MessageSend(pingserver -> client1, PongReply(pong=PingApplication.Pong(value=Hello, World!)))
[FINEST ] [2022-12-24 16:46:03] [dslabs.framework.testing.runner.SimulatedImpl]  ... will happen at 1.659471ms = 0.706665ms (now) + 0.895460ms (processed) + 0.057346ms (message latency)
[FINER  ] [2022-12-24 16:46:03] [dslabs.framework.testing.runner.SimulatedImpl] 1.659471 ms in simulation ...
[FINER  ] [2022-12-24 16:46:03] [dslabs.framework.Node] MessageReceive(pingserver -> client1, PongReply(pong=PingApplication.Pong(value=Hello, World!)))
...PASS (0.148s)

Design. The goal is to keep the implementation modular, clearly separated, and fully compatible with the tests. I apologize for not making use of subtyping (I dislike it). Instead, I put all of the implementation in a SimulatedImpl class and patched the other classes to inject it. The design is straightforward, but we can improve the hierarchy if necessary.

Previously I described the problem with RunState.start; I solved it in the following way. Let's call the thread that advances virtual time and dispatches events the simulation thread (similar to mainThread in RunState's single-threaded mode), and the other threads that may run between RunState.start and RunState.stop the interactive threads. The approach is to make the simulation thread and the interactive threads temporally mutually exclusive. The simulation thread pauses, neither advancing virtual time nor dispatching events, as long as at least one interactive thread is not in the WAITING state. If a dispatched event causes interactive threads to wake up, for example when getResult returns, the simulation thread pauses again until all interactive threads become WAITING again, or until RunState.stop is called.
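As a rough illustration of the gating described above, the check might look something like the following; this is a simplified, hypothetical sketch rather than the actual SimulatedImpl logic:

```java
import java.util.List;

// Hypothetical sketch: the simulation thread only advances virtual time and
// dispatches events while every interactive thread is parked (WAITING) or done.
class InteractiveThreadGate {
    private final List<Thread> interactiveThreads;

    InteractiveThreadGate(List<Thread> interactiveThreads) {
        this.interactiveThreads = interactiveThreads;
    }

    // True if no interactive thread can currently inject new events.
    boolean safeToDispatch() {
        for (Thread t : interactiveThreads) {
            Thread.State s = t.getState();
            if (s != Thread.State.WAITING && s != Thread.State.TERMINATED) {
                return false;
            }
        }
        return true;
    }

    // Spin until all interactive threads are quiescent again.
    void awaitQuiescence() {
        while (!safeToDispatch()) {
            Thread.onSpinWait();
        }
    }
}
```

Note that a thread inside Thread.sleep is in TIMED_WAITING rather than WAITING, which is one reason the plain sleep calls in the tests need special handling, as described next.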

This handles the case where interactive threads block on Object.wait, but it is not easy to patch Thread.sleep. As a result, I had to globally substitute Thread.sleep with runState.sleep in all tests, which is the only major modification required outside the framework.

I also made some miscellaneous changes, such as fixing randomness in various places, and left comments explaining detailed design choices.

Modeling latency and processing time. Previously, only message delivery latency and timer length were counted in virtual time. However, if we don't account for the time nodes spend processing messages and timers, the simulated result will diverge from a real run for an inefficient implementation. On the other hand, processing time is not deterministic, so we cannot simply add it to virtual time.
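For reference, the FINEST log above breaks each scheduled event down as now + processed + latency; here is a minimal sketch of that composition (the names are hypothetical, and how the processed term is derived from the time model is simplified away here):

```java
// Hypothetical sketch of scheduling in virtual time, following the
// "now (virtual time) + processed + message latency" breakdown in the log.
class VirtualClock {
    private double nowMs = 0.0;

    double nowMs() {
        return nowMs;
    }

    // Advance to the delivery time of the event being dispatched.
    void advanceTo(double timeMs) {
        nowMs = Math.max(nowMs, timeMs);
    }

    // Virtual time at which an outgoing message arrives: the current virtual
    // time, plus the modeled processing time the sender spent handling the
    // triggering event, plus the sampled network latency.
    double deliveryTimeMs(double processedMs, double latencyMs) {
        return nowMs + processedMs + latencyMs;
    }
}
```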

The current time model is defined in SimulatedImpl. I benchmarked three long-running workload tests by printing the number of requests finished by each client (Intel i7-8750H, Windows 10 22H2, WSL Ubuntu 20.04):

| Test | | -g FINER 2> 1.txt | -g FINEST 2> 1.txt | --simulated |
|---|---|---|---|---|
| --lab 1 --part 3 --test 6 | 373745 | 26577 | 13210 | 23036 |
| --lab 2 --part 2 --test 14 | 114722 | 16261 | 7386 | 6583 |
| --lab 3 --test 17 | 2741 | 2467 | 1832 | 2215 |

Debugging experience. During development, I completed the solutions for labs 3 and 4 using the replayable mode. I did all of the run-test debugging in replayable mode with the default seed, and the run tests in real mode never failed once they passed in replayable mode. The logging traces look reasonable and took less time to capture.

Regression test for grading. I plan to add a regression script that will run a batch of solutions in both the replayable and real modes side by side. For each solution it will run real mode 3 times and replayable mode with 10 different seeds. A test is considered failed in real/simulated mode if at least one run fails. The script will look for any case that fails in real mode but passes in simulated mode. If the regression passes for a sufficient number of solutions from past semesters, the replayable mode may be suitable for use in grading.

Alternatively, we could grade in real mode and, if a student requests clarification, search for a seed that also fails the test and provide it to the student.

Other features.

Conclusion. I hope this work is useful. I still need to write the regression script and handout-side instructions, but the implementation itself is stable and ready for code review. I plan to maintain the replayable mode downstream and introduce it to students in January, and I would be glad to see it merged, tested more widely, and helping more students. Alternatively, I propose merging just the runState.sleep change and the test modifications, to make it easier for me to merge from upstream in the future. Thanks for reading.

sgdxbc commented 1 year ago

It seems nextLong(a, b) was only added in Java 17, and I was developing on Java 19. (I would still expect the compiler to reject such backward-incompatible code.) Will fix soon.

sgdxbc commented 1 year ago
Updated time model benchmark:

| Test | -g FINER 2> /dev/null | --simulated |
|---|---|---|
| --lab 1 --part 3 --test 6 | 210202 | 299932 |
| --lab 2 --part 2 --test 14 | 117130 | 78805 |
| --lab 3 --test 17 | 2663 (same without log) | 2886 |