emichael / dslabs

Distributed Systems Labs and Framework
https://ellismichael.com/dslabs/

Run test by simulating #41

Open sgdxbc opened 1 year ago

sgdxbc commented 1 year ago

This is an early-stage, ongoing effort to support deterministic, simulated execution of run tests. I would like to hear comments before moving on, e.g., whether this feature is wanted and would be merged. Planned tasks and progress are listed at the end.

Motivation. In our experience teaching CS5223 at NUS, one critical issue we faced is the poor reproducibility of complicated run tests. Specifically:

Introducing deterministic simulation may help a lot. With the same random seed, a solution will always reproduce the same result (as long as the solution itself is deterministic), independent of machine performance and of whether logging is enabled. For grading, we can simply select several random seeds for each submission and, for each failed solution, inform the student of the failing seed(s), so they can reliably confirm the failure.

How is this different from single-threaded execution? The current simulation implementation uses a single thread and shares a lot in common with single-threaded mode; in fact, simulation is currently implemented as an "enhanced" single-threaded mode. The key differences include:

The interface. Simulation can be enabled by passing the --simulated flag to the run-tests.py script. All run tests will then be executed in simulation mode, with additional logging:

$ python3 run-tests.py --lab 1 --part 2 --test 2 --simulated -g FINER
--------------------------------------------------
TEST 2.2: Single client basic operations [RUN] (20pts)

[FINER  ] [2022-11-07 22:04:55] [dslabs.framework.testing.runner.RunState] Virtual time 0.042328000000000005ms
[FINER  ] [2022-11-07 22:04:55] [dslabs.framework.Node] MessageReceive(client1 -> server, Request(...
[FINER  ] [2022-11-07 22:04:55] [dslabs.framework.testing.runner.RunState] Virtual time 0.066084ms
[FINER  ] [2022-11-07 22:04:55] [dslabs.framework.Node] MessageReceive(server -> client1, Reply(...
[FINER  ] [2022-11-07 22:04:55] [dslabs.framework.testing.runner.RunState] Virtual time 0.09002500000000001ms
[FINER  ] [2022-11-07 22:04:55] [dslabs.framework.Node] MessageReceive(client1 -> server, Request(...
# lines omitted
[FINER  ] [2022-11-07 22:04:55] [dslabs.framework.Node] MessageReceive(client1 -> server, Request(...
[FINER  ] [2022-11-07 22:04:55] [dslabs.framework.testing.runner.RunState] Virtual time 0.951671ms
[FINER  ] [2022-11-07 22:04:55] [dslabs.framework.Node] MessageReceive(server -> client1, Reply(...
[FINER  ] [2022-11-07 22:04:55] [dslabs.framework.testing.runner.RunState] Virtual time 99.996195ms
[FINER  ] [2022-11-07 22:04:55] [dslabs.framework.Node] TimerReceive(-> client1, ClientTimer(...
...PASS (0.133s)
==================================================

Tests passed: 1/1
Points: 20/20 (100.00%)
Total time: 0.143s

ALL PASS
==================================================

If run multiple times, the virtual timestamps of all messages and timers stay the same. However, if the seed is customized with another command-line argument, --seed <n>, the timestamps change, and concurrent messages may be received in a different order.

Internally, dslabs.framework.testing.runner.SimulatedNetwork is added as a drop-in replacement for the plain Network. Minimal adaptation is applied to RunState, and there is no change to the labs' test cases.

Semantics and compatibility. In simulation, message delivery time is normally distributed with mean 38us and standard deviation 20us. The mean is based on a benchmark of lab 1, part 3's long-running workload: on my machine, both simulation and real execution finish ~40K requests per client with this mean. The standard deviation is tentatively chosen and may be refined with further evaluation. For timers, a 20us standard deviation is also added to model small clock skew and to prevent potential trouble if multiple events happen at exactly the same virtual time.
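To make the model concrete, here is a minimal sketch of how such a seeded latency model could be sampled; the class and method names are hypothetical, not the actual SimulatedNetwork code:

```java
import java.util.Random;

// Hypothetical sketch of the latency model described above: normally
// distributed message latency (mean 38us, stddev 20us) plus a small
// normally distributed skew for timers, all driven by one seed.
class LatencyModel {
    private static final double MEAN_DELIVERY_MS = 0.038;
    private static final double STDDEV_MS = 0.020;
    private final Random rand;

    LatencyModel(long seed) {
        this.rand = new Random(seed); // same seed => same latency sequence
    }

    // Virtual delivery delay for one message, in milliseconds (clamped at 0).
    double messageDelayMs() {
        return Math.max(0.0, MEAN_DELIVERY_MS + STDDEV_MS * rand.nextGaussian());
    }

    // Skew added to a timer's nominal length so that events rarely land on
    // exactly the same virtual time.
    double timerSkewMs() {
        return STDDEV_MS * rand.nextGaussian();
    }
}
```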

In test cases, RunState.run is called when we want to block until an ending condition is satisfied, and RunState.start is called when we want the run to proceed in the background while the current thread operates on it concurrently, e.g., sending commands and checking results. The latter case means the run test is not "isolated" from the outside world: we can never know whether the earliest event gathered from the nodes will actually be the next event, or whether the current thread will concurrently create even earlier events while the simulation proceeds.

To solve this, the simulation takes the following approach

The simulation normally takes 500ms of wall-clock time, and client requests are sent every 100ms in both wall-clock time and virtual time. On a slow machine, the simulation running in the background thread may not finish processing each 100ms virtual-time interval within 100ms of wall-clock time, but client requests are still submitted every 100ms of wall-clock time by the current thread. This results in virtual send times that are earlier than expected and breaks the determinism guarantee.

I hope that in the future a thorough refactor of the test cases can decouple the simulation fully from wall-clock time, addressing the issue above and allowing every test to advance virtual time automatically. Aside from this, the simulation should be fully compatible with the current system.

Evaluation and plans. Currently the implementation is tested against my solutions to labs 1 and 3. Later it will be applied to students' solutions from last semester to check for regressions. I think it will be ready to merge when all of the following are done:

Some potential long-term tasks:

emichael commented 1 year ago

Hi Guangda! Thanks for the PR! This is very cool. I'm absolutely willing to merge something like this in the future.

So, a while back I did a bit of my own experimenting with deterministic replay. I tried cloning all events before delivery and adding them to a per-node log. I found there were two basic problems. First, it slowed down execution and required lots of memory. Second, the logs were so long that none of the machinery around search tests was all that useful. That approach might be worth reconsidering one day, but I like this because it's much lighter weight.

I'll leave some comments in the code and mark this as a draft for now. When you're ready to try to merge it, mark it as ready for review.

emichael commented 1 year ago

Having read through the code, I think this could probably be cleaner if we refactored RunState a bit. The network interface is based around polling for events for a specific node, but it would be easier to just poll for the next event when you're running in single-threaded mode.
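To illustrate the suggestion, here is a minimal sketch of what a "poll the globally next event" interface might look like, using a priority queue ordered by virtual time; the names are hypothetical and this is not the PR's actual code:

```java
import java.util.PriorityQueue;

// Hypothetical sketch: instead of polling for events per node, keep one
// queue of pending events ordered by virtual delivery time and always
// take the earliest one in single-threaded/simulated mode.
class SimulatedEventQueue {
    record PendingEvent(double virtualTimeMs, String destination, Object payload)
            implements Comparable<PendingEvent> {
        @Override
        public int compareTo(PendingEvent other) {
            return Double.compare(virtualTimeMs, other.virtualTimeMs);
        }
    }

    private final PriorityQueue<PendingEvent> queue = new PriorityQueue<>();

    void schedule(PendingEvent event) {
        queue.add(event);
    }

    // Returns the globally earliest pending event, or null if none remain.
    PendingEvent pollNext() {
        return queue.poll();
    }
}
```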

I'm very interested to learn if this helps students. I don't have a good intuition on whether or not it will. They certainly want to know if their 1-out-of-100 failure will mean that they lose points. (I would argue that the whole point is to design systems which always work.) It's ultimately up to each instructor, but I wouldn't give students the seeds they will be tested against ahead of time. But then, in terms of uncertainty around grading outcomes, without the test seeds simulated time leaves them in the same situation they were already in.

Certainly this will help with reproducibility, but I'm also a little concerned that a simulated network might not capture the kinds of failures that happen in multi-threaded runs. One of the things I'm concerned about is that single-threaded runs will be slower and might not have nodes make enough progress to get into bad states.

I'm not sure how to solve the external thread problem. One thing we could do is simply disable virtual time for those tests. That's not satisfactory. The "right" solution is to model external actions (e.g., the thread in the lab 2 repeated crashes test) as events which get put onto the same priority queue. This would require major refactoring of the framework and many/most tests.

sgdxbc commented 1 year ago

Thanks for the detailed review and comments! I am glad to get such valuable feedback on this quick-and-dirty hack ^^

For the intuition about helping students, let me explain the rationale a little more. I totally agree that the lab assignments should require students to implement systems that always work, and that is also the only comment we have ever made on related questions. It's just that, deep down, we know it is extremely hard to verify whether a solution will always work against certain run tests. I myself had a hard time with the constant repartition run test in the Paxos lab: (1) I wrote a script to repeatedly run the test overnight to see whether it would fail again, although I was not sure what I would do the next morning if some runs actually failed, and (2) it did not fail during that repetition, but it failed again (by chance) when I later ran all the lab 3 tests in a row (a phenomenon also confirmed by students, and we still don't know the reason). It was such a nightmare, and it is still the first thing I recall about the labs.

If we have deterministic simulated run tests, then for (1), the next morning I can check the seed that caused the failure and reliably reproduce it, and (2) will never happen because execution is deterministic.

And yes, we should not disclose the grading seeds, and students still cannot verify that a solution will always work under the strictest definition of "always work" (which I think is impossible to achieve anyway). But they can know for sure that their solutions work for certain seeds, and they can decide for themselves how many seeds to test before submitting, i.e., define their own "always work". In my opinion this turns the black-box notion of "always work" into something of a gray box.

Also, even though simulation cannot eliminate uncertainty completely, it can still improve the workflow. For example, occasionally students appeal their scores because they cannot reproduce a failure on their own machines. In that situation, the best we can do is basically have the student run the tests live on their machine several times and award the points (and even this approach became less feasible because of the pandemic), but we never know whether the solution would also fail on the student's machine; the chance is just smaller. With failing seeds, this can be completely resolved.

In conclusion, as you mentioned, a run test can always be treated as a special kind of random DFS search test with a single probe and a super-long trace. The labs benefit a lot from controllable search tests, and they will surely benefit from more controllable run tests as well.

Some other points:

sgdxbc commented 1 year ago

The deterministic simulated mode for run tests is finally done and ready for use. (I will call it "replayable mode" for short.)

Since it has been some time since this pull request (PR) was opened, I would like to briefly summarize my proposal and share any new thoughts I have had on the matter.

Motivation. First of all, this is a PR focused on debuggability, not on improving how or what this framework tests. It is intended to help students who are embarking on the journey of spending a whole week on the repeated partition test, and hopefully it makes their lives easier, primarily by addressing the following two problems:

It also has some additional uses, such as grading (after regression testing, discussed below), but most importantly it is a helper tool that lets students debug their solutions less painfully, especially on heavyweight run tests. I believe this tool will be helpful.

The following table compares the three run modes:

| Mode | Can explore all failures | Deterministic/replayable |
|---|---|---|
| Multithreaded | Probably | No |
| Single-threaded | No | Yes, if randomness is fixed |
| Replayable | Yes, with a correct time model | Yes |

Example. (Test summary omitted.)

$ ./run-tests.py --lab 0 --test 1 -g FINER
--------------------------------------------------
TEST 1: Single client ping test [RUN] (0pts)

[FINER  ] [2022-12-24 16:45:43] [dslabs.framework.Node] MessageReceive(client1 -> pingserver, PingRequest(ping=PingApplication.Ping(value=Hello, World!)))
[FINER  ] [2022-12-24 16:45:43] [dslabs.framework.Node] TimerReceive(-> client1, PingTimer(ping=PingApplication.Ping(value=Hello, World!)))
[FINER  ] [2022-12-24 16:45:43] [dslabs.framework.Node] MessageReceive(client1 -> pingserver, PingRequest(ping=PingApplication.Ping(value=Hello, World!)))
[FINER  ] [2022-12-24 16:45:43] [dslabs.framework.Node] MessageReceive(pingserver -> client1, PongReply(pong=PingApplication.Pong(value=Hello, World!)))
[FINER  ] [2022-12-24 16:45:43] [dslabs.framework.Node] MessageReceive(pingserver -> client1, PongReply(pong=PingApplication.Pong(value=Hello, World!)))
...PASS (0.135s)

$ ./run-tests.py --lab 0 --test 1 --simulated -g FINER
--------------------------------------------------
TEST 1: Single client ping test [RUN] (0pts)

[FINER  ] [2022-12-24 16:46:00] [dslabs.framework.testing.runner.SimulatedImpl] 0.706665 ms in simulation ...
[FINER  ] [2022-12-24 16:46:00] [dslabs.framework.Node] MessageReceive(client1 -> pingserver, PingRequest(ping=PingApplication.Ping(value=Hello, World!)))
[FINER  ] [2022-12-24 16:46:00] [dslabs.framework.testing.runner.SimulatedImpl] 1.659471 ms in simulation ...
[FINER  ] [2022-12-24 16:46:00] [dslabs.framework.Node] MessageReceive(pingserver -> client1, PongReply(pong=PingApplication.Pong(value=Hello, World!)))
...PASS (0.13s)

$ ./run-tests.py --lab 0 --test 1 --simulated -g FINEST
--------------------------------------------------
TEST 1: Single client ping test [RUN] (0pts)

[FINEST ] [2022-12-24 16:46:03] [dslabs.framework.Node] MessageSend(client1 -> pingserver, PingRequest(ping=PingApplication.Ping(value=Hello, World!)))
[FINEST ] [2022-12-24 16:46:03] [dslabs.framework.testing.runner.SimulatedImpl]  ... will happen at 0.706665ms = 0.000000ms (now) + 0.644760ms (processed) + 0.061905ms (message latency)
[FINEST ] [2022-12-24 16:46:03] [dslabs.framework.Node] TimerSet(-> client1, PingTimer(ping=PingApplication.Ping(value=Hello, World!)))
[FINEST ] [2022-12-24 16:46:03] [dslabs.framework.testing.runner.SimulatedImpl]  ... will happen at 11.174820ms = 0.000000ms (now) + 1.174955ms (processed) + 9.999866ms (timer length)
[FINER  ] [2022-12-24 16:46:03] [dslabs.framework.testing.runner.SimulatedImpl] 0.706665 ms in simulation ...
[FINER  ] [2022-12-24 16:46:03] [dslabs.framework.Node] MessageReceive(client1 -> pingserver, PingRequest(ping=PingApplication.Ping(value=Hello, World!)))
[FINEST ] [2022-12-24 16:46:03] [dslabs.framework.Node] MessageSend(pingserver -> client1, PongReply(pong=PingApplication.Pong(value=Hello, World!)))
[FINEST ] [2022-12-24 16:46:03] [dslabs.framework.testing.runner.SimulatedImpl]  ... will happen at 1.659471ms = 0.706665ms (now) + 0.895460ms (processed) + 0.057346ms (message latency)
[FINER  ] [2022-12-24 16:46:03] [dslabs.framework.testing.runner.SimulatedImpl] 1.659471 ms in simulation ...
[FINER  ] [2022-12-24 16:46:03] [dslabs.framework.Node] MessageReceive(pingserver -> client1, PongReply(pong=PingApplication.Pong(value=Hello, World!)))
...PASS (0.148s)

Design. The goal is to keep the implementation modular, clearly separated, and fully compatible with the tests. I apologize for not making use of subtyping (I dislike it). Instead, I put all of the implementation in a SimulatedImpl class and patched the other classes to inject it. The design is straightforward, but we can improve the hierarchy if necessary.

Previously I described the problem with RunState.start; I solved it in the following way. Let's call the thread that advances virtual time and dispatches events the simulation thread (similar to mainThread in RunState's single-threaded mode), and the other threads that may run between RunState.start and RunState.stop the interactive threads. The approach is to make the simulation thread and the interactive threads temporally mutually exclusive. The simulation thread pauses, neither advancing virtual time nor dispatching events, as long as at least one interactive thread is not in the WAITING state. If a dispatched event causes interactive threads to wake up, for example when getResult returns, the simulation thread pauses again until all interactive threads become WAITING again, or until RunState.stop is called.
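As a rough illustration of the gating described above, the check might look something like the following; this is a simplified, hypothetical sketch rather than the actual SimulatedImpl logic:

```java
import java.util.List;

// Hypothetical sketch: the simulation thread only advances virtual time and
// dispatches events while every interactive thread is parked (WAITING) or done.
class InteractiveThreadGate {
    private final List<Thread> interactiveThreads;

    InteractiveThreadGate(List<Thread> interactiveThreads) {
        this.interactiveThreads = interactiveThreads;
    }

    // True if no interactive thread can currently inject new events.
    boolean safeToDispatch() {
        for (Thread t : interactiveThreads) {
            Thread.State s = t.getState();
            if (s != Thread.State.WAITING && s != Thread.State.TERMINATED) {
                return false;
            }
        }
        return true;
    }

    // Spin until all interactive threads are quiescent again.
    void awaitQuiescence() {
        while (!safeToDispatch()) {
            Thread.onSpinWait();
        }
    }
}
```

Note that a thread inside Thread.sleep is in TIMED_WAITING rather than WAITING, which is one reason the plain sleep calls in the tests need special handling, as described next.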

This handles the case where interactive threads block on Object.wait, but it is not easy to patch Thread.sleep. As a result, I had to globally substitute Thread.sleep with runState.sleep in all tests, which is the only major modification required outside the framework.

I also made some miscellaneous changes, such as fixing randomness in various places, and left comments explaining detailed design choices.

Modeling latency and processing time. Previously, only message delivery latency and timer length were counted in virtual time. However, if we don't account for the time nodes spend processing messages and timers, the simulated result will diverge from a real run for an inefficient implementation. On the other hand, processing time is not deterministic, so we cannot simply add it to virtual time.
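For reference, the FINEST log above breaks each scheduled event down as now + processed + latency; here is a minimal sketch of that composition (the names are hypothetical, and how the processed term is derived from the time model is simplified away here):

```java
// Hypothetical sketch of scheduling in virtual time, following the
// "now (virtual time) + processed + message latency" breakdown in the log.
class VirtualClock {
    private double nowMs = 0.0;

    double nowMs() {
        return nowMs;
    }

    // Advance to the delivery time of the event being dispatched.
    void advanceTo(double timeMs) {
        nowMs = Math.max(nowMs, timeMs);
    }

    // Virtual time at which an outgoing message arrives: the current virtual
    // time, plus the modeled processing time the sender spent handling the
    // triggering event, plus the sampled network latency.
    double deliveryTimeMs(double processedMs, double latencyMs) {
        return nowMs + processedMs + latencyMs;
    }
}
```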

The current time model is defined in SimulatedImpl. I benchmarked three long-running workload tests by printing the number of requests finished by each client (Intel i7-8750H, Windows 10 22H2, WSL Ubuntu 20.04):

| Test | | -g FINER 2> 1.txt | -g FINEST 2> 1.txt | --simulated |
|---|---|---|---|---|
| --lab 1 --part 3 --test 6 | 373745 | 26577 | 13210 | 23036 |
| --lab 2 --part 2 --test 14 | 114722 | 16261 | 7386 | 6583 |
| --lab 3 --test 17 | 2741 | 2467 | 1832 | 2215 |

Debugging experience. During development, I completed the solutions for labs 3 and 4 using the replayable mode. I did all of the run-test debugging in replayable mode with the default seed, and the run tests in real mode never failed once they passed in replayable mode. The logging traces look reasonable and took less time to capture.

Regression test for grading. I plan to add a regression script that will run a batch of solutions in both the replayable and real modes side by side. For each solution it will run real mode 3 times and replayable mode with 10 different seeds. A test is considered failed in real/simulated mode if at least one run fails. The script will look for any case that fails in real mode but passes in simulated mode. If the regression passes for a sufficient number of solutions from past semesters, the replayable mode may be suitable for use in grading.

Alternatively, we could grade in real mode and, if a student requests clarification, search for a seed that also fails the test and provide it to the student.

Other features.

Conclusion. I hope this work is useful. I still need to write the regression script and handout-side instructions, but the implementation itself is stable and ready for code review. I plan to maintain the replayable mode downstream and introduce it to students in January, and I would be glad to see it merged, tested more widely, and helping more students. Alternatively, I propose merging just the runState.sleep change and the test modifications, to make it easier for me to merge from upstream in the future. Thanks for reading.

sgdxbc commented 1 year ago

It seems nextLong(a, b) was only added in Java 17, and I was developing on Java 19. (I would still expect the compiler to reject such backward-incompatible code.) Will fix soon.

sgdxbc commented 1 year ago
Updated time model benchmark:

| Test | -g FINER 2> /dev/null | --simulated |
|---|---|---|
| --lab 1 --part 3 --test 6 | 210202 | 299932 |
| --lab 2 --part 2 --test 14 | 117130 | 78805 |
| --lab 3 --test 17 | 2663 (same without log) | 2886 |