Open cgolubi1 opened 6 months ago
Another thought to integrate into the above list: it would be great to have less manual work (ideally none) needed to turn `random_ai`-generated tests into responder tests.
A silly bug: when a game runs to 200 rounds (e.g. Echo vs IIconfused) and is cancelled, RandomAI falls over.
I think what's happening is simply that the Python process gets OOM-killed while trying to pull the entire game action log into memory and write the final game state.
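If that's the cause, one mitigation is for the replay client to pull the action log in bounded chunks and stream it to disk rather than holding the whole thing in memory. A minimal sketch, assuming a hypothetical `fetch_log_entries(game_id, offset=..., limit=...)` call (the real responder interface and parameter names may differ):

```python
def iter_action_log(fetch_log_entries, game_id, chunk_size=500):
    """Yield action-log entries in bounded chunks instead of loading the
    whole log at once (fetch_log_entries is a hypothetical API call; the
    real interface and parameter names may differ)."""
    offset = 0
    while True:
        chunk = fetch_log_entries(game_id, offset=offset, limit=chunk_size)
        if not chunk:
            return
        yield from chunk
        offset += len(chunk)


def write_final_state(out_path, entries):
    """Stream entries to disk one at a time so a 200-round game doesn't
    have to fit in memory before anything is written."""
    with open(out_path, "w") as handle:
        for entry in entries:
            handle.write(f"{entry}\n")
```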
The goal of replay testing is to throw compute time rather than human time at finding logic bugs / breaking changes in new code, even if the tester doesn't know what kind of bug they're looking for. That would be easier and more reliable if I sanded down some current rough edges in the replay test rig. Specific things I have in mind:
* Some of the values `replay_loop` always uses on the replay site are baked into the script. Those should be CLI flags.
* `replay_loop` should have help/usage text describing all the CLI flags. (See the first sketch after this list.)
* There should be a helper that can generate `/buttonmen/test/src/api/responder99Test.php` from whatever's in the output directory right now, and one that can execute phpunit based on that. No reason to copy-paste those lists of commands while iterating on a replay test. (See the second sketch after this list.)
* `replay_loop` should exercise optional behaviors even when they aren't requested --- that would catch problems like introduction of unmodelled randomization that are rare, but important. Currently, those behaviors are tested only if a particular CLI flag is selected --- instead they should always be tested a small percentage of the time. (See the third sketch after this list.)
* `CustomBM` should be used a fraction of the time.
* What `replay_loop` tests by default should be documented for quick reference, so we can easily know what's been tested for a particular branch.
* One piece of `replay_loop`/`random_ai` behavior that should work better: turning a game played by `random_ai` into a responder test that can be committed to the codebase.
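For the first two items, here's a rough sketch of what the `replay_loop` option handling could look like with `argparse`; the flag names and defaults are illustrative, not the script's real interface:

```python
import argparse


def parse_args(argv=None):
    """Illustrative CLI for replay_loop: values that are currently fixed in
    the script become flags, and --help documents all of them."""
    parser = argparse.ArgumentParser(
        prog="replay_loop",
        description="Replay recorded games against a test site and record responder tests.",
    )
    # Hypothetical flags -- the real option names may differ.
    parser.add_argument("--site-url", default="http://localhost/api/",
                        help="responder endpoint of the replay site")
    parser.add_argument("--output-dir", default="output",
                        help="directory where generated test files are written")
    parser.add_argument("--games", type=int, default=10,
                        help="number of games to replay per run")
    return parser.parse_args(argv)


if __name__ == "__main__":
    print(parse_args())
```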
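For the responder99Test.php item, the helper could be as small as the sketch below; it assumes the replay run already leaves a single generated PHP test in the output directory, and the paths and phpunit invocation are assumptions about the local checkout:

```python
import shutil
import subprocess
from pathlib import Path

# Assumed locations -- adjust for the real checkout layout.
OUTPUT_DIR = Path("output")
TEST_PATH = Path("/buttonmen/test/src/api/responder99Test.php")


def install_and_run(generated_name="responder99Test.php"):
    """Copy the freshly generated replay test into the test tree and run
    phpunit on just that file, instead of copy-pasting the commands."""
    shutil.copy(OUTPUT_DIR / generated_name, TEST_PATH)
    subprocess.run(["phpunit", str(TEST_PATH)], check=True)


if __name__ == "__main__":
    install_and_run()
```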
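For the "small percentage of the time" items, one way to keep the existing CLI flags while still exercising the optional behaviors on unflagged runs; the probability and the `CustomBM` example are illustrative:

```python
import random


def should_enable(flag_value, default_probability=0.05, rng=random):
    """Enable an optional behavior when its CLI flag is set; otherwise roll
    a small-probability die so the behavior (e.g. picking CustomBM) still
    gets exercised on a fraction of ordinary runs."""
    if flag_value:
        return True
    return rng.random() < default_probability


# Example: decide once per game whether to exercise the optional behavior.
use_custom_recipe = should_enable(flag_value=False)
```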