Typical uses of the roachtest monitor are as follows:
m := c.NewMonitor()
m.Go(func(context.Context) error {
// user logic...
})
m.Wait()
This is a summary of the two main functions provided by the monitor:
Go: starts a goroutine. This is useful because we never want bare go calls in roachtests (or anywhere, really -- see #58164); having automated panic recovery is nice to have for all tests. Note that calling Go will start the goroutine immediately.
Wait: serves two purposes. First, it waits for all goroutines started with Go to return (canceling on error like errgroup). In addition, it also starts a "processs monitor" (hence the name) that will watch for unexpected cockroach process deaths in the nodes it monitors.
This API is less than ideal for the following reasons:
Monitoring processes and managing goroutines are two completely separate functionalities. It is unexpected that we will only monitor node deaths if we also have some goroutine to run. We recently had to patch this module in order to allow callers to just monitor nodes, without having to run a function to achieve that goal (#108594).
We will only actually monitor nodes if the test calls Wait() on the monitor. While this is good practice and should happen every time, this is not enforced and tests can continue to pass even if they don't. In addition, node deaths that happen before the test calls Wait() go completely unnoticed.
Because goroutine management and process monitoring are coupled, it makes composition really difficult. Suppose a function A creates a new monitor in order to run a background task. Suppose also that we're writing a new test that wants to call A to reuse its functionality. If our test B wishes to restart nodes, we now have to find a way to pass that information down to A so that A's monitor can properly handle that node death event. In the general case, if our test involves creating N monitors (directly or indirectly), and we want to perform a node restart, we need to find a way to pass that information to all N monitors.
Proposal
A proposed way to deal with this is to promote node monitoring to the test runner as a default. Every test should have an active "node monitor" that checks for unexpected node deaths. We could have a single t.ExpectNodeDeath(n1) to indicate that this test expects n1 to die, and that can happen only once. What's currently the Monitor will be deprecated in tests in favour of a simple goroutine runner with builtin panic handling.
Typical uses of the roachtest monitor are as follows:
This is a summary of the two main functions provided by the monitor:
Go
: starts a goroutine. This is useful because we never want barego
calls in roachtests (or anywhere, really -- see #58164); having automated panic recovery is nice to have for all tests. Note that callingGo
will start the goroutine immediately.Wait
: serves two purposes. First, it waits for all goroutines started withGo
to return (canceling on error likeerrgroup
). In addition, it also starts a "processs monitor" (hence the name) that will watch for unexpected cockroach process deaths in the nodes it monitors.This API is less than ideal for the following reasons:
Monitoring processes and managing goroutines are two completely separate functionalities. It is unexpected that we will only monitor node deaths if we also have some goroutine to run. We recently had to patch this module in order to allow callers to just monitor nodes, without having to run a function to achieve that goal (#108594).
We will only actually monitor nodes if the test calls
Wait()
on the monitor. While this is good practice and should happen every time, this is not enforced and tests can continue to pass even if they don't. In addition, node deaths that happen before the test callsWait()
go completely unnoticed.Because goroutine management and process monitoring are coupled, it makes composition really difficult. Suppose a function
A
creates a new monitor in order to run a background task. Suppose also that we're writing a new test that wants to callA
to reuse its functionality. If our test B wishes to restart nodes, we now have to find a way to pass that information down toA
so that A's monitor can properly handle that node death event. In the general case, if our test involves creating N monitors (directly or indirectly), and we want to perform a node restart, we need to find a way to pass that information to all N monitors.Proposal
A proposed way to deal with this is to promote node monitoring to the test runner as a default. Every test should have an active "node monitor" that checks for unexpected node deaths. We could have a single
t.ExpectNodeDeath(n1)
to indicate that this test expects n1 to die, and that can happen only once. What's currently theMonitor
will be deprecated in tests in favour of a simple goroutine runner with builtin panic handling.Jira issue: CRDB-35614