Project Receptor is a flexible multi-service relayer with remote execution and orchestration capabilities linking controllers with executors across a mesh of nodes.
For a couple of weeks now we have been chasing a series of bugs that are first found by throwing the kitchen sink of integration tests in AWX, of which Receptor is a component. Ideally, we would have more precise tests that we can run while developing to ensure we don't regress on the critical path of Receptor. That critical path is, basically, executing jobs AND continuing to do so under failure and recovery conditions. We could spend our effort on increasing our unit test coverage of precise permutations of failure conditions, but I'd like to take it in a different direction. What if we more formally validate the behavior of Receptor? Here is what I am thinking:
Receptor is a mesh and has distributed-systems properties; maybe not a full-on distributed system, but it quacks like one. We should test it like a distributed system. How, exactly? We should define the valid state(s) that the system can be in. We should have a "simulator" that can take a recording of operations, i.e. <parallel_stream_id, timestamp_start, timestamp_end, operation, input, output>. The simulator will call the state-validation code for each operation and determine whether the output is correct for the input, given the state.
This simulator is only as good as the recordings. To get good recordings we would need to stand up Receptor and exercise it while simultaneously causing chaos, to invoke the edge cases we want the recordings to let us verify.
So what do we need to actually build here?
1. Instrument Receptor with operation logging (much like job lifecycle in AWX)
We only need to log a few operations:
| Operation   | Input                   | Output                          |
|-------------|-------------------------|---------------------------------|
| start work  | local_node, remote_node | work_unit_id                    |
| cancel work | node, work_unit_id      | null                            |
| status      | node                    | work_unit_id1, work_unit_id2, ... |
TODO: I am wondering if the status output needs to contain the status of each work unit (i.e. running vs. canceled, etc.)
THOUGHTS: I think the above is enough to start with. I'd like to add something like stream stdout later.
2. Bring up a largish Receptor mesh & run commands against it
We have this today with the test scaffolding. We may need to expand on the "run commands" part so that we have something like a producer and a consumer. TODO: add more thoughts on this
3. Code the state checker
https://github.com/anishathalye/porcupine <-- we can heavily leverage Porcupine. It has an interface for expressing our state checker and running linearizability checks against it. We don't care about the linearizability so much as we do about single-threaded validation, to start with.
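A minimal, self-contained sketch of that single-threaded validation, shaped after Porcupine's Init/Step model interface (Porcupine's real Model uses interface{} state; the operation and state types here are assumptions for illustration):

```go
package main

import "fmt"

// op is an illustrative decoded operation from a recording.
type op struct {
	kind   string   // "start", "cancel", or "status"
	workID string   // used by start/cancel
	want   []string // recorded output of a status call
}

// meshState tracks which work units should currently exist.
type meshState map[string]bool

// step applies one operation to the state and reports whether the recorded
// output was valid for that state; this mirrors Porcupine's Step function.
func step(s meshState, o op) (bool, meshState) {
	switch o.kind {
	case "start":
		s[o.workID] = true
		return true, s
	case "cancel":
		delete(s, o.workID)
		return true, s
	case "status":
		// A valid status output lists exactly the live work units.
		if len(o.want) != len(s) {
			return false, s
		}
		for _, id := range o.want {
			if !s[id] {
				return false, s
			}
		}
		return true, s
	}
	return false, s
}

// checkHistory replays a recording single-threaded: a stand-in for handing
// the same model to Porcupine for a full linearizability check later.
func checkHistory(ops []op) bool {
	s := meshState{} // Init: no work units exist
	ok := true
	for _, o := range ops {
		if ok, s = step(s, o); !ok {
			return false
		}
	}
	return true
}

func main() {
	history := []op{
		{kind: "start", workID: "w1"},
		{kind: "start", workID: "w2"},
		{kind: "cancel", workID: "w1"},
		{kind: "status", want: []string{"w2"}},
	}
	fmt.Println(checkHistory(history)) // → true
}
```

Because `step` has the same (state, operation) -> (ok, state) shape as Porcupine's Model.Step, upgrading from this replay loop to concurrent linearizability checks should mostly be a matter of packaging, not a rewrite.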
4. TODO: Add an example of what the output might look like and examples of possible failure(s)