ansible / receptor

Project Receptor is a flexible multi-service relayer with remote execution and orchestration capabilities linking controllers with executors across a mesh of nodes.

Linearizability of Receptor #484

Open chrismeyersfsu opened 2 years ago

chrismeyersfsu commented 2 years ago

For a couple of weeks now we have been chasing a series of bugs that were first found by throwing the kitchen sink of integration tests at AWX, of which Receptor is a component. Ideally, we would have more precise tests that we can run while developing to ensure we don't regress on the critical path of Receptor. That critical path is, basically, executing jobs AND continuing to do so under failure and recovery conditions. We could spend our effort on increasing our unit test coverage of precise permutations of failure conditions, but I'd like to take it in a different direction. What if we more formally validate the behavior of Receptor? Here is what I am thinking...

Receptor is a mesh and has distributed-system properties. Maybe it is not a full-on distributed system, but it quacks like one, so we should test it like one. How exactly? We should define the valid state(s) that the system can be in. We should have a "simulator" that can take a recording of operations, i.e. <parallel_stream_id, timestamp_start, time_end, operation, input, output>. The simulator will call the state-validation code for each operation and determine whether the output is correct for the input, given the state.
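To make the recording and simulator idea concrete, here is a minimal sketch in Python. It is illustrative only: the one-JSON-object-per-line format, the `Record` class, and the `load_recording`, `is_sequential` and `replay` helpers are all hypothetical names, not anything that exists in Receptor today. The toy `counter_step` validator stands in for real state-validation code.

```python
import json
from dataclasses import dataclass
from typing import Any, Callable, Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class Record:
    """One recorded operation; fields mirror the tuple described above."""
    parallel_stream_id: int
    timestamp_start: float
    time_end: float
    operation: str
    input: Any
    output: Any


def load_recording(lines: Iterable[str]) -> List[Record]:
    """Parse one JSON object per line (an assumed on-disk format)."""
    return [Record(**json.loads(line)) for line in lines if line.strip()]


def is_sequential(records: List[Record]) -> bool:
    """True when no two operations overlap in time."""
    ordered = sorted(records, key=lambda r: r.timestamp_start)
    return all(a.time_end <= b.timestamp_start
               for a, b in zip(ordered, ordered[1:]))


# A step function takes (state, record) and returns (ok, new_state).
Step = Callable[[Any, Record], Tuple[bool, Any]]


def replay(records: List[Record], init: Any, step: Step) -> Optional[Record]:
    """Replay records in start order; return the first invalid one, or None."""
    state = init
    for rec in sorted(records, key=lambda r: r.timestamp_start):
        ok, state = step(state, rec)
        if not ok:
            return rec
    return None


def counter_step(count: int, rec: Record) -> Tuple[bool, int]:
    """Toy validator: each operation must return the previous count plus one."""
    expected = count + 1
    return rec.output == expected, expected
```

Replaying in start order is only meaningful when `is_sequential` holds. Once operations from different streams overlap, more than one ordering may be valid, and that is the case a linearizability checker is meant to handle.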

This simulator is only as good as the recordings. To get good recordings we would need to stand up Receptor and exercise it while at the same time causing chaos, so that the edge cases actually occur and show up in the recordings we verify.

So what do we need to actually build here?

1. Instrument Receptor with operation logging (much like job lifecycle in AWX)

We only need to log a few operations:

| Operation | Input | Output |
| --- | --- | --- |
| start work | local_node, remote_node | work_unit_id |
| cancel work | node, work_unit_id | null |
| status | node | work_unit_id1, work_unit_id2, ... |

TODO: I am wondering if the status output needs to contain the state of each work unit (e.g. running vs. canceled)

THOUGHTS: I think the above is enough to start with. I'd like to add something like stream stdout later.
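As a rough illustration of what the operation logging could emit, here is a small Python sketch that wraps a call and writes one JSON line per operation. The real instrumentation would live inside Receptor itself; `log_operation`, the `sink` argument and the `clock` parameter are hypothetical names, and the record keys are simply the fields of the tuple proposed earlier in this issue.

```python
import io
import json
import time
from contextlib import contextmanager


@contextmanager
def log_operation(sink, stream_id, operation, op_input, clock=time.time):
    """Write one JSON line for the wrapped operation, with start and end times.

    The caller stores the operation's result under result["output"].
    If the wrapped call raises, the record is still written with output null.
    """
    result = {}
    start = clock()
    try:
        yield result
    finally:
        record = {
            "parallel_stream_id": stream_id,
            "timestamp_start": start,
            "time_end": clock(),
            "operation": operation,
            "input": op_input,
            "output": result.get("output"),
        }
        sink.write(json.dumps(record) + "\n")


# Usage: the assignment below is a stand-in for the real "start work" call.
sink = io.StringIO()
with log_operation(sink, 0, "start work",
                   {"local_node": "a", "remote_node": "b"}) as result:
    result["output"] = "unit-1"
```

The clock is injectable because timestamps taken on different nodes are only comparable if the clocks agree. How to get comparable timestamps across a real mesh is left open here.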

2. Bring up a largish Receptor mesh & run commands against it

We have this today with the test scaffolding. We may need to expand on the "run commands" part, where we have a sort of producer and consumer. TODO: add more thoughts on this

3. Code the state checker

https://github.com/anishathalye/porcupine <-- we can heavily leverage Porcupine. It has an interface to express our state checker and to do linearizability checks on it. To start with, we care less about linearizability than about single-threaded validation.
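For a feel of what single-threaded validation of the three operations might look like, here is a sketch in Python that loosely follows the initial-state-plus-step-function shape Porcupine models use. It does not use Porcupine itself. The rules encoded here are assumptions for illustration, not confirmed Receptor behavior: work units are tracked against `local_node`, a canceled unit stays listed in status, and status must return exactly the known unit IDs.

```python
from typing import Any, Dict, List, Optional, Tuple

# State maps node name -> {work_unit_id: "running" | "canceled"}.
State = Dict[str, Dict[str, str]]


def step(state: State, operation: str, op_input: Dict[str, Any],
         output: Any) -> Tuple[bool, State]:
    """Return (ok, new_state) for one operation applied to the given state."""
    nodes = {node: dict(units) for node, units in state.items()}  # copy
    if operation == "start work":
        units = nodes.setdefault(op_input["local_node"], {})
        # Assumption: a new unit must get a non-empty ID unused on that node.
        if not output or output in units:
            return False, state
        units[output] = "running"
        return True, nodes
    if operation == "cancel work":
        units = nodes.get(op_input["node"], {})
        unit_id = op_input["work_unit_id"]
        # Assumption: canceling an unknown unit is invalid; output is null.
        if unit_id not in units or output is not None:
            return False, state
        units[unit_id] = "canceled"
        return True, nodes
    if operation == "status":
        known = nodes.get(op_input["node"], {})
        # Assumption: status lists every known unit exactly once.
        ok = isinstance(output, list) and sorted(output) == sorted(known)
        return ok, state
    return False, state  # unknown operation


def check(history: List[Tuple[str, Dict[str, Any], Any]]) -> Optional[int]:
    """Return the index of the first operation the model rejects, or None."""
    state: State = {}
    for index, (operation, op_input, output) in enumerate(history):
        ok, state = step(state, operation, op_input, output)
        if not ok:
            return index
    return None


history = [
    ("start work", {"local_node": "a", "remote_node": "b"}, "u1"),
    ("status", {"node": "a"}, ["u1"]),
    ("cancel work", {"node": "a", "work_unit_id": "u1"}, None),
    ("status", {"node": "a"}, []),  # rejected: "u1" should still be listed
]
```

Under these assumed rules, `check(history)` flags the last status call. If status also reported each unit's state, as the TODO in step 1 wonders, the model could additionally verify that "u1" is reported as canceled.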

4. TODO: Add an example of what the output might look like and examples of possible failure(s)

shanemcd commented 2 years ago

No idea what the title means, but 👍 to the words in the description! 😄