azriel91 / peace

Zero Stress Automation
https://peace.mk
Apache License 2.0

`ApplyCmd`: Cancel safety #141

Closed azriel91 closed 10 months ago

azriel91 commented 1 year ago

Enable users to interrupt an execution safely, by making the command resilient to interruption.

Store states for items that have been applied.

Cancellation points within item apply_exec() may be deferred.

Looked at:

azriel91 commented 1 year ago

  1. We want interruptions.

  2. CmdBase should handle receiving interruptions, and sending to interrupt_tx.

  3. interrupt_rx needs to be passed to *Cmds that iterate over items, so that the iterator can be interrupted.

  4. CmdBase should probably instantiate the (interrupt_tx, interrupt_rx) channel.

  5. Should OutputWrite be the trait to interact with the outside world?

    • If receiving interruptions is in the same trait, then:

      • one param needs to be passed in for receiving input + writing output to the outside world.
      • it may not be possible to receive input and write to output at the same time within CmdBase, meaning the OutputWrite implementation needs to handle that.
      • Maybe possible through channels.
    • If receiving interruptions is in a different trait, then

      • the developer needs to pass in two params.
      • it may not be possible if it needs mutable access.
      • maybe make it immutable references, and use channels on the OutputWrite implementation.
  6. How do the output endpoints we know of also receive input?

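    | Endpoint | Input                | Output                        |
    |----------|----------------------|-------------------------------|
    | CI       | process signals only | stdout (append), logs         |
    | CLI      | stdin                | stdout (interactive / append) |
    | Web API  | web request          | HTTP response                 |
    | WASM     | function call        | function call                 |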
  7. What would it look like in code?

    • framework CmdBase:

      • ProgressRender holds output to submit progress.
      • fn_exec must race with interrupt_rx.recv().

      This means the developer needs to pass in something that produces (interrupt_tx, interrupt_rx); CmdBase calls that generator function, and passes interrupt_rx to the *Cmd.

    • framework / implementor Endpoint:

      impl ProgressEndpoint for CliEndpoint {
          async fn progress_begin(&mut self) {}
          async fn progress_update(&mut self) {}
          async fn progress_end(&mut self) {}
      }
      
      impl OutputEndpoint for CliEndpoint {
          async fn present(&mut self) {}
      }
      
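      impl InputEndpoint for CliEndpoint {
          // Body left unimplemented in this design sketch.
          async fn interrupt_channel(&mut self) -> Receiver<InterruptSignal> { todo!() }
      }
      
      impl IntoEndpoint for CliEndpoint {
          // Body left unimplemented in this design sketch.
          async fn into_endpoint(self) -> Endpoint { todo!() }
      }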
      
      /// Cloneable, so developers can choose whether or not this endpoint
      /// is used for input/progress/outcome output.
      ///
      /// Held by the Peace framework.
      struct Endpoint {
          output_tx: Sender<Box<dyn Presentable>>,
          progress_tx: Sender<ProgressUpdate>,
          interrupt_rx: Receiver<InterruptSignal>,
      }
    • developer:

      // framework:
      // CmdBase - needs `output` only for progress
      
      // developer, one of:
      let cmd_ctx_builder = CmdCtx::builder_x()
          .with_input(input)
          .with_output(output)
          .await?;
      let cmd_ctx_builder = CmdCtx::builder_x()
          .with_endpoint(endpoint) // only one endpoint?
          .await?;
  8. Should we support multiple input / output endpoints?

    1. We may want to have executions / progress / telemetry from a single request, so multiple output endpoints is a plausible use case.

    2. An execution may be interrupted by a user, by a failsafe from an automation infrastructure maintainer, or by inbuilt rate limiting safeguards, so multiple input endpoints are plausible.

    3. Would all interruptions be through the same input endpoint?

      1. Is the input endpoint one per process (stdin, interrupt), or one per command context (web request)?
      2. It's not even the same web request for the interrupt.
      3. How would we interrupt an execution from a web request?
      4. The interrupt_tx must be acquirable / accessible from another thread, as interruption would be another request.
      5. The interrupt_tx needs to be stored in memory.
      6. Having only one way to access the interrupt_tx per command context is convenient -- otherwise we would have to poll multiple interrupt_rxs -- one for each endpoint.
      7. It's not handy if you want to interrupt via SIGINT on an automation web server (and interrupt all processes).
    4. To support interruptions from a separate task, we need to:

      1. Store running executions in a map.
      2. When interruption comes in for a given execution ID, we look for the execution in the map.
      3. Send an interruption to that execution.
      4. For CLI, an ID is not passed in (plain SIGINT), but we take it to mean we will interrupt all executions.
      5. An "execution" may be a queue of commands, meaning all remaining commands in the queue will not be executed.
      6. From the execution's point of view, interruption will only come through one channel.
      7. There can be multiple endpoints connected to the process.
      8. So, we have to separate the main task from the "run this command" task.
      9. For CLI, the main task just waits for all executions to complete, then displays output.
      10. For a web server, the main task listens for connections.
    5. Would all command progress / outcome be written through the same output endpoint?

      1. From the execution's point of view, it should only have to write to one output endpoint. That output endpoint then replicates it to all the other endpoints.
      2. For CLI, there would commonly be one output endpoint, though you may also want to write to a file -- both progress and outcome.
      3. For a web server¹, progress is pulled, so we need to store it in memory.
      4. For a web server, outcome is pulled, so we need to store it in a database.
      5. For WASM², progress is pushed, like in CLI.
      6. For WASM, outcome is pushed or pulled: we write to state on the server, which gets pushed to clients, but clients can also reload the page and request state.
  9. Should we refactor the codegen crate now?

    We should do it when we've settled on a design.

¹ Web server in this context means a web API, e.g. a REST API or a server function in leptos.

² WASM here means pure WASM automation, independent of rendering -- which could be client side rendering (CSR) or server side rendering (SSR). The WASM automation could still send information to the server for telemetry purposes.
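
A minimal sketch of point 3 above (passing interrupt_rx to whatever iterates over items), written against `std::sync::mpsc` so it has no dependencies. It is not Peace's actual API: `apply_items`, `ApplyOutcome`, and the unit-struct `InterruptSignal` are hypothetical names, and the non-blocking check between items stands in for the async race against `interrupt_rx.recv()` described in point 7.

```rust
use std::sync::mpsc::{Receiver, TryRecvError};

/// Hypothetical marker type for an interruption request.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct InterruptSignal;

/// Hypothetical outcome of iterating over items.
#[derive(Debug, PartialEq, Eq)]
pub enum ApplyOutcome {
    /// Every item was applied.
    Complete { applied: Vec<String> },
    /// An interruption was seen between items.
    Interrupted {
        applied: Vec<String>,
        remaining: Vec<String>,
    },
}

/// Applies items in order, checking for an interruption before each item.
///
/// An item that has started is always allowed to finish, so the list of
/// applied items stays accurate and can be stored.
pub fn apply_items(
    items: &[&str],
    interrupt_rx: &Receiver<InterruptSignal>,
    mut apply: impl FnMut(&str),
) -> ApplyOutcome {
    let mut applied = Vec::new();
    for (index, &item) in items.iter().enumerate() {
        match interrupt_rx.try_recv() {
            Ok(InterruptSignal) => {
                return ApplyOutcome::Interrupted {
                    applied,
                    remaining: items[index..].iter().map(|s| s.to_string()).collect(),
                };
            }
            // No signal yet, or no sender left: carry on with the next item.
            Err(TryRecvError::Empty) | Err(TryRecvError::Disconnected) => {}
        }
        apply(item);
        applied.push(item.to_string());
    }
    ApplyOutcome::Complete { applied }
}
```

Returning both the applied and the remaining items is one way to satisfy "store states for items that have been applied"; an async version would presumably race each item's future against the receiver instead of polling between items.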

azriel91 commented 1 year ago

Implementation

Notes

  1. CmdCtx is used to execute one or more *Cmds.
  2. Invoking a command should queue a Cmd to be executed, with X parameters.
  3. That should store an entry in a queue, and return an ID.
  4. An interrupt is an input taken in from an endpoint, which may be paired with the ID.
  5. When an interrupt is received, look in the queue, and send an interrupt signal to the relevant execution(s).
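
The notes above could be sketched roughly as follows. This is only an illustration under assumptions: `ExecutionRegistry`, `ExecutionId`, `enqueue`, and `interrupt` are hypothetical names, a `BTreeMap` from the standard library stands in for whatever map type is eventually used, and an interrupt without an ID is taken to mean "interrupt everything", as discussed for plain SIGINT on the CLI.

```rust
use std::collections::BTreeMap;
use std::sync::mpsc::{self, Receiver, Sender};

/// Hypothetical ID returned when a command is queued.
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
pub struct ExecutionId(u64);

/// Hypothetical marker type for an interruption request.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct InterruptSignal;

/// Tracks the interrupt sender for each queued execution.
#[derive(Default)]
pub struct ExecutionRegistry {
    next_id: u64,
    interrupt_txs: BTreeMap<ExecutionId, Sender<InterruptSignal>>,
}

impl ExecutionRegistry {
    /// Queues an execution, returning its ID and the receiver that the
    /// execution should check for interruptions.
    pub fn enqueue(&mut self) -> (ExecutionId, Receiver<InterruptSignal>) {
        let execution_id = ExecutionId(self.next_id);
        self.next_id += 1;
        let (interrupt_tx, interrupt_rx) = mpsc::channel();
        self.interrupt_txs.insert(execution_id, interrupt_tx);
        (execution_id, interrupt_rx)
    }

    /// Sends an interrupt to one execution (`Some(id)`) or to all of them
    /// (`None`). Returns how many executions received the signal.
    pub fn interrupt(&self, id: Option<ExecutionId>) -> usize {
        let mut signalled = 0;
        for (execution_id, interrupt_tx) in &self.interrupt_txs {
            let targeted = id.map_or(true, |id| id == *execution_id);
            // `send` fails once the execution has dropped its receiver.
            if targeted && interrupt_tx.send(InterruptSignal).is_ok() {
                signalled += 1;
            }
        }
        signalled
    }
}
```

A return value of 0 would correspond to "nothing to interrupt", for example when the ID is unknown or the execution has already finished and dropped its receiver.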

Plan

  1. Add CmdBlock, which is one "iterator operation" for all items -- one of: discovering items, or cleaning up items, or ensuring items. This is a genericized *Cmd::exec call.

    Each CmdBlock could have errors per item, and it also could have an error for the full block.

    For example, state discovery may fail to discover state for one or more items, or it could fail to serialize states to storage.

  2. Add CmdExecution, which is the "full command", which discovers, cleans, and ensures. This is a Vec<CmdBlock>.

    This is a queue of CmdBlocks to execute, and so it creates the interrupt_rx to pass to the CmdBlock, and itself holds the interrupt_tx to send the interruption signal.

    Developers will call CmdExecution::interrupt to interrupt the current execution, which propagates down to the CmdBlock.

  3. CmdExecution is a long lived item, which can be awaited until it is complete.

  4. Pass a Sender<CmdBlock> to CmdCtx, so that *Cmds can send in a CmdBlock.

  5. Developers need to either:

    • Invoke *Cmd::exec, which waits for the outcome, then returns.
    • Invoke *Cmd::exec_bg, which returns an execution ID.
  6. CLI invocation will use *Cmd::exec, which will race:

    • awaiting the CmdExecution's completion, then returning the Outcome.
    • receiving interruptions.
  7. Add a CmdExecutionMap, which holds all executions for the process. This is an IndexMap<ExecutionId, CmdExecution>.

    There should be a channel for CmdExecRequest: (cmd_exec_request_tx, cmd_exec_request_rx).

    Executions will be on the thread that holds the map, and maybe they need to be spawned on that map as well(?).

  8. The web server will be polled for progress, which returns "progress or done", and will later be queried for the Outcome.

  9. A web server request for an interruption will call the CmdExecution to interrupt the execution. CmdExecution will return "interruption request received" or "nothing to interrupt".

  10. Subsequent polling of progress will automatically discover the interruption.

  11. CmdExecution will use channels behind the scenes, so that CmdBlocks and the web server can hold onto &CmdExecution without needing to request a lock.

  12. WASM app will have progress / outcome pushed to it, which it then can use to update the UI -- e.g. for leptos CSR (create_effect), or SSR (create_resource, with server function).
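
To make plan items 2, 3, 9 and 11 a little more concrete, here is a rough sketch using a thread and `std::sync::mpsc` instead of async tasks. All of it is assumption rather than the eventual design: `CmdBlock` is reduced to a boxed closure, and `spawn`, `wait`, `InterruptResponse`, and `ExecutionOutcome` are hypothetical names. It shows `CmdExecution` creating the channel, keeping interrupt_tx, and checking interrupt_rx between blocks.

```rust
use std::sync::mpsc::{self, Sender};
use std::thread::{self, JoinHandle};

/// Hypothetical marker type for an interruption request.
pub struct InterruptSignal;

/// Stand-in for a real `CmdBlock`: one unit of work over all items.
pub type CmdBlock = Box<dyn FnOnce() + Send>;

#[derive(Debug, PartialEq, Eq)]
pub enum InterruptResponse {
    RequestReceived,
    NothingToInterrupt,
}

#[derive(Debug, PartialEq, Eq)]
pub enum ExecutionOutcome {
    Complete { blocks_run: usize },
    Interrupted { blocks_run: usize },
}

/// A queue of `CmdBlock`s running in the background.
pub struct CmdExecution {
    interrupt_tx: Sender<InterruptSignal>,
    join_handle: JoinHandle<ExecutionOutcome>,
}

impl CmdExecution {
    /// Starts running the blocks in order on a new thread.
    pub fn spawn(cmd_blocks: Vec<CmdBlock>) -> Self {
        let (interrupt_tx, interrupt_rx) = mpsc::channel();
        let join_handle = thread::spawn(move || {
            let mut blocks_run = 0;
            for cmd_block in cmd_blocks {
                // Interruptions only take effect between blocks here.
                if interrupt_rx.try_recv().is_ok() {
                    return ExecutionOutcome::Interrupted { blocks_run };
                }
                cmd_block();
                blocks_run += 1;
            }
            ExecutionOutcome::Complete { blocks_run }
        });
        Self { interrupt_tx, join_handle }
    }

    /// Requests an interruption. Only `&self` is needed, so no lock is
    /// required to call this while the execution is running.
    pub fn interrupt(&self) -> InterruptResponse {
        match self.interrupt_tx.send(InterruptSignal) {
            Ok(()) => InterruptResponse::RequestReceived,
            // The receiver is gone, so the execution has already ended.
            Err(_) => InterruptResponse::NothingToInterrupt,
        }
    }

    /// Blocks until the execution ends, then returns its outcome.
    pub fn wait(self) -> ExecutionOutcome {
        self.join_handle.join().expect("execution thread panicked")
    }
}
```

In this sketch an interruption that arrives during the last block has no effect and the outcome is `Complete`; a real `CmdBlock` would also take the receiver so that it can stop between items.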