refactor operation dispatcher

eguzki commented 3 weeks ago

The current implementation works (famous last words). However, it does not fit well with the lifetime of the tasks, which may (or may not) run (one or multiple) GRPC requests.

Description of the workflow

The operation dispatcher is created on create_http_context, which has the lifetime of one single HTTP request. It is not being reused for multiple HTTP requests re-using the same connection.
Each operation dispatcher needs to run a list of operations (actions in the configuration) one right after the other. There cannot be parallel execution. Therefore, only one of the actions can be waiting for the GRPC response.
Each operation can run one or multiple (for this latter we do not have use case for now) GRPC requests.
It is the responsibility of the operation (action) to send HTTP response when something goes wrong. For example: hostcalls::send_http_response(status_code: 429, response_headers, body: Some(b"Too Many Requests\n"))
Each operation is responsible to take actions based on failureMode.

The high level structure

The GRPC operation


pub enum OperationResult {  
    Done,  
    Wait,    
    Failed,   
}                  

trait GRPCOperation {
       // Will start the operation task 
       // The return can be either Done, Wait or Failed
       fn start(&self) -> OperationResult;

       // Will resume operation task
       // The return can be either Done, Wait or Failed
       fn on_grpc_response(&mut self, token_id: u32, status_code: u32, resp_size: usize) -> OperationResult;
}

The GRPC operations are started and then they can run N GRPC requests. Each time a GRPC request is done, the operation will be waiting for response that will be signaled with the on_grpc_response.

The operation result has the following meaning:

Done => The operation is done and its methods start and on_grpc_response should not be called.
Wait => The operation is in waiting state and it is expecting that the on_grpc_response will eventually be called
Failed => The operation has failed and entered unrecoverable state. Its methods start and on_grpc_response should not be called.

When something failed on the operation work, the operation should check FailureMode. If failure mode is deny, then the operation should send HTTP response self.send_http_response(status_code: 500, headers: vec![], body: Some(b"Internal Server Error.\n")) or whatever it needs to be send. For instance, on over the limit in a rate limiting action, it should send hostcalls::send_http_response(status_code: 429, response_headers, body: Some(b"Too Many Requests\n")). Then return OperationResult::Failed

On the other hand, if failure mode is allow, the operation should decide what to do. It can either be OperationResult::Failed or OperationResult::Done, but never OperationResult::Wait

The operation dispatcher


pub enum RunAction {  
    DoneOK,  
    DoneFailed,  
    Wait,
}                  

trait OperationDispatcher {
       // Will run the list of operations
       // On RunAction::DoneOK, the operation dispatcher is signaling it is done and does not expect to be called again in `run`
       // On RunAction::DoneFailed, the operation dispatcher is signaling it is done and the workflow failed and does not expect to be called again in `run`
       // On RunAction::Wait, the operation dispatcher is signaling it is NOT done and does expect to be called again in `run`
       fn run(&mut self) -> RunAction;

       // Will resume the work of one of the GRPC operations
       // It does not return anything. The dispatcher expects to be called on the "run" method again.
       fn on_grpc_response(&mut self, token_id: u32, status_code: u32, resp_size: usize);
}

From the Wasm HTTPContext, on http_request_headers, the operation dispatcher should be populated with the action sets and those which conditions do not apply, then filtered out. Then, the run method should be called and map RunAction to proxy_wasm::types::Action. The mapping should be straightforward

RunAction::DoneOK | RunAction::DoneFailed => proxy_wasm::types::Action::Continue
RunAction::Wait => proxy_wasm::types::Action::Wait.

From the Wasm HTTPContext, on on_grpc_call_response, the on_grpc_response method should be called. That should update the internal status of the current operation waiting for GRPC response. Then, resume dispatcher work running again run method. From on_grpc_call_response, it is needed to know whether the dispatcher is done and then call self.resume_http_request() method. This is what RunAction will tell.

RunAction::Wait | RunAction::DoneFailed => // Do nothing. For RunAction::DoneFailed case the oepration have already sent the error response like hostcalls::send_http_response(status_code: 429, response_headers, body: Some(b"Too Many Requests\n"))
RunAction::DoneOK => self.resume_http_request()

When run method it called, internally, the operations will be executed. When one of the operations returns "Fail" state, from the dispatcher perspective, the entire workflow should be halted and the return value should be "RunAction::DoneFailed. When ALL the operations are done and the queue is empty, the return value should be "RunAction::DoneOK. When one of the operations return OperationResult::Wait, the dispatcher should return RunAction::Wait.

When on_grpc_response is called, the dispatcher knows which operation is waiting (the first of the list) and can call its on_grpc_response method. Then. wait run method to be called.

Operation dispatcher state machine

graph TD;
    INIT-->WAIT;
    WAIT-->WAIT;
    INIT-->OK;
    WAIT-->OK;
    INIT-->ERROR;
    WAIT-->ERROR;

eguzki commented 3 weeks ago

Not for today. I would love your thoughts @didierofrivia

didierofrivia commented 3 weeks ago

I think at the moment there's no reason for considering a refactor... specially after https://github.com/Kuadrant/wasm-shim/pull/125 that is solving the use cases you presented with the e2e tests. However, it will be definitely good to review not only the OperationDispatcher but the whole of the wasm shim.... starting from it's name. We might need to present real use cases and/or performance/complexity tests in order to not be biased by our assumptions on hypothetical use cases and workflows. However, these kind of analyses are good and we could revisit it after GA in order to assess its viability 👍🏼

eguzki commented 1 day ago

Should fix https://github.com/Kuadrant/wasm-shim/issues/149

Kuadrant / wasm-shim