Fix Retry issues - Githubissues

kbrock commented 4 months ago

Dependencies

[x] https://github.com/ManageIQ/floe/pull/199
[x] https://github.com/ManageIQ/floe/pull/200 (I can remove if needed)

Overview

Running the steps was causing setting workflow#end? = true, so in some cases it was not retrying. In other cases it was losing the RetryCount. I think state_history was not properly stored. The most obvious issue here is you can notice the number of times we stub run_async! has changed. Think this should shine a light on the issues.

A few cases I increased the retry_count because that caused the errors to be exposed.

Changes

Workspace#step_next carries across RetryCount when there is an error and it ends up on the same state. I felt like there are a few potential edge cases - hence so many checks around propagating those fields.
next_state ||= fixed issues where we deleted a next_state and skipped the retry.
moving self after the retry logic helped us avoid setting end? = true which would have state.run_nonblock! working but workspace.run_nonblock failing.

kbrock commented 4 months ago

update:

rebase

kbrock commented 4 months ago

Ran the tests without the changes to verify this.

rspec ./spec/workflow/states/task_spec.rb:238

The WAS comment shows the values that get this to pass.

The fact that 1.times works here tells you that this is only run 1 time (and we never retry). This is a result of end? = true after the first call.

Once that was fixed, we get 3 history records, and a retry count of 3.

    describe "Retry" do
      let(:workflow) do
        make_workflow(
          ctx, {
            "State"        => {
              "Type"     => "Task",
              "Resource" => resource,
              "Retry"    => [{"ErrorEquals" => ["States.Timeout"], "MaxAttempts" => 2}],
              "Next"     => "SuccessState"
            }.compact,
            "SuccessState" => {"Type" => "Succeed"},
          }
        )
      end

      context "with specific errors" do
        let(:retriers) { [{"ErrorEquals" => ["States.Timeout"], "MaxAttempts" => 2}] }

        it "retries if that error is raised" do
          # WAS: 1.times
          # 1 regular run + 2 retries = 3 times
          3.times { expect_run_async(input, :error => "States.Timeout") }

          workflow.run_nonblock # WAS: workflow.current_state.run_nonblock

          expect(ctx.next_state).to          be_nil # WAS: "State"
          expect(ctx.state["Retrier"]).to    eq(["States.Timeout"])
          expect(ctx.state["RetryCount"]).to eq(3) # WAS: 1
          expect(ctx.state_history.count).to eq(3) # WAS: 1
          expect(ctx.input).to               eq(input)
          expect(ctx.output).to              eq({"Error" => "States.Timeout"}) # WAS nil
          expect(ctx.status).to              eq("failure")
          expect(ctx.ended?).to              eq(true)
        end

miq-bot commented 4 months ago

Checked commits https://github.com/kbrock/floe/compare/a1d25f07e4955272bba19e8ea60b98dbb9a8665f~...4172ab3ac57053600ad0937735f56350b07882ec with ruby 2.7.8, rubocop 1.56.3, haml-lint 0.51.0, and yamllint 4 files checked, 0 offenses detected Everything looks fine. :cake:

agrare commented 4 months ago

Great find @kbrock

ManageIQ / floe

Fix Retry issues #202

Dependencies

Overview

Changes