cloudfoundry / bosh-cli

BOSH CLI v2+
Apache License 2.0
178 stars 160 forks source link

Add create-recovery-plan command #622

Closed selzoc closed 1 year ago

selzoc commented 1 year ago

This command will allow users to scan a deployment for problems, then write out a YAML-formatted recovery plan, to be used later in the recover command. The CLI prompts per instance group and then per problem type.

Example recovery plan:

instance_groups_plan:
- name: cloud_controller_ng
  max_in_flight_override: 66%
  planned_resolutions:
    missing_vm: recreate_vm_without_wait
- name: diego_cell
  planned_resolutions:
    missing_vm: delete_vm_reference
- name: router
  max_in_flight_override: "2"
  planned_resolutions:
    unresponsive_agent: delete_vm

Note Relies on a director with this PR merged. The command fails with "Director does not support this command. Try 'bosh cloud-check' instead" if the director it is targeting does not return instance groups for problems

klakin-pivotal commented 1 year ago

So, I ran the unit tests, and got one failure in create_recovery_plan_test.go. I haven't spent much time actually reviewing the code, so I currently have no opinion about it.

Summarizing 1 Failure: [FAIL] CreateRecoveryPlanCmd Run problems are found [It] writes a recovery plan based on answers /bosh-cli/cmd/create_recovery_plan_test.go:254

Ran 908 of 908 Specs in 27.500 seconds FAIL! -- 907 Passed | 1 Failed | 0 Pending | 0 Skipped --- FAIL: TestReg (27.72s) FAIL

jpalermo commented 1 year ago

Given this is a set of work that is under fairly active development. Is this something we should mark as experimental?

I say that without any idea how we'd go about doing that...

selzoc commented 1 year ago

Given this is a set of work that is under fairly active development. Is this something we should mark as experimental?

I say that without any idea how we'd go about doing that...

The command does print a warning and exits if the director doesn't support it. I don't think it's particularly experimental, but we also haven't done the other half of the command yet. Perhaps we could hold off on merging until bosh recover is also complete?

selzoc commented 1 year ago

So, I ran the unit tests, and got one failure in create_recovery_plan_test.go. I haven't spent much time actually reviewing the code, so I currently have no opinion about it.

  • I checked out create-recovery-plan-185483613
  • Started up Docker, then started up the bosh/cli image like so: docker run -it --volume=$HOME/workspace/bosh-cli:/bosh-cli bosh/cli:latest /bin/bash
  • And (after commenting out the lint stuff in the script) ran bin/test-unit, which emitted the following failure
root@b9e0db7204a6:/bosh-cli# ./bin/test-unit
<unimportant spew elided>
Will skip:
  ./acceptance
  ./integration
[1688428279] blobstore - 11/11 specs ••••••••••• SUCCESS! 20.861474ms PASS
[1688428279] cloud - 63/63 specs ••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••• SUCCESS! 80.240399ms PASS
[1688428279] cmd - 908/908 specs <success dots elided>
------------------------------
• [FAILED] [0.001 seconds]
CreateRecoveryPlanCmd Run problems are found [It] writes a recovery plan based on answers
/bosh-cli/cmd/create_recovery_plan_test.go:230

  [FAILED] Expected
      <map[string]string | len:2>: {
          "missing_vm": "reboot_vm",
          "mount_info_mismatch": "reattach_disk",
      }
  to have {key: value}
      <map[interface {}]interface {} | len:1>: {
          <string>"missing_vm": <string>"recreate_vm",
      }
  In [It] at: /bosh-cli/cmd/create_recovery_plan_test.go:254 @ 07/03/23 23:53:59.955
------------------------------
<success dots elided>

Summarizing 1 Failure:
  [FAIL] CreateRecoveryPlanCmd Run problems are found [It] writes a recovery plan based on answers
  /bosh-cli/cmd/create_recovery_plan_test.go:254

Ran 908 of 908 Specs in 27.500 seconds
FAIL! -- 907 Passed | 1 Failed | 0 Pending | 0 Skipped
--- FAIL: TestReg (27.72s)
FAIL

Go returning map keys in random order when ranging strikes again! The FakeUI test framework does not allow for very much flexibility in responding to choices for a given input, you just have to give a slice of choices in sequence. This does not work out if the choices are presented in different order each time.

Fixed by sorting the map keys for the problems by type, then ranging over the sorted slice to access the map in a consistent way. Ran the test 150 times over lunch and never had a failure with the fix.

Note that for the user experience, it doesn't really matter, this is purely a testing problem. There are only 6 problem types, so sorting a list of at most 6 does not seem like an expensive operation vs. refactoring how the FakeUI works.

Resolved in dd7ff53

selzoc commented 1 year ago

@klakin-pivotal do you mind testing the fix? Want to make sure it wasn't just fixed on my machine :).

klakin-pivotal commented 1 year ago

@selzoc I checked out dd7ff536a675c61d4c86552035b46e27d67e22a0 (which is currently the latest commit on this branch), ran the tests and they passed. So, it looks like you successfully fixed the busted test.

jpalermo commented 1 year ago

Marking it "experimental" is more of a warning that the behavior might change before we're done with this track of work. For example, maybe we need to change the structure of the recovery plan before we're done.

beyhan commented 1 year ago

Another question I have here is how this relate to the existing cloud-check command? Should we deprecate it in favour of this one at some point? Maybe we can describe the plans around the recover work in https://github.com/cloudfoundry/bosh/discussions like we did for other topics.

selzoc commented 1 year ago

Greate point @beyhan - I've started a discussion here

cunnie commented 1 year ago

✅ CLI vs. old Director

Verified that it emits a proper error message when used with an older Director:

out/bosh create-recovery-plan /tmp/junk3.yml
...

Director does not support this command.  Try 'bosh cloud-check' instead
selzoc commented 1 year ago

Looks good to me (though were I to do this I'd prefer 1-2 commits rather than 5).

Would you like to click the "squash and merge" button? That'll guarantee 1 commit :)