ansible-collections / ansible-consul

:satellite: Ansible role for Hashicorp Consul clusters
https://galaxy.ansible.com/ansible-community/consul/
BSD 2-Clause "Simplified" License
450 stars 313 forks source link

Add rolling restart on linux. #566

Closed bastienwirtz closed 6 months ago

bastienwirtz commented 1 year ago

This patch makes possible to perform rolling restart on linux. When run on an existing cluster, it should keep the cluster up & running.

nre-ableton commented 1 year ago

I agree regarding the service module. I found another approach to this (inspired by this. It makes possible to use command instead of shell and most importantly, keep the system module. However, it's a bit of a hack, not super dirty but still, I would understand if you don't want to go for it (rolling restart is important for me, I would have to maintain a fork for that, I would love not to have to do that but that's fine!).

Do you mean this answer here? https://serverfault.com/a/967205/2671 -- that's a pretty clever workaround, using run_once with delegate_to and with_items/loop. I wonder if that would work with block (I suspect not) or include_tasks (maybe, but I think that include_tasks doesn't support delegation). Anyways, let me know what you find.

The main drawback of this approach is a bunch of skipped in the ansible output.

Personally, I don't think this is a big problem.

Let me know what you think about it :)

Regarding the fork, you might also be able to work around it by using a wrapper role and executing that in serial. I can do some experiments later and see if that would work.

Edit: As I suspected, include_role also doesn't allow throttle. 😞

bastienwirtz commented 1 year ago

Hello @nre-ableton,

fyi, I've been able to successfully test all 3 PRs (merged into one branch) on multiple scenario (bootstrap new cluster, configuration update on existing one, add new nodes to an existing cluster - all running on Ubuntu) on my production env, all went well.

Let me know if you have any question.

bastienwirtz commented 11 months ago

Just added an optional rolling restart delay. After some usage, I find it more reliable with a small delay to make sure there is enough time for the service to be ready, elect a new leader if necessary ... Let me know what do you think!

bastienwirtz commented 10 months ago

Hey @nre-ableton, let me know if I can do anything to move this and the 2 other PR forward. Thx :pray:

nre-ableton commented 10 months ago

Hey @nre-ableton, let me know if I can do anything to move this and the 2 other PR forward. Thx 🙏

Hi @bastienwirtz, sorry for the delay, I've been on holidays. Also, the problem is that CI is broken for this project, which is unfortunately not easy to fix. The root cause of the CI failures is that Molecule no longer accepts this role's name, and until we can get around that I'm hesitant to merge any new changes to the repo.

I need to spend some time to see if there's a workaround (and I haven't found one yet), or possibly to move this role to an Ansible collection (see https://github.com/ansible-collections/ansible-inclusion/discussions/64#discussioncomment-6962013).

bastienwirtz commented 10 months ago

I understand, no worries @nre-ableton ! I never used Molecule but I'll try to look into it if I can.

bastienwirtz commented 7 months ago

@nre-ableton I agree, an opt-in option is better, I added it and took care of other feedback, linting is happy, Molecule is not 😢, not sure why yet. I manually done some test on Ubuntu VMs with and without rolling restart enabled, it worked fine in both case.

nre-ableton commented 7 months ago

Something is wrong with molecule unrelated to this (and your other) PR. I'm investigating now but might not have time to push a PR today to fix it.

bastienwirtz commented 7 months ago

No rush @nre-ableton, and thanks for looking into it!

nre-ableton commented 6 months ago

CI is fixed (again), can you please rebase? 🙏

bastienwirtz commented 6 months ago

Damn, almost there. Maybe just a retry on the debian-11 tests @nre-ableton? I don't think I can do it myself

nre-ableton commented 6 months ago

Damn, almost there. Maybe just a retry on the debian-11 tests @nre-ableton? I don't think I can do it myself

It's ok, I got it. Looks like an HTTP 500 error from galaxy.

bastienwirtz commented 6 months ago

Updated @nre-ableton ! Thanks for your work on the CI :pray: