dagwieers commented 6 years ago

Proposal: host-redundancy

Author: Dag Wieers @dagwieers

Date: 2018-02-01

Status: New
Proposal type: core design
Targeted Release: 2.6
Estimated time to implement: unknown

Motivation

Some infrastructure is redundantly set up to not rely on a single node/route/interface. We would like a redundant way to access those resources when using Ansible.

Problems

Use case #1: Some systems are designed for redundancy, e.g. ACI works with a cluster of APICs (3 or more) to manage the ACI fabric. Any of these APICs can be used, and in case one of the APICs is unfit for use (planned or unplanned, isolated or quorum), Ansible could be using a fully-fit APIC to perform the required tasks.
Use case #2: Some systems have more than one interface for redundancy (or convenience), e.g. when managing Windows laptops they can have either a Wired or Wireless IP address, but it's still the same system. We would like to target this system, no matter whether it's connected by wire or wireless. (Because we don't know which one is being used)

Solution proposal

It would be nice if ansible_host could be defined as a list and Ansible would attempt to use the next address if the previous is unreachable. Possibly providing a methodology on how to select the order to try (e.g. consecutively, randomly, ...)
This would work equally well for delegate_to
If we also want to iterate over host+port, we may need a different way of specifying that

bcoca commented 6 years ago

instead of hacking this into ansible_host .. an inventory plugin could resolve it, see https://github.com/ansible/ansible/pull/32857 as an example. Instead of querying the network, go over provided list and return to ansible 'first reachable IP' for that host.

dagwieers commented 6 years ago

I don't think you fully understand what we need here.

In the case of Cisco ACI we would need to connect to one APIC in the cluster, authenticate to it and check the status of the APIC in the cluster (is it read-only or read-write) to understand whether this APIC is one we can use for communication.

This means the the inventory plugin should be connecting, authenticating and quering the device before returning it in the inventory. If during the playbook run things change, this would fail hard. That's not what we are looking for, we are looking for a solution where the connection is aware of this.

This means the inventory needs to include a list of hosts, the persistent connection is testing which APIC in the cluster can be used, and in case the cluster somehow no longer has quorum, start using the right APIC from the cluster (not necessarily the one it was using).

So doing this once before the playbook-run, or some time in advance is NOT going to work. It will not offer any redundancy.

bcoca commented 6 years ago

vars plugin then

dagwieers commented 6 years ago

Doesn't make sense either, sorry.

It's part of making and maintaining the persistent connection. It does not belong in a vars plugin, and it wouldn't work because vars are evaluated early as well.

dagwieers commented 6 years ago

@sivel Why the thumbs-down ?

bcoca commented 6 years ago

@dagwieers no, they are not evaluated early, that changed in 2.4

though im confused, if they are changing .. it wont be a persistent connection ...

dagwieers commented 6 years ago

@bcoca They are not evaluated within the connection plugin or module when a connection error, or a cluster state change, has happened. So if the connection plugin or the module only received a single host, and that host is not (no longer) valid as a host, how would the vars plugin kick in and provide a different host ?

So yes, a vars plugin is evaluated too early.

sivel commented 6 years ago

@sivel Why the thumbs-down ?

I don't like the idea of this. I'd much rather be explicit. This was proposed before, we came back to saying to use (at the time) wait_for, and validate which was up, and use that going forward.

This is much the same as "I don't know if I boot strapped the host yet, and I change the SSH port, which is correct?" I even have my playbook used as the response to that request:

- hosts: all
  gather_facts: false
  vars:
    ansible_ssh_port: 3121
  pre_tasks:
    - local_action:
        module: wait_for
        port: "{{ ansible_ssh_port|default(22) }}"
        host: "{{ ansible_ssh_host|default(inventory_hostname) }}"
        timeout: 10
      register: result
      ignore_errors: true

    - set_fact:
        ansible_ssh_port: 22
      when: result|failed

    - setup:

In the end I don't think it should be core functionality of connections. Use wait_for, ping, wait_for_connection, some other module designed to test this, and go with that information.

bcoca commented 6 years ago

@dagwieers the point was doing the validation on the plugin and returning onlly the 'valid' data for connection to consume

dagwieers commented 6 years ago

So again, both @bcoca and @sivel are not reading into what we need. I guess I am not explaining myself.

We have 3 or more APICs in a cluster. Each of these APICs can be used, but you only know by connecting to one and see if they are part of the cluster. (authenticate, query for node status in cluster). This is related to the ACI REST connection plugin (currently it is part of the module, but that's going to change).

At any point in time the node you are talking to could no longer be a node that you can use for making changes, because it may be isolated from the cluster for whatever reason. Sure it could suddenly by down as well, that's the easier case I guess.

So any solution where you would check before running a task, or where you provide one working APIC is not a redundant solution, because the very next moment that system may not be working and we still have at least 2 other nodes that are working fine.

(A single module is doing multiple requests to the APIC, so every one of these requests could fail and require an evaluation to use a different APIC)

So the solution for real redundancy cannot come from the inventory, or from a variable, it needs to come from the layer that makes the connection, manages the connection or reconnects when there are issues. That layer needs to be aware of the existing nodes, the node being used and the fallback options.

So any solution that makes the decision beforehand will never work. Because it means that the layer that actually needs the whole picture only knows about a single node. FAIL

dagwieers commented 6 years ago

valid is in the eye of the beholder, it was valid at the time of testing, but may no longer be valid at the time of using.

And valid here could mean making the actual connection and requesting the status from the system, which I don't want to duplicate in an inventory plugin or vars plugin, not only because it does not make sense, but because it is irrelevant.

dagwieers commented 6 years ago

cc @rsmeyers

rsmeyers commented 6 years ago

In my eyes Dag's explanation seems very valid. We btw could have a similar functionality requirement with other controllers, for example OpenStack controllers, we have a similar behaviour there.

bcoca commented 6 years ago

@dagwieers no need to re-implement, can use same connection code ... not sure how things can change that fast between test and usage .. but then wouldn't that also change mid connection and between commands?

I think what I am missing is what would cause these seemingly uncontrolled changes from one second to the next and is it really ansible's job to compensate for them constantly?

in any case, it should be possible to build this into the specific plugins w/o modifying ansible core.

bcoca commented 6 years ago

@dagwieers just tested, there is nothing in core ansible that validates that ansible_host is a string, it will just choke at the specific connection plugin used if it expects otherwise, so an aci connection should be safe?

dagwieers commented 6 years ago

@bcoca So on one hand we have the need for ACI, but I am looking at the general case as well. Multi-homed systems, etc... So I don't want to do this only for ACI, but also allow this for other connection types, like SSH and/or WinRM (use case #2).

I think what I am missing is what would cause these seemingly uncontrolled changes from one second to the next and is it really ansible's job to compensate for them constantly?

It does not matter what is causing it, but Ansible should be able to work redundantly if there's a high-available setup. It's not my wish, it's what customers are demanding. But as I already indicated, this could be planned downtime (migration, upgrade, ...), network-related issues, hardware failure, software errors, or whatever reason people demand redundancy for highly-critical systems.

bcoca commented 6 years ago

sorry, but for me 'highly available setup' would mean that the connection info should always work ... I seem to not be getting 'somethign' here.

in any case you can easily do a proof of concept with a new 'ssh_cluster_aware' connection plugin

rsmeyers commented 6 years ago

@bcoca More and more we see high available setups not using a VIP anymore, as this adds a complexity which in the end could also go wrong, so they put it on the client to determine who is active and available, hence, this needs to be controlled from Ansible side

jhg03a commented 4 years ago

My thoughts are suppose I have a list of IPs for a given physical server. SSH may only be listening on some of them, or based on where the control host is not all may be reachable.

From a security stance it's not stellar, but in the event that the list gets out of date and somehow starts spanning multiple hosts there will be a SSH hostkey missmatch error. However this leaves the door open in the case where the entire list is new and you don't know if the host key is the one you want.

ansible / proposals

Make ansible_host a list of possible addresses #97

Proposal: host-redundancy

Motivation

Problems

Solution proposal