Alfresco / alfresco-ansible-deployment

Ansible playbooks for deploying ACS
https://alfresco.github.io/alfresco-ansible-deployment/
Apache License 2.0

Single server. Recovery from image on server with different IP fails. #758

Open jalvarezferr opened 9 months ago

jalvarezferr commented 9 months ago

Bug description

In scenarios where a single-server installation is recovered from an image of the server onto a new machine with a different IP, services do not work, because the old server's IP is hardcoded in their configuration files.

This scenario particularly affects cloud environments, where backups are taken using snapshots or autoscaling systems are used to provide fault tolerance.

The playbooks seem to want to preserve the pre-1.2.0 behavior, where single-server installations defaulted to 127.0.0.1 as the IP for all components. But the Jinja expression used does not behave that way: the expression obtains the server's IP from hostvars, which is invariably available and is therefore always used, so the fallback never applies.
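To illustrate (a hypothetical sketch of the pattern described, not the repository's exact code): since ansible_default_ipv4.address is defined on every reachable host, a fallback like the one below can never be selected:

# Hypothetical sketch, not the repository's actual expression:
# the default('127.0.0.1') fallback can never trigger, because
# ansible_default_ipv4.address is always defined for a reachable host.
activemq_host: "{{ hostvars[groups['activemq'] | first].ansible_default_ipv4.address | default('127.0.0.1') }}"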

We have corrected the issue by introducing a new custom host variable in the inventory, named dns, to explicitly declare the host name to be used when configuring connectivity between components, on the understanding that it can differ from the Ansible inventory host name in some use cases. In its absence we default to the inventory hostname, which in single-server deployments is localhost. As an example applied to roles/common/defaults/main.yml:

activemq_host: "{% if groups.external_activemq | default(False) %}\
  {{- groups.external_activemq | first -}}\
  {% else %}\
  {% set host_vars=hostvars[groups['activemq']|first] %}\
  {{- (host_vars['dns'] if 'dns' in host_vars else host_vars['inventory_hostname']) | trim -}}\
  {% endif %}"

In the case of repo_hosts the logic is a bit different:

repo_hosts_str: |- 
  [
  {% for host in groups['repository'] %}
    {% if hostvars[host].inventory_hostname == 'localhost' %}
      {
        inventory_name: {{ hostvars[host].inventory_hostname }},
        local_addr: 127.0.0.1,
        dns: localhost,
        cluster_keepoff: {{ hostvars[host].cluster_keepoff | default(false) }}
      },
    {% else %}
      {
        inventory_name: {{ hostvars[host].inventory_hostname }},
        local_addr: {{ hostvars[host].ansible_default_ipv4.address }},
        {% if 'dns' in hostvars[host] %}
        dns: {{ hostvars[host].dns }},
        {% else %}
        dns: {{ hostvars[host].inventory_hostname }},
        {% endif %}
        cluster_keepoff: {{ hostvars[host].cluster_keepoff | default(false) }}
      },
    {% endif %}
  {% endfor %}
  ]
repo_hosts: "{{ repo_hosts_str | from_yaml }}"
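
For reference, with the default inventory_local.yml (a single localhost entry in the repository group), the template above renders, once parsed by from_yaml, to the equivalent of:

# Rendered result for a localhost-only inventory (illustration only):
repo_hosts:
  - inventory_name: localhost
    local_addr: 127.0.0.1
    dns: localhost
    cluster_keepoff: false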

The repo_host variable, defined as the address of the first repository host in the inventory, does not seem to be used anywhere in the playbooks.

There are other files where IPs are used instead of DNS names, such as the template for custom-slingshot-application-context.xml. We generally consider it a best practice to use DNS names in all configuration, to protect the deployment from IP changes, which can happen for many reasons in traditional on-prem environments and especially in cloud setups.

Target OS

Any

Host OS

Any

Playbook version

1.2.0 onwards

Ansible error

No related Ansible error

Ansible context

Not relevant

ansible --version

Not relevant

ansible-config dump --only-changed

Not relevant

ansible-inventory -i your_inventory_file --graph

Default inventory_local.yml

pip list

Not relevant

gionn commented 9 months ago

Hello, I understand that hardcoding IP addresses in configuration files is not really a good practice, but I guess we cannot assume that every deployment involving multiple nodes will have full DNS resolution working on every node.

On the other side, nothing prevents you from running the playbook again after infrastructure changes, so that IP addresses in configuration files eventually get refreshed, but I understand it can be tedious to do, especially after recovering from unplanned outages.

I feel we should take this issue into consideration, but atm I'm not sure the direction we should take is really to handle host references in two different possible ways (IP address, or FQDN when DNS resolution is available).

jalvarezferr commented 9 months ago

It is not just a recovery scenario, where manual recovery activities can be expected; in our case it is the setup of an autoscaling mechanism that automatically fails over to another server. Executing the playbooks again on a recovered server, possibly a long time after the image was taken, is not an option in an automated scenario like that: no matter how much we strive for it, no one can guarantee they will still run without error. The AMI must boot into a new instance and be operative. The alternative is to develop a playbook that reconfigures all the IPs in all the places and runs as part of the new server's boot. All of that is in any case unnecessary and illogical in a single-server scenario, where simply configuring everything with the loopback address, as we did, is what used to be done and what it is logical to do. As I mentioned, the current playbooks use a Jinja expression that defaults to 127.0.0.1 like the old ones did, but in a way that makes the default never apply to the intended case (the single-server deploy). At least that expression needs fixing, and that would be enough for the single-server case.
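
For illustration, a minimal fix for the single-server case could look like this (a hedged sketch following the structure of the expression shown earlier, not a tested patch): fall back to the loopback address whenever the target host is the implicit localhost:

# Sketch of a possible fix; untested, for illustration only.
activemq_host: "{% if groups.external_activemq | default(False) %}\
  {{- groups.external_activemq | first -}}\
  {% else %}\
  {% set host = hostvars[groups['activemq'] | first] %}\
  {{- '127.0.0.1' if host.inventory_hostname == 'localhost' else host.ansible_default_ipv4.address -}}\
  {% endif %}"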

jalvarezferr commented 9 months ago

On a side note about this same topic: configuring NGINX with localhost has the side effect of causing errors when it tries the IPv6 [::1] loopback address, which is preferred by the system, because Tomcat is only listening on IPv4 due to the use of the java.net.preferIPv4Stack directive. Connections eventually succeed because NGINX then tries 127.0.0.1, which localhost also resolves to, but it is at minimum an annoyance that pollutes the error log, and it can also be seen as a (minimal) performance degradation.
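
One way to avoid the noise (a hypothetical snippet, not the repository's actual template; the port is the Tomcat default and may differ) is to point NGINX at the IPv4 loopback explicitly rather than at the name localhost:

# Hypothetical NGINX vhost fragment: 127.0.0.1 avoids the failed [::1]
# attempts that occur when the upstream is referenced as 'localhost'.
location / {
    proxy_pass http://127.0.0.1:8080;
}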

We'd like to know if there is some profound reason for explicitly restricting Tomcat to IPv4 before going ahead and removing that flag to make IPv6 connections work.