freeipa / ansible-freeipa

Ansible roles and modules for FreeIPA
GNU General Public License v3.0

ipaserver_setup_ca.py fails on first run when using ipaserver role #1084

Open beargiles opened 1 year ago

beargiles commented 1 year ago

I'm consistently seeing this problem when running the ipaserver role for the first time. It takes several minutes to fail - timeout?

      File "/tmp/ansible_freeipa.ansible_freeipa.ipaserver_setup_ca_payload_tct2yhsi/ansible_freeipa.ansible_freeipa.ipaserver_setup_ca_payload.zip/ansible_collections/freeipa/ansible_freeipa/plugins/modules/ipaserver_setup_ca.py", line 417, in <module>
      File "/tmp/ansible_freeipa.ansible_freeipa.ipaserver_setup_ca_payload_tct2yhsi/ansible_freeipa.ansible_freeipa.ipaserver_setup_ca_payload.zip/ansible_collections/freeipa/ansible_freeipa/plugins/modules/ipaserver_setup_ca.py", line 379, in main
      File "/usr/lib/python3.9/site-packages/ipaserver/install/ca.py", line 404, in install_step_0
        ca.configure_instance(
      File "/usr/lib/python3.9/site-packages/ipaserver/install/cainstance.py", line 506, in configure_instance
        self.start_creation(runtime=runtime)
      File "/usr/lib/python3.9/site-packages/ipaserver/install/service.py", line 686, in start_creation
        run_step(full_msg, method)
      File "/usr/lib/python3.9/site-packages/ipaserver/install/service.py", line 672, in run_step
        method()
      File "/usr/lib/python3.9/site-packages/ipaserver/install/cainstance.py", line 646, in __spawn_instance
        DogtagInstance.spawn_instance(
      File "/usr/lib/python3.9/site-packages/ipaserver/install/dogtaginstance.py", line 227, in spawn_instance
        self.handle_setup_error(e)
      File "/usr/lib/python3.9/site-packages/ipaserver/install/dogtaginstance.py", line 604, in handle_setup_error
        raise RuntimeError(
    RuntimeError: CA configuration failed.

It appears to succeed on the second try. ("Appears" since I haven't tested it.)

I have a ton of additional documentation but I just remembered that I already have a task that resets the CA in addition to removing the ipaserver role. I suspect at least part of the problem is a race condition and a missing directory or file.

---
- name: uninstall-ipa-server | Uninstall prior dogtag PKI
  become: true
  ansible.builtin.command: pkidestroy -s CA -i pki-tomcat
  failed_when: false

- name: uninstall-ipa-server | Remove prior dogtag PKI files
  become: true
  ansible.builtin.file:
    path: '{{ item }}'
    state: absent
  loop:
    - /etc/pki/pki-tomcat
    - /etc/sysconfig/pki-tomcat
    - /etc/sysconfig/pki/tomcat/pki-tomcat
    - /var/lib/pki/pki-tomcat
    - /var/log/pki/pki-tomcat

- name: uninstall-ipa-server | Uninstall prior IPA server
  become: true
  vars:
    ipaserver: '{{ ipaserver_hostname }}'
    # The role reads `state` as a variable; it is not an import_role option.
    state: absent
  ansible.builtin.import_role:
    name: freeipa.ansible_freeipa.ipaserver
  register: ipa_installation_resp

beargiles commented 1 year ago

I forgot to add the few logs in /var/log/pki/pki-tomcat

localhost_access_log.2023-05-01.txt

35.92.229.238 - - [01/May/2023:09:56:41 -0700] "GET / HTTP/1.1" 302 -
35.92.229.238 - - [01/May/2023:09:56:41 -0700] "GET /pki HTTP/1.1" 302 -
35.92.229.238 - - [01/May/2023:09:56:55 -0700] "GET /pki/ HTTP/1.1" 200 3500
35.92.229.238 - - [01/May/2023:09:59:49 -0700] "GET /ca/admin/ca/getStatus HTTP/1.1" 200 119
35.92.229.238 - - [01/May/2023:09:59:49 -0700] "GET /ca/admin/ca/getStatus HTTP/1.1" 200 119

I've been unable to attach the ca/debug.log for some reason, and at 107 lines I would prefer to not inline it, but it only shows "INFO" and seems to succeed.

Finally the ca/signedAudit/ca_audit has

0.https-jsse-nio-8443-exec-1 - [01/May/2023:09:56:38 PDT] [14] [6] [AuditEvent=ACCESS_SESSION_ESTABLISH][ClientIP=--][ServerIP=--][SubjectID=--][Outcome=Success] access session establish success
0.https-jsse-nio-8443-exec-4 - [01/May/2023:09:56:57 PDT] [14] [6] [AuditEvent=ACCESS_SESSION_ESTABLISH][ClientIP=--][ServerIP=--][SubjectID=--][Outcome=Success] access session establish success
0.https-jsse-nio-8443-exec-5 - [01/May/2023:09:59:47 PDT] [14] [6] [AuditEvent=ACCESS_SESSION_ESTABLISH][ClientIP=--][ServerIP=--][SubjectID=--][Outcome=Success] access session establish success
0.https-jsse-nio-8443-exec-5 - [01/May/2023:09:59:50 PDT] [14] [6] [AuditEvent=ACCESS_SESSION_TERMINATED][ClientIP=--][ServerIP=--][SubjectID=--][Outcome=Success][Info=serverAlertSent: CLOSE_NOTIFY] access session terminated
0.https-jsse-nio-8443-exec-4 - [01/May/2023:09:59:50 PDT] [14] [6] [AuditEvent=ACCESS_SESSION_TERMINATED][ClientIP=--][ServerIP=--][SubjectID=--][Outcome=Success][Info=serverAlertSent: CLOSE_NOTIFY] access session terminated

beargiles commented 1 year ago

Two oops!

The first is that I was trying to install ipaserver on Rocky Linux 9. That's not documented as a supported platform yet - but it's clearly very close to being usable.

The second is that I've been running my tests in several terminals and somehow overlooked that the most recent test (using Rocky Linux 8) hadn't set up the proper virtualenv. I don't think this would have affected the RL9 tests but I can double-check that.

beargiles commented 1 year ago

Hmm... the run without the molecule virtualenv failed after about 10 minutes with an error message about timing out while waiting for sudo permissions (!). It happened in the context of connecting to the DBus.

I nuked the prior instance and am now creating a new one while using the molecule virtualenv. I'm stuck at the same place, this time for 25+ minutes and counting.

t-woerner commented 1 year ago

Hello, do you have errors in the ipaserver-install.log file? The long time (to fail) smells like a DNS issue or a memory issue. Which ansible-freeipa version are you using? Please provide more information about the parameters for the server deployment. Are you configuring the DNS server?
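
A minimal sketch of the first check asked for above: pull the lines that look like failures out of the installer log. The default path and the marker strings are assumptions (the path is where the installer log is usually found, and the markers are just a reasonable starting list), so adjust both for the host being debugged.

```python
from pathlib import Path

# Assumed markers; they are not taken from the FreeIPA source.
MARKERS = ("ERROR", "CRITICAL", "Traceback")


def suspicious_lines(lines, markers=MARKERS):
    """Return (line_number, text) pairs for lines containing any marker."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if any(marker in line for marker in markers):
            hits.append((number, line.rstrip("\n")))
    return hits


def scan_log(path="/var/log/ipaserver-install.log"):
    """Scan a log file on disk; the default path is an assumption."""
    with Path(path).open(errors="replace") as handle:
        return suspicious_lines(handle)
```

Calling `scan_log()` on the IPA host returns the matching lines with their line numbers, which is usually enough to decide which part of the log to paste into the issue.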

beargiles commented 1 year ago

(Requested information to follow - I have to step away for a meeting but wanted to provide the most recent information, and some context, first.)

For context, this is a molecule test using a slightly modified ec2 driver from ansible-community/molecule-plugins. I wrote ansible scripts about 6 months ago that set up a baseline AMI but didn't have time to write the molecule tests so I could close the JIRA tickets. Now that I have the time I'm doing a slight refactoring so that I'll be using an Ansible collection with specialized roles instead of a single role with nearly two dozen specialized tasks. This is partly for clarity, partly for potential security auditing.

I let the latest tests run overnight and after nearly 2 hours(!) I saw the error below. That reminded me of issues with LDAP listening only on the IPv6 address, which caused failures since the CA stores its keys in LDAP. I vaguely recall changing a configuration property so I'll check that today.

An exception occurred during task execution. To see the full traceback, use -vvv. The error was: RuntimeError: Unable to retrieve CA chain: [Errno 111] Connection refused

...

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "/home/rocky/.ansible/tmp/ansible-tmp-1682997104.0147376-527154-168251518675078/AnsiballZ_ipaserver_setup_ca.py", line 107, in <module>
        _ansiballz_main()
      File "/home/rocky/.ansible/tmp/ansible-tmp-1682997104.0147376-527154-168251518675078/AnsiballZ_ipaserver_setup_ca.py", line 99, in _ansiballz_main
        invoke_module(zipped_mod, temp_path, ANSIBALLZ_PARAMS)
      File "/home/rocky/.ansible/tmp/ansible-tmp-1682997104.0147376-527154-168251518675078/AnsiballZ_ipaserver_setup_ca.py", line 48, in invoke_module
        run_name='__main__', alter_sys=True)
      File "/usr/lib64/python3.6/runpy.py", line 205, in run_module
        return _run_module_code(code, init_globals, run_name, mod_spec)
      File "/usr/lib64/python3.6/runpy.py", line 96, in _run_module_code
        mod_name, mod_spec, pkg_name, script_name)
      File "/usr/lib64/python3.6/runpy.py", line 85, in _run_code
        exec(code, run_globals)
      File "/tmp/ansible_freeipa.ansible_freeipa.ipaserver_setup_ca_payload_nqhe55id/ansible_freeipa.ansible_freeipa.ipaserver_setup_ca_payload.zip/ansible_collections/freeipa/ansible_freeipa/plugins/modules/ipaserver_setup_ca.py", line 417, in <module>
      File "/tmp/ansible_freeipa.ansible_freeipa.ipaserver_setup_ca_payload_nqhe55id/ansible_freeipa.ansible_freeipa.ipaserver_setup_ca_payload.zip/ansible_collections/freeipa/ansible_freeipa/plugins/modules/ipaserver_setup_ca.py", line 379, in main
      File "/usr/lib/python3.6/site-packages/ipaserver/install/ca.py", line 355, in install_step_0
        pki_config_override=options.pki_config_override,
      File "/usr/lib/python3.6/site-packages/ipaserver/install/cainstance.py", line 501, in configure_instance
        self.start_creation(runtime=runtime)
      File "/usr/lib/python3.6/site-packages/ipaserver/install/service.py", line 635, in start_creation
        run_step(full_msg, method)
      File "/usr/lib/python3.6/site-packages/ipaserver/install/service.py", line 621, in run_step
        method()
      File "/usr/lib/python3.6/site-packages/ipaserver/install/cainstance.py", line 851, in __request_ra_certificate
        chain = self.__get_ca_chain()
      File "/usr/lib/python3.6/site-packages/ipaserver/install/cainstance.py", line 804, in __get_ca_chain
        raise RuntimeError("Unable to retrieve CA chain: %s" % str(e))
    RuntimeError: Unable to retrieve CA chain: [Errno 111] Connection refused
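
One way to narrow down a "Connection refused" like the one in the traceback above is to try a plain TCP connect against every address the server name resolves to, so that IPv4 and IPv6 results can be compared side by side. This is only a sketch using the standard library; the function name `probe` is made up for illustration, and which port to test is an assumption (8443 is the port named by the Tomcat threads in the audit log earlier in this thread, 389 is the standard LDAP port).

```python
import socket


def probe(host, port, timeout=3.0):
    """Try a TCP connect to every address `host` resolves to.

    Returns a list of (address, outcome) pairs, where outcome is "open"
    or the error text, for example "[Errno 111] Connection refused".
    Name resolution errors (socket.gaierror) are left to propagate.
    """
    results = []
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    for family, socktype, proto, _canonname, sockaddr in infos:
        try:
            with socket.socket(family, socktype, proto) as sock:
                sock.settimeout(timeout)
                sock.connect(sockaddr)
            outcome = "open"
        except OSError as exc:
            outcome = str(exc)
        results.append((sockaddr[0], outcome))
    return results
```

Running `probe(hostname, 8443)` and `probe(hostname, 389)` on the IPA host itself, with `hostname` set to the value passed as `ipaserver_hostname`, shows whether the refusal is tied to one address family or to the service not listening at all.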

That reminds me of earlier issues with the LDAP server only listening on the IPv6 port. I t

The 'waiting for privilege escalation prompt' is definitely unrelated since it only took 12s and the ansible host requires a sudo password.

Checking what has changed at this point:

abbra commented 1 year ago

It is perfectly fine that it only listens on the IPv6 port. Please read the man page for ipv6 to see how the modern network stack works in Linux:

IPv4 connections can be handled with the v6 API by using the v4-mapped-on-v6 address type; thus a program needs to support only this API type to support both protocols. This is handled transparently by the address handling functions in the C library. IPv4 and IPv6 share the local port space. When you get an IPv4 connection or packet to an IPv6 socket, its source address will be mapped to v6.
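
The behaviour described in that man page excerpt can be observed with a few lines of Python. The sketch below is not related to how the directory server is implemented; it only opens an IPv6 listening socket, connects to it over IPv4 loopback, and returns the peer address the server side sees. The function name is made up for illustration, and it returns `None` on hosts without a usable IPv6 stack.

```python
import socket


def mapped_peer_of_ipv4_client():
    """Accept one IPv4 connection on an IPv6 socket and return the peer address."""
    try:
        server = socket.socket(socket.AF_INET6, socket.SOCK_STREAM)
    except OSError:
        return None  # no IPv6 support on this host
    try:
        server.settimeout(5)
        # 0 means the socket also accepts IPv4 (dual-stack).
        server.setsockopt(socket.IPPROTO_IPV6, socket.IPV6_V6ONLY, 0)
        server.bind(("::", 0))  # port 0: let the kernel pick a free port
        server.listen(1)
        port = server.getsockname()[1]
        # Connect as an ordinary IPv4 client.
        with socket.create_connection(("127.0.0.1", port), timeout=5):
            conn, peer = server.accept()
            conn.close()
        return peer[0]
    except OSError:
        return None
    finally:
        server.close()
```

On a Linux host with IPv6 enabled the returned address is expected to be the v4-mapped form of the loopback address, which is why a listener shown only on the IPv6 wildcard can still serve IPv4 clients.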

beargiles commented 1 year ago

I've already tried adding {{ ansible_hosts.all_ipv6_addresses }} to the list of IP addresses and it kicked it back since the only IPv6 address provided was the loopback. That doesn't mean the LDAP server won't be happy starting up - but since I can't provide that IP address in the settings then the CA may not know to try that address. Maybe.

It should be easy to modify the EC2 instance so it requests an IPv6 address and retry.

beargiles commented 1 year ago

When I add '::ffff:{{ ansible_host.default_ipv6.address }}' I get

TASK [freeipa.ansible_freeipa.ipaserver : Install - Server preparation] ********
Tuesday 02 May 2023  12:34:26 -0600 (0:00:00.033)       0:02:22.010 ***********
Tuesday 02 May 2023  12:34:26 -0600 (0:00:00.033)       0:02:22.009 ***********
fatal: [molecule-test-freeipa]: FAILED! => changed=false 
  msg: 'Invalid IP Address ::ffff:10.42.73.190: cannot use IANA reserved IP address ::ffff:10.42.73.190'

which seems a little odd since it had no problem accepting the same IPv4 address.

The default IPv6 addresses are either the loopback (host) or link-local (fe80::) so I see a similar failure message, only this time because it's link-local scope.
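
To see at a glance which category each candidate address falls into before passing it to `ipaserver_ip_addresses`, a small helper built on Python's standard `ipaddress` module can help. This is a sketch only: it is not FreeIPA's validation code, the labels are chosen to mirror the wording used in this thread, and the function name `scope_of` is made up for illustration.

```python
import ipaddress


def scope_of(text):
    """Label an address as v4-mapped, loopback, link-local, global or other."""
    ip = ipaddress.ip_address(text)
    # Check the v4-mapped form first, since it is a special case of IPv6.
    if ip.version == 6 and ip.ipv4_mapped is not None:
        return "ipv4-mapped ({})".format(ip.ipv4_mapped)
    if ip.is_loopback:
        return "loopback"
    if ip.is_link_local:
        return "link-local"
    if ip.is_global:
        return "global"
    return "private/other"
```

For the addresses mentioned here, `scope_of("::ffff:10.42.73.190")` reports the v4-mapped form together with the embedded IPv4 address, while `fe80::` addresses come back as link-local.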

Finally I created a new subnet that auto-assigns an IPv6 address out of a range managed by AWS - so it's 'global' scope. However I still see a hang at 'Setup CA'. I'm heading into another meeting so I can let it run for a while to see if an error message ever shows up.

FWIW the values I'm sending to the ipaserver role are

ok: [molecule-test-freeipa] => 
  msg: |-
    ipadm_password: DMPassword1

    ipaadmin_principal: admin
    ipaadmin_password: ADMPassword1

    ipaserver: ip-10-42-73-190.us-west-2.compute.internal
    ipaserver_ip_addresses: ['10.42.73.190', '2600:1f13:973:4700:eb7e:39d6:129e:f564']
    ipaserver_domain: example.com
    ipaserver_realm: EXAMPLE.COM
    ipaserver_hostname: ip-10-42-73-190.us-west-2.compute.internal
    ipaserver_no_host_dns: true

    ipaserver_subject_base: dc=example,dc=com
    ipaserver_ca_subject: cn=Certificate Authority,dc=example,dc=com

    ipaserver_setup_dns: true
    ipaserver_allow_zone_overlap: true
    ipaserver_auto_forwarders: true

    ipaserver_setup_firewalld: false

and the ipaserver, ipaserver_hostname, and first of the ipaserver_ip_addresses all match.

.....

Separately I just noticed that this test is still using the default instance size for other tests - that's wildly too small. I've bumped the instance size to 'medium'.
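
Since instance size is a suspect, a quick pre-flight check of total RAM on the target can save a long wait. The sketch below parses `/proc/meminfo`-style text; `required_mib` is a caller-chosen, hypothetical threshold and not a documented FreeIPA minimum.

```python
def mem_total_mib(meminfo_text):
    """Extract MemTotal from /proc/meminfo-style text and return it in MiB."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            kib = int(line.split()[1])  # the kernel reports this value in kB
            return kib // 1024
    raise ValueError("MemTotal not found")


def enough_memory(meminfo_text, required_mib):
    """True when total RAM meets `required_mib` (hypothetical threshold)."""
    return mem_total_mib(meminfo_text) >= required_mib
```

On the instance itself this can be called as `enough_memory(open("/proc/meminfo").read(), required_mib=2048)`, where 2048 is only a placeholder value.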

beargiles commented 1 year ago

The script successfully completed with the explicit addition of a global IPv6 address to ipaserver_ip_addresses.

It also succeeds if a global IPv6 address is available but not present in 'ipaserver_ip_addresses'.

Retrying it without a global IPv6 address available.

beargiles commented 1 year ago

Grumble - I've backed out a ton of stuff and include_role still completes. Even 'Rocky Linux 9' works!

At this point I think the only thing remaining is reverting the size of the instance and enabling the 'mem check' flag. I knew it tests more than just the available memory but didn't think to enable it.

I'll mark this closed in a moment but wanted to ask a question before submitting a ticket for it. I know that there are some significant differences between testing individual roles and ansible collections. It looks like the existing tests all use docker - which is fine for testing the ansible code itself.

However docker-based tests have a significant drawback - some platforms require a little more work. E.g., some services need to return an IP address for further work (e.g., an HDFS NameNode provides information about DataNodes) and EC2 instances don't know anything about their public IP address(es). You have to take a few extra steps.
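
As an illustration of those extra steps, an instance can ask the EC2 instance metadata service for its own public address. The sketch below uses the token-based IMDSv2 flow with only the standard library; the function names and the injectable `http` parameter are made up for illustration so the lookup can be exercised without network access. An instance without a public address gets an HTTP error from the second request, which this sketch leaves to the caller.

```python
import urllib.request

IMDS = "http://169.254.169.254/latest"


def _http(method, url, headers):
    """Default transport: a plain urllib call with a short timeout."""
    request = urllib.request.Request(url, method=method, headers=headers)
    with urllib.request.urlopen(request, timeout=2) as response:
        return response.read().decode().strip()


def public_ipv4(http=_http):
    """Ask the instance metadata service (IMDSv2) for the public IPv4 address."""
    # Step 1: obtain a short-lived session token.
    token = http("PUT", IMDS + "/api/token",
                 {"X-aws-ec2-metadata-token-ttl-seconds": "60"})
    # Step 2: use the token to read the metadata entry.
    return http("GET", IMDS + "/meta-data/public-ipv4",
                {"X-aws-ec2-metadata-token": token})
```

Calling `public_ipv4()` only works from inside an EC2 instance; in a test, a fake `http` callable can stand in for the metadata service.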

Is it worth the effort to create an issue that provides my ec2-based test? It's not adding a lot - but it might be enough to save other people some effort when they're trying to deploy to EC2.