TimL20 commented 1 year ago

I'm seeing a very weird issue, and I'm running out of ideas how to even debug it, so here is a lot of text...

It does not look like this is a bug with Molecule/Molecule-plugins, but removing those I can't reproduce the bug at all (see further down), therefore opening it here.

Background, environment, setup

Server srv1234.localdomain
- (I edited that hostname and domain for this issue everywhere here)
- Ubuntu 22.04
- Podman is setup for podman-remote rootless for user podmanremote with UID 3023
- Podman verion is 3.4.4 (installed via apt-get from official Ubuntu repo)
- This I have confirmed to work from other machines
Gitlab CI Runner
- I tested on two setups, for one the Runner was a Docker Container, for the other on Kubernetes, outcome is the same
- Image is an (adjusted) Ubuntu 22.04 with Ansible, Molecule and more installed
- Also with this dirty hack:
```
echo '#!/bin/sh\npodman --remote "$@"' > /usr/local/bin/podman-remote
chmod 0777 /usr/local/bin/podman-remote
```
- Ubuntu 22.04 does not have podman-remote, only podman --remote, which does not seem to work directly with Molecule-plugins (???)
- I did not have a better idea (other than using a different distro, that's maybe later)
- podman-remote because I do not control the Kubernetes cluster, and by default it does not allow to mount a filesystem, which I would need to run podman directly
- Again, this is just because I had no better idea

Here is the before_script Part and its output (I added some comments with #), this all works:

# Debug Ansible version
$ ansible --version
ansible [core 2.14.3]
  config file = None
  configured module search path = ['/root/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/local/lib/python3.10/dist-packages/ansible
  ansible collection location = /root/.ansible/collections:/usr/share/ansible/collections
  executable location = /usr/local/bin/ansible
  python version = 3.10.6 (main, Nov 14 2022, 16:10:14) [GCC 11.3.0] (/usr/bin/python3)
  jinja version = 3.1.2
  libyaml = True
# Debug Molecule version
$ molecule --version
molecule 4.0.4 using python 3.10
    ansible:2.14.3
    azure:23.0.0 from molecule_plugins
    containers:23.0.0 from molecule_plugins requiring collections: ansible.posix>=1.3.0 community.docker>=1.9.1 containers.podman>=1.8.1
    delegated:4.0.4 from molecule
    docker:23.0.0 from molecule_plugins requiring collections: community.docker>=3.0.2 ansible.posix>=1.4.0
    ec2:23.0.0 from molecule_plugins
    gce:23.0.0 from molecule_plugins requiring collections: google.cloud>=1.0.2 community.crypto>=1.8.0
    podman:23.0.0 from molecule_plugins requiring collections: containers.podman>=1.7.0 ansible.posix>=1.3.0
    vagrant:23.0.0 from molecule_plugins

# Check SSH connection: sanity check && it adds the host key to known_hosts which can fix issues
$ ssh -i "$REMOTE_PODMAN_SSH_KEY" -o PasswordAuthentication=no -o StrictHostKeyChecking=no podmanremote@srv1234.localdomain '/bin/sh -c "echo 'SSH connection successful'"'
Warning: Permanently added 'srv1234.localdomain' (ED25519) to the list of known hosts.
SSH connection successful
# Setup podman-remote connection, set it to default, use key auth
# srv1234.localdomain is the remote host, where Podman is set up (rootless)
# podmanremote with UID 3023 is the user on the remote machine
$ podman-remote system connection add --default podmanremote --identity "$REMOTE_PODMAN_SSH_KEY" ssh://podmanremote@srv1234.localdomain/run/user/3023/podman/podman.sock

# Check remote connection setup
$ podman-remote info --debug
host:
  arch: amd64
  buildahVersion: 1.23.1
  cgroupControllers:
  - memory
  - pids
  cgroupManager: systemd
  cgroupVersion: v2
  conmon:
    package: Unknown
    path: /usr/local/libexec/podman/conmon
    version: 'conmon version 2.1.7, commit: 4192e84f9d2edd07ee142d04fa5614e71f3a8349'
  cpus: 2
  distribution:
    codename: jammy
    distribution: ubuntu
    version: "22.04"
  eventLogger: journald
  hostname: srv1234
  idMappings:
    gidmap:
    - container_id: 0
      host_id: 3023
      size: 1
    - container_id: 1
      host_id: 165536
      size: 65536
    uidmap:
    - container_id: 0
      host_id: 3023
      size: 1
    - container_id: 1
      host_id: 165536
      size: 65536
  kernel: 5.15.0-1029-kvm
  linkmode: dynamic
  logDriver: journald
  memFree: 651550720
  memTotal: 4130914304
  ociRuntime:
    name: crun
    package: 'crun: /usr/bin/crun'
    path: /usr/bin/crun
    version: |-
      crun version 0.17
      commit: 0e9229ae34caaebcb86f1fde18de3acaf18c6d9a
      spec: 1.0.0
      +SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +YAJL
  os: linux
  remoteSocket:
    exists: true
    path: /run/user/3023/podman/podman.sock
  security:
    apparmorEnabled: false
    capabilities: CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
    rootless: true
    seccompEnabled: true
    seccompProfilePath: /usr/share/containers/seccomp.json
    selinuxEnabled: false
  serviceIsRemote: true
  slirp4netns:
    executable: /usr/bin/slirp4netns
    package: 'slirp4netns: /usr/bin/slirp4netns'
    version: |-
      slirp4netns version 1.0.1
      commit: 6a7b16babc95b6a3056b33fb45b74a6f62262dd4
      libslirp: 4.6.1
  swapFree: 0
  swapTotal: 0
  uptime: 21h 40m 44.66s (Approximately 0.88 days)
plugins:
  log:
  - k8s-file
  - none
  - journald
  network:
  - bridge
  - macvlan
  volume:
  - local
registries: {}
store:
  configFile: /home/podmanremote/.config/containers/storage.conf
  containerStore:
    number: 27
    paused: 0
    running: 7
    stopped: 20
  graphDriverName: overlay
  graphOptions: {}
  graphRoot: /home/podmanremote/.local/share/containers/storage
  graphStatus:
    Backing Filesystem: extfs
    Native Overlay Diff: "true"
    Supports d_type: "true"
    Using metacopy: "false"
  imageStore:
    number: 10
  runRoot: /run/user/3023/containers
  volumePath: /home/podmanremote/.local/share/containers/storage/volumes
version:
  APIVersion: 3.4.4
  Built: 0
  BuiltTime: Thu Jan  1 01:00:00 1970
  GitCommit: ""
  GoVersion: go1.17.3
  OsArch: linux/amd64
  Version: 3.4.4

# Make molecule-plugins use podman-remote instead of using Podman directly
# This "redirects" to "podman --remote", setup in the Dockerfile of the controller image
$ export MOLECULE_PODMAN_EXECUTABLE="podman-remote"

Molecule & error

As the actual step (that fails) I'm running molecule converge for now (to look into the containers) with ~ this config:

dependency:
  name: galaxy
driver:
  name: podman
platforms:
  - name: ubuntu2204
    image: library/ubuntu:22.04
  - name: ubuntu2004
    image: library/ubuntu:20.04
  - name: ubuntu1804
    image: library/ubuntu:18.04
  - name: debian10
    image: library/debian:10
  - name: debian11
    image: library/debian:11
  - name: rhel9
    image: redhat/ubi9-init
  - name: rhel8
    image: redhat/ubi8-init
provisioner:
  name: ansible
  config_options:
    defaults:
      interpreter_python: auto_silent
      callbacks_enabled: profile_tasks, timer, yaml
  env:
    ANSIBLE_VERBOSITY: 3  # 3/4/5 for debugging
verifier:
  name: ansible
scenario:
  name: default
  test_sequence:
    - create
    - prepare
    - converge
    - idempotence
    - verify
    - destroy

Molecule always creates the instances correctly, I checked it on the remote host as the podmanremote user, I can exec -ti into those podman containers and everything works fine. I also setup podman-remote like on the Gitlab CI Runner for another user, and everything works fine.

The actual issue appears either during prepare or converge steps of Molecule, see below for what happens when. After testing for long I'm very sure that the issue does not have to do with the Ansible task executed when it appears. Here is the most relevant part of what I see with ANSIBLE_VERBOSITY=3:

TASK [Install required packages] ***********************************************
task path: /builds/mydepartment/ansible/roles/mycompany.telegraf/molecule/default/prepare.yml:62
Friday 17 March 2023  14:35:10 +0000 (0:00:00.489)       0:01:05.869 **********
redirecting (type: connection) ansible.builtin.podman to containers.podman.podman
Loading collection containers.podman from /root/.cache/ansible-compat/e3e73f/collections/ansible_collections/containers/podman
Using podman connection from collection
Running ansible.legacy.apt

<ubuntu2204> RUN [b'/usr/local/bin/podman-remote', b'mount', b'ubuntu2204']
Failed to mount container ubuntu2204: b"Error: unrecognized command `podman mount`\nTry 'podman --help' for more information."
<ubuntu2204> RUN [b'/usr/local/bin/podman-remote', b'exec', b'ubuntu2204', b'/bin/sh', b'-c', b'echo ~ && sleep 0']
Using module file /usr/local/lib/python3.10/dist-packages/ansible/modules/apt.py
<ubuntu2204> RUN [b'/usr/local/bin/podman-remote', b'exec', b'ubuntu2204', b'/bin/sh', b'-c', b'( umask 77 && mkdir -p "` echo /root/.ansible/tmp `"&& mkdir "` echo /root/.ansible/tmp/ansible-tmp-1679063713.1625075-2880-206913881931277 `" && echo ansible-tmp-1679063713.1625075-2880-206913881931277="` echo /root/.ansible/tmp/ansible-tmp-1679063713.1625075-2880-206913881931277 `" ) && sleep 0']

fatal: [ubuntu2204]: FAILED! => {
    "msg": "failed to resolve remote temporary directory from ansible-tmp-1679063713.1625075-2880-206913881931277: `( umask 77 && mkdir -p \"` echo /root/.ansible/tmp `\"&& mkdir \"` echo /root/.ansible/tmp/ansible-tmp-1679063713.1625075-2880-206913881931277 `\" && echo ansible-tmp-1679063713.1625075-2880-206913881931277=\"` echo /root/.ansible/tmp/ansible-tmp-1679063713.1625075-2880-206913881931277 `\" )` returned empty string"
}

After that fatal error Ansible completely stops executing all actions for that instance (here ubuntu2204); that fatal error is also all I see without verbosity. After trying with ANSIBLE_VERBOSITY=5 I can confirm, that stdout (and also stderr) is an empty string indeed; also RC is 0 for that command. On the remote host however I can see with ANSIBLE_KEEP_REMOTE_FILES=1, that the directory was correctly created (empty of cause because Ansible stopped after creation for that instance/container):

tim@srv1234:~$ podman-remote exec -ti ubuntu2204 bash
root@ubuntu2204:/# cd /root/.ansible/tmp
root@ubuntu2204:~/.ansible/tmp# ls -la
total 28
drwx------ 7 root root 4096 Mar 17 14:35 .
drwx------ 3 root root 4096 Mar 17 14:34 ..
drwx------ 2 root root 4096 Mar 17 14:34 ansible-tmp-1679063647.1856043-757-97474892703325
drwx------ 2 root root 4096 Mar 17 14:34 ansible-tmp-1679063657.7830272-1303-25603066793554
drwx------ 2 root root 4096 Mar 17 14:34 ansible-tmp-1679063667.9987888-1957-160947105671679
drwx------ 2 root root 4096 Mar 17 14:34 ansible-tmp-1679063681.3135738-2315-125828545605814
drwx------ 2 root root 4096 Mar 17 14:35 ansible-tmp-1679063713.1625075-2880-206913881931277
root@ubuntu2204:~/.ansible/tmp# ls -la /root/.ansible/tmp/ansible-tmp-1679063713.1625075-2880-206913881931277
total 8
drwx------ 2 root root 4096 Mar 17 14:35 .
drwx------ 7 root root 4096 Mar 17 14:35 ..

When does this occur

Well, that's the thing: spontaneously, not always, not always for the same instance (container), not always for the same task. With my 7 instances I see the issue almost always for at least one, with 5 instances I have more runs without seeing the error. Every playbook for every instance has run through without this error at some point, and a few times this error did not occur at all, but repeating without changing anything at all the error appeared again. I've seen it happing for two instances (at the same task) in the same run, but mostly it only happens for one instance (always a different one), and the others run through without problems. There seems to be a pattern to when this fail occurs though:

With ANSIBLE_VERBOSITY=0 at Task 9 or 10 of the converge playbook
With ANSIBLE_VERBOSITY=1 at Task 8 or 9 of the converge playbook
With ANSIBLE_VERBOSITY=3 at Task 1 or 2 (after gather facts) of the prepare playbook (so way earlier)
With ANSIBLE_VERBOSITY=5 at Task 1 of the prepare playbook or at gathering facts (of prepare, so one task earlier)

That is very interesting. Among others, I found these:

https://github.com/ansible/ansible/issues/13876
- But it sounds like those bugs are fixed
https://groups.google.com/g/ansible-project/c/FyK6au2O9KY/m/PS4yH6Y2AwAJ
- Relation to timing would explain, why a higher verbosity makes the issue appear earlier (because more stdout print takes more time)
- But podman-remote does SSH handling itself, and as far as I can see, podman-remote does not allow any of those SSH options

What else I tried

The code failing is obviously not in Molecule (/-plugins), so I tried to reproduce that issue:

(1) Set up podman-remote on a different Ubuntu 22.04 machine just like I did in the Gitlab CI Image (including the dirty hack and everything). Check directly working with that -> no problems. So I wrote this shell script:

#!/bin/sh
for i in $(seq 0 500)
do
        date
        echo -n "TRY"
        podman-remote exec ubuntu2204 /bin/sh -c "( umask 77 && mkdir -p /tmp/t1/ && mkdir /tmp/t1/$i && echo YES ) && sleep 0"
        echo -n "TRY"
        podman-remote exec ubuntu2004 /bin/sh -c "( umask 77 && mkdir -p /tmp/t1/ && mkdir /tmp/t1/$i && echo YES ) && sleep 0"
        echo -n "TRY"
        podman-remote exec ubuntu1804 /bin/sh -c "( umask 77 && mkdir -p /tmp/t1/ && mkdir /tmp/t1/$i && echo YES ) && sleep 0"
        echo -n "TRY"
        podman-remote exec debian10 /bin/sh -c "( umask 77 && mkdir -p /tmp/t1/ && mkdir /tmp/t1/$i && echo YES ) && sleep 0"
        echo -n "TRY"
        podman-remote exec debian11 /bin/sh -c "( umask 77 && mkdir -p /tmp/t1/ && mkdir /tmp/t1/$i && echo YES ) && sleep 0"
        echo -n "TRY"
        podman-remote exec rhel8 /bin/sh -c "( umask 77 && mkdir -p /tmp/t1/ && mkdir /tmp/t1/$i && echo YES ) && sleep 0"
        echo -n "TRY"
        podman-remote exec rhel9 /bin/sh -c "( umask 77 && mkdir -p /tmp/t1/ && mkdir /tmp/t1/$i && echo YES ) && sleep 0"
        sleep 1
done

I run that against the containers created by molecule, which failed against one of those at some point. The issue did not occur.

(2) I wrote a small Ansible task file

- name: Update apt cache {{ item }}
  ansible.builtin.apt:
    update_cache: yes
  when: ansible_pkg_mgr == "apt"

- name: Update dnf cache {{ item }}
  ansible.builtin.dnf:
    update_cache: yes
  environment:  # Explicitely set proxy
    http_proxy: "{{ lookup('ansible.builtin.env', 'http_proxy') }}"
    https_proxy: "{{ lookup('ansible.builtin.env', 'https_proxy') }}"
    no_proxy: "{{ lookup('ansible.builtin.env', 'no_proxy') }}"
  when: ansible_pkg_mgr == "dnf"

- name: Sleep random amount of time {{ item }}
  ansible.builtin.pause:
    seconds: "{{ [0, 1] | ansible.builtin.random }}"

I let that repeat 50x (via include_tasks in a test playbook) with the exact same Gitlab CI Runner on Kubernetes with the exact same image with the exact same software against the exact podman-remote setup as above. I copied the inventory from what Molecule created in create and used those exact same intances (containers) that it failed against in that (previous) run. The issue does not occur (I tried multiple times obviously).

Conclusion

This is incredibly weird.

The issue does not seem to be with Molecule, but as soon as a eliminate Molecule from my test setup, I don't see the issue anymore. That's why I'm opening the issue here. I tried more things than I have written down here, but I wrote what I think is the most interesting/important to this, as the text was getting very long already...

I mainly have no idea how to debug this further.

apatard commented 1 year ago

Is there an easy way to reproduce it (for instance without Gitlab) ?

TimL20 commented 1 year ago

I have tried several different ways to reproduce it - unfortunately this never happens outside of a Gitlab runner. I tried to replicate the Gitlab Runner setup as much as I could, e.g. running in a Docker Container as Gitlab Runners do (connecting to the same remote host of course). I was unfortunately not able to find a way reproducing this issue without Gitlab Runners... Which is bad, I know, because you can propably not reproduce it. But if you have any idea on how to debug this further, I'm happy to try that out.

calillo commented 1 year ago

I have similar issue on Mac (but also on Windows it's the same) when running molecule inside container (with Docker Desktop). After spending a lot of time trying downgrading versions I realized that the cause was to be found elsewhere and found it in Docker Desktop, in fact natively on Mac (or on the Windows WSL) it worked perfectly. At first, the reasoning was not clear to me but after reading your kernel intuition I realized it could be something like this. In fact months ago everything works and at some point it stopped and returning the error "failed to resolve remote temporary directory ... returned empty string" From here I found out that since Docker Desktop 4.13 they upgraded the kernel from 5.10.x to 5.15.x and this is the root cause. Downgrading Docker Desktop to 4.12 on both Mac and Windows everything started working again. I think this problem also exists on a Linux distribution with a kernel 5.15.x, I can't tell if it's an incompatibility between ansible/molecule and the kernel. I hope someone fix this issue.

ansible-community / molecule-plugins

Podman-remote: failed to resolve remote temporary directory ... returned empty string #113

Background, environment, setup

Molecule & error

When does this occur

What else I tried

Conclusion