Open TimL20 opened 1 year ago
Is there an easy way to reproduce it (for instance without Gitlab) ?
I have tried several different ways to reproduce it; unfortunately this never happens outside of a Gitlab runner. I tried to replicate the Gitlab Runner setup as closely as I could, e.g. running in a Docker container as Gitlab Runners do (connecting to the same remote host, of course). Unfortunately I was not able to find a way to reproduce this issue without Gitlab Runners... Which is bad, I know, because you probably cannot reproduce it. But if you have any idea on how to debug this further, I'm happy to try that out.
I have a similar issue on Mac (and it's the same on Windows) when running Molecule inside a container (with Docker Desktop). After spending a lot of time downgrading versions, I realized that the cause had to be elsewhere and found it in Docker Desktop: natively on Mac (or in WSL on Windows) it worked perfectly. At first the reason was not clear to me, but after reading your kernel intuition I realized it could be something like this. Months ago everything worked, and at some point it stopped and started returning the error "failed to resolve remote temporary directory ... returned empty string". From there I found out that since Docker Desktop 4.13 the kernel was upgraded from 5.10.x to 5.15.x, and this is the root cause. After downgrading Docker Desktop to 4.12 on both Mac and Windows, everything started working again. I think this problem also exists on Linux distributions with a 5.15.x kernel; I can't tell whether it's an incompatibility between Ansible/Molecule and the kernel. I hope someone fixes this issue.
I'm seeing a very weird issue, and I'm running out of ideas how to even debug it, so here is a lot of text...
It does not look like this is a bug in Molecule/Molecule-plugins, but without them I can't reproduce the bug at all (see further down), so I'm opening it here.
Background, environment, setup
podman-remote, only podman --remote, which does not seem to work directly with Molecule-plugins (???)
Here is the before_script part and its output (I added some comments with #), this all works:
Molecule & error
As the actual step (that fails) I'm running molecule converge for now (to look into the containers) with ~ this config:
Molecule always creates the instances correctly; I checked it on the remote host as the podmanremote user, I can exec -ti into those podman containers and everything works fine. I also set up podman-remote like on the Gitlab CI Runner for another user, and everything works fine.
The actual issue appears either during the prepare or converge steps of Molecule; see below for what happens when. After long testing I'm very sure that the issue has nothing to do with the Ansible task being executed when it appears. Here is the most relevant part of what I see with ANSIBLE_VERBOSITY=3:
After that fatal error Ansible completely stops executing all actions for that instance (here ubuntu2204); that fatal error is also all I see without verbosity. After trying with ANSIBLE_VERBOSITY=5 I can confirm that stdout (and also stderr) is indeed an empty string; the RC of that command is also 0. On the remote host, however, I can see with ANSIBLE_KEEP_REMOTE_FILES=1 that the directory was correctly created (empty of course, because Ansible stopped after creation for that instance/container):
When does this occur
Well, that's the thing: spontaneously, not always, not always for the same instance (container), not always for the same task. With my 7 instances I see the issue almost always for at least one; with 5 instances I have more runs without seeing the error. Every playbook for every instance has run through without this error at some point, and a few times the error did not occur at all, but when repeating without changing anything the error appeared again. I've seen it happening for two instances (at the same task) in the same run, but mostly it only happens for one instance (always a different one), and the others run through without problems. There seems to be a pattern to when this failure occurs though:
ANSIBLE_VERBOSITY=0: at Task 9 or 10 of the converge playbook
ANSIBLE_VERBOSITY=1: at Task 8 or 9 of the converge playbook
ANSIBLE_VERBOSITY=3: at Task 1 or 2 (after gather facts) of the prepare playbook (so way earlier)
ANSIBLE_VERBOSITY=5: at Task 1 of the prepare playbook or at gathering facts (of prepare, so one task earlier)
That is very interesting. Among others, I found these:
What else I tried
The code failing is obviously not in Molecule (/-plugins), so I tried to reproduce that issue:
(1) Set up podman-remote on a different Ubuntu 22.04 machine just like I did in the Gitlab CI image (including the dirty hack and everything). Working with that directly caused no problems. So I wrote this shell script:
I ran that against the containers created by Molecule (which had failed against one of them at some point). The issue did not occur.
(2) I wrote a small Ansible task file
I let that repeat 50x (via include_tasks in a test playbook) with the exact same Gitlab CI Runner on Kubernetes, with the exact same image and the exact same software, against the exact same podman-remote setup as above. I copied the inventory from what Molecule created in create and used the exact same instances (containers) that it failed against in that (previous) run. The issue does not occur (I tried multiple times, obviously).
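For anyone who wants to repeat this kind of experiment, here is a minimal Python sketch of such a probe. It is not the original shell script or task file. It runs a given command many times, optionally in parallel, and reports every run that exits with RC 0 but prints nothing to stdout, which is the symptom described above. The names run_once and find_silent_successes, the container name ubuntu2204 and the podman --remote exec invocation in the usage comment are assumptions for illustration only.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_once(cmd):
    """Run cmd once and return (returncode, stdout, stderr) as text."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return proc.returncode, proc.stdout, proc.stderr


def find_silent_successes(cmd, attempts=50, workers=1):
    """Return the indices of runs that exited 0 but wrote nothing to stdout.

    workers > 1 runs several attempts concurrently, roughly imitating
    several instances being handled at the same time.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so index i is the i-th attempt.
        results = list(pool.map(lambda _: run_once(cmd), range(attempts)))
    return [
        i
        for i, (rc, out, _err) in enumerate(results)
        if rc == 0 and out.strip() == ""
    ]


# Hypothetical usage (container name and command are assumptions):
#   cmd = ["podman", "--remote", "exec", "ubuntu2204", "/bin/sh", "-c", "echo ~"]
#   bad = find_silent_successes(cmd, attempts=50, workers=4)
#   print(f"{len(bad)} of 50 runs had RC 0 and empty stdout: {bad}")
```

Running it with workers set to the number of instances might get closer to what Molecule does than a purely sequential loop, since the failure seems to depend on how many instances are handled and on timing. That is only a guess.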
Conclusion
This is incredibly weird.
The issue does not seem to be with Molecule, but as soon as I eliminate Molecule from my test setup, I don't see the issue anymore. That's why I'm opening the issue here. I tried more things than I have written down here, but I wrote down what I think is most interesting/important, as the text was getting very long already...
I mainly have no idea how to debug this further.