ansible / awx

AWX provides a web-based user interface, REST API, and task engine built on top of Ansible. It is one of the upstream projects for Red Hat Ansible Automation Platform.

Project sync fails with "OCI runtime attempted to invoke a command that was not found". #15404

Open fs30000 opened 3 months ago

fs30000 commented 3 months ago

Bug Summary

Fresh install of AWX 24.6.1 on Rocky 9.4.

When syncing a project from Bitbucket, I got this error:

Error: crun: writing file `/sys/fs/cgroup/libpod_parent/libpod-7e3548e80158e27d349ee7db1ef6a83f4db901135c8393da7e43646db0993fb2/cgroup.procs`: No such file or directory: OCI runtime attempted to invoke a command that was not found

AWX version

24.6.1

Installation method

docker development environment

Modifications

no

Ansible version

No response

Operating system

Rocky 9.4

Web browser

Firefox

Steps to reproduce

Create a project with the type git, with credentials, etc. Try to sync it.

Expected results

To work.

Actual results

Error: crun: writing file /sys/fs/cgroup/libpod_parent/libpod-7e3548e80158e27d349ee7db1ef6a83f4db901135c8393da7e43646db0993fb2/cgroup.procs: No such file or directory: OCI runtime attempted to invoke a command that was not found

Additional information

No response

brad95411 commented 2 months ago

Not that this is necessarily helpful, but I am having a nearly identical error running AWX 24.6.1 on docker on Fedora 39, and accessing the web interface via Chrome.

fs30000 commented 2 months ago

Anyone?

fs30000 commented 2 months ago

Same error on Fedora 40. With these commands:

git clone - 23.3.1 export COMPOSE_UP_OPTS=-d RECEPTOR_IMAGE=quay.io/ansible/receptor:v1.4.8 COMPOSE_TAG=release_4.5

brad95411 commented 2 months ago

I've tried a few more versions with no success. Suspect it's some setup problem.

I have read in a few places that these issues can happen when both the outer and the inner container engine use overlayfs.

I tried changing the storage configuration of the inner container engine (i.e. podman) to use either vfs or btrfs, but I just got errors saying it could not find either of those drivers.

If I get some time, I will try to set up a podman instance on a VM and set it up so the storage driver is something other than overlayfs and try again. Hopefully get to it some time this weekend, but if someone is itching for an experiment by all means take the idea and run with it.

brad95411 commented 2 months ago

Update:

Fedora 39, AWX 24.5.0 (to keep the UI stuff constrained inside the container), Docker running with vfs storage driver

It's not working, but the error is clearly different. The example job will not start, and the example project will not sync. The error on project sync is shown below:

Error: container create failed (no logs from conmon): conmon bytes "": readObjectStart: expect { or n, but found , error found in #0 byte of ...||..., bigger context ...||...

I am at a loss at the moment. Even if this did work, vfs is not exactly a great solution to the problem based on what I've read. If I come up with another idea I'll try it and post about it, but for now I think I'm just going to focus on re-familiarizing myself with ansible-navigator. AWX is helpful because I use AAP all the time for work, but I can get by with navigator for my personal purposes.

fs30000 commented 2 months ago

I have tried older versions, even with different receptor and compose tags for older awx_devel versions. There is always some error showing up.

When I don't get the error from this issue, I get this:

https://forum.ansible.com/t/error-current-system-boot-id-differs-from-cached-boot-id/7898

I have pulled out all of my hair now.

avalou commented 2 months ago

I have the very same issue with the following versions: 24.6.1, 24.5.0, 23.5.0. Does anyone have any clue what is happening? I have been using and installing AWX for 2+ years now and I never ran into this issue.

Edit : we fixed the issue with a colleague of mine. Details are incoming !

fs30000 commented 2 months ago

> I have the very same issue with the following version : 24.6.1, 24.5.0, 23.5.0. Anyone have any clue what is happening ? I have been using and installing AWX for 2+ years now and I never ran into this issue.
>
> Edit : we fixed the issue with a colleague of mine. Details are incoming !

Please share mate!

avalou commented 2 months ago

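We are still not sure what solved the issue, so here is what we did:

* downgraded podman version in tools_awx_1 container

* downgraded runc version in tools_awx_1 container

* set `cgroup` to `host` in the compose file under `/tools/docker-compose/_sources/` then rebuilt containers
  At this point the issue was pretty much resolved, but we were not satisfied by this solution that we considered unsafe so we kept digging and removed the `cgroup` parameter

* downgraded docker engine version on the host machine from 27.1.2 to 26.0.0
  This is what seemed to fix the issue. BUT we are not sure what really worked, because when we realised the issue was fixed, we rolled back to the latest docker engine version (in this case v27.1.2, the latest available in apt repositories) and despite that we failed to reproduce the issue.

I will keep you posted if we have any more clue about what happened. 🤷‍♀️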

fs30000 commented 2 months ago

> We are still not sure what solved the issue, so here is what we did :
>
> * downgraded podman version in tools_awx_1 container
>
> * downgraded runc version in tools_awx_1 container
>
> * set `cgroup` to `host` in the compose file under `/tools/docker-compose/_sources/` then rebuilt containers
>   At this point the issue was pretty much resolved, but we were not satisfied by this solution that we considered unsafe so we kept digging and removed the `cgroup` parameter
>
> * downgraded docker engine version on the host machine from 27.1.2 to 26.0.0
>   This is what seemed to fix the issue. BUT we are not sure what really worked because when we realised the issue was fixed, we rolled back to the latest docker engine version (in this case v27.1.2, the latest available in apt repositories) and despite that we failed to reproduce the issue.
>
> I will keep you posted if we have any more clue about what happened. 🤷‍♀️

Wait, are you using docker dev version or K8s?

avalou commented 2 months ago

Yes we are using the dev version deployed with docker compose, and it has been working perfectly for 2+ years, with the notable exception of the current topic.

brad95411 commented 2 months ago

Updating on my testing progress.

I am running plain docker, no k8s or anything.

I downgraded crun to 1.14.3-1 from 1.16.1-1 in the Dockerfile jinja template, no change.

I left crun at 1.14, and downgraded podman to 2:5.1.1-1 from 2:5.1.1-1 in the Dockerfile jinja template, no change.

Prior to doing any testing I verified manually with dnf that the versions had changed.
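
For anyone repeating this check, here is a minimal sketch of how the inner package versions could be confirmed from the host. It assumes the dev container is named `tools_awx_1` (the name used elsewhere in this thread) and that the image is RPM-based, so `rpm -q` is available inside it. The function name `inner_versions` is made up for illustration.

```shell
#!/usr/bin/env bash
# Print the podman and crun package versions installed inside the AWX dev container.
# Assumption: the container is named tools_awx_1; pass another name as the first argument.
inner_versions() {
  local container="${1:-tools_awx_1}"
  if ! command -v docker >/dev/null 2>&1; then
    echo "docker CLI not found" >&2
    return 1
  fi
  # rpm -q prints name-version-release for each installed package.
  docker exec "$container" rpm -q podman crun
}

# Usage:
#   inner_versions             # default container name
#   inner_versions my_awx      # hypothetical alternative name
```

Running it before and after rebuilding the image gives a quick way to see whether a pin in the Dockerfile template actually took effect.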

If anyone has achieved any solidity in what has fixed the problem for them and can provide explicit instructions please do. I am still going to keep trying things when I have time, but I feel I may be fighting a losing battle at the moment.

I've not tried a downgrade of the outer docker engine at the moment simply because I have other containers running where I'm doing work now, and would need to set up a new VM to run an additional docker instance that I could more comfortably mess with.

brad95411 commented 2 months ago

Another update.

My docker version was docker-ce-3:26.1.1-1, I guess I didn't realize I was running an older major version.

I upgraded to docker-ce-3:27.1.2-1. It didn't seem to make a difference. I am still getting errors. Note that this test is using the crun and podman versions mentioned previously. The current error is shown below:

Error: container create failed (no logs from conmon): conmon bytes "": readObjectStart: expect { or n, but found , error found in #0 byte of ...||..., bigger context ...||...

brad95411 commented 2 months ago

I have not had any epiphanies with regards to this issue. I've had to spin down my attempts because of some upgrades I'm making and needing a stable environment while those are going on.

If anyone has any ideas or concrete solutions that have worked for you, please let me know.

brad95411 commented 1 month ago

Updating in hopes of keeping this on folks' radar. I still haven't been able to solve this.

ibcht commented 1 month ago

> We are still not sure what solved the issue, so here is what we did :
>
> * downgraded podman version in tools_awx_1 container
>
> * downgraded runc version in tools_awx_1 container
>
> * set `cgroup` to `host` in the compose file under `/tools/docker-compose/_sources/` then rebuilt containers
>   At this point the issue was pretty much resolved, but we were not satisfied by this solution that we considered unsafe so we kept digging and removed the `cgroup` parameter
>
> * downgraded docker engine version on the host machine from 27.1.2 to 26.0.0
>   This is what seemed to fix the issue. BUT we are not sure what really worked because when we realised the issue was fixed, we rolled back to the latest docker engine version (in this case v27.1.2, the latest available in apt repositories) and despite that we failed to reproduce the issue.
>
> I will keep you posted if we have any more clue about what happened. 🤷‍♀️

Similar issue for me, on a dev environment installed with Docker Compose, and the solution does indeed lie in overriding the `cgroup` parameter to `host` in the docker-compose.yml file. It might be related to how containerd decides the default cgroup namespace configuration, which could have changed at some point. From https://docs.docker.com/reference/compose-file/services/#cgroup: "When unset, it is the container runtime's decision to select which cgroup namespace to use, if supported".

Docker 27.2.1 containerd 1.7.21 cgroup v2
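
To make reports like the one above easier to compare, the following is a rough sketch for collecting the same details on a Linux host. It assumes GNU `stat` and a Docker CLI recent enough to report `CgroupVersion` in `docker info`. The helper name `cgroup_version_from_fstype` is hypothetical.

```shell
#!/usr/bin/env bash
# Map the filesystem type mounted at /sys/fs/cgroup to a cgroup version label.
cgroup_version_from_fstype() {
  case "$1" in
    cgroup2fs) echo "v2" ;;       # unified hierarchy
    tmpfs)     echo "v1" ;;       # legacy or hybrid hierarchy
    *)         echo "unknown" ;;
  esac
}

# GNU stat: -f queries the filesystem, %T prints its type in human-readable form.
fstype="$(stat -fc %T /sys/fs/cgroup 2>/dev/null || echo missing)"
echo "host cgroup: $(cgroup_version_from_fstype "$fstype") ($fstype)"

if command -v docker >/dev/null 2>&1; then
  # Engine version, plus the cgroup driver and version the daemon reports.
  docker version --format 'docker engine: {{.Server.Version}}' || true
  docker info --format 'cgroup driver: {{.CgroupDriver}}, cgroup version: {{.CgroupVersion}}' || true
fi
```

Posting this output alongside the error message should help narrow down whether the failure tracks the engine version, the cgroup version, or neither.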

a-haurylau commented 3 weeks ago

Seeing the same, starting from AWX 24.4.1. I suspect that this is caused by the podman upgrade in the AWX image from 4.x to 5.x. For us, the solution was to add `cgroup: host` to the AWX docker-compose.yaml (https://github.com/docker/compose/issues/8167#issuecomment-1791084705).
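
The `cgroup: host` workaround reported in this thread could be tried without editing the generated compose file, along these lines. This is only a sketch: it assumes the AWX service in the generated file is named `awx_1`, that the file lives at `tools/docker-compose/_sources/docker-compose.yml` relative to the checkout, and that the environment is started by calling `docker compose` directly. The override file name is made up.

```shell
#!/usr/bin/env bash
# Write a small Compose override that sets `cgroup: host` on the AWX service.
# Assumptions: service name awx_1; override file name chosen for illustration only.
OVERRIDE_FILE="${OVERRIDE_FILE:-docker-compose.cgroup-override.yml}"

cat > "$OVERRIDE_FILE" <<'EOF'
services:
  awx_1:
    cgroup: host
EOF

echo "Wrote $OVERRIDE_FILE; one way to apply it:"
echo "  docker compose -f tools/docker-compose/_sources/docker-compose.yml -f $OVERRIDE_FILE up -d"
```

Keep in mind the concern raised earlier in the thread: running the container in the host cgroup namespace was considered unsafe by one reporter, so this is probably better treated as a workaround for a dev environment than as a fix.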

arlion-dev commented 1 week ago

Same error here