ansible / awx

AWX provides a web-based user interface, REST API, and task engine built on top of Ansible. It is one of the upstream projects for Red Hat Ansible Automation Platform.
Other
13.5k stars 3.34k forks source link

AWX v24.3.0 update breaks deployment (psycopg connection issue?) #15136

Closed ilbarone87 closed 2 weeks ago

ilbarone87 commented 3 weeks ago

Please confirm the following

Bug Summary

Hello after updating to Awx-operator 2.16.0 with awx image 2.43.0 the Awx-Task pods is not able to start anymore and it throws these errors

[wait-for-migrations] Waiting for database migrations...
[wait-for-migrations] Attempt 1
Instance Group already registered controlplane
Instance Group already registered default
Successfully registered instance None
(changed: True)
2024-04-23 23:36:25,615 INFO RPC interface 'supervisor' initialized
2024-04-23 23:36:25,615 INFO RPC interface 'supervisor' initialized
2024-04-23 23:36:25,615 CRIT Server 'unix_http_server' running without any HTTP authentication checking
2024-04-23 23:36:25,615 CRIT Server 'unix_http_server' running without any HTTP authentication checking
2024-04-23 23:36:25,615 INFO supervisord started with pid 6
2024-04-23 23:36:25,615 INFO supervisord started with pid 6
2024-04-23 23:36:26,619 INFO spawned: 'superwatcher' with pid 36
2024-04-23 23:36:26,619 INFO spawned: 'superwatcher' with pid 36
2024-04-23 23:36:26,621 INFO spawned: 'dispatcher' with pid 37
2024-04-23 23:36:26,621 INFO spawned: 'dispatcher' with pid 37
2024-04-23 23:36:26,624 INFO spawned: 'callback-receiver' with pid 38
2024-04-23 23:36:26,624 INFO spawned: 'callback-receiver' with pid 38
2024-04-23 23:36:26,641 INFO spawned: 'wsrelay' with pid 39
2024-04-23 23:36:26,641 INFO spawned: 'wsrelay' with pid 39
READY
2024-04-23 23:36:27,647 INFO success: superwatcher entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2024-04-23 23:36:27,647 INFO success: superwatcher entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2024-04-23 23:36:27,954 INFO     [-] awx.main.dispatch Running worker dispatcher listening to queues ['tower_broadcast_all', 'tower_settings_change', 'awx-task-99dc479f7-szzdp']
double free or corruption (out)
2024-04-23 23:36:28,011 INFO     [-] awx.main.commands.run_callback_receiver Callback receiver started with pid=38
2024-04-23 23:36:28,023 INFO     [-] awx.main.wsrelay Active instance with hostname awx-task-99dc479f7-szzdp is registered.
2024-04-23 23:36:28,089 WARN exited: dispatcher (terminated by SIGABRT (core dumped); not expected)
2024-04-23 23:36:28,089 WARN exited: dispatcher (terminated by SIGABRT (core dumped); not expected)
2024-04-23 23:36:28,089 WARN exited: wsrelay (terminated by SIGSEGV (core dumped); not expected)
2024-04-23 23:36:28,089 WARN exited: wsrelay (terminated by SIGSEGV (core dumped); not expected)
2024-04-23 23:36:29,091 INFO spawned: 'dispatcher' with pid 67
2024-04-23 23:36:29,091 INFO spawned: 'dispatcher' with pid 67
2024-04-23 23:36:29,092 INFO spawned: 'wsrelay' with pid 68
2024-04-23 23:36:29,092 INFO spawned: 'wsrelay' with pid 68
2024-04-23 23:36:30,307 INFO     [-] awx.main.wsrelay Active instance with hostname awx-task-99dc479f7-szzdp is registered.
2024-04-23 23:36:30,313 INFO     [-] awx.main.dispatch Running worker dispatcher listening to queues ['tower_broadcast_all', 'tower_settings_change', 'awx-task-99dc479f7-szzdp']
double free or corruption (out)
double free or corruption (out)

After downgrading to 2.15 it starts to work again. I have awx deployed on rke2 1.27 stable version. Awx is using an external Postgres 15.1 version

AWX version

2.43.0

Select the relevant components

Installation method

kubernetes

Modifications

no

Ansible version

No response

Operating system

Ubuntu 22.04

Web browser

Safari

Steps to reproduce

Upgrade to Awx-oprator 2.16.0 and awx 2.43.0

Expected results

Awx to start correctly

Actual results

Awx-task pod in crash

Additional information

No response

ryantking commented 3 weeks ago

I think it has to do with this bug: https://www.postgresql.org/message-id/CX9SU44GH3P4.17X6ZZUJ5D40N@neon.tech

After some poking around I was able to find that the segfault occurs when creating the psycopg connection, and confirmed that its a known bug in this stackoverflow: https://stackoverflow.com/questions/77619281/double-free-or-corruption-out-from-psycopg2-connect-using-python-3-11-6

Nyralei commented 3 weeks ago

Observed the same

gundalow commented 3 weeks ago

@ilbarone87 @Nyralei thank you for reporting this and the troubleshooting you've done already.

We will look at this as a priority.

ilbarone87 commented 3 weeks ago

If this can help, have digged a little bit more into the issue and postgres was throwing this error:

{
  "@timestamp": [
    "2024-04-23T21:26:31.527Z"
  ],
  "container": [
    "postgres"
  ],
  "message": [
    "{\"level\":\"info\",\"ts\":\"2024-04-23T21:26:31Z\",\"logger\":\"postgres\",\"msg\":\"record\",\"logging_pod\":\"postgres-cluster-awx-1\",\"record\":{\"log_time\":\"2024-04-23 21:26:31.526 UTC\",\"process_id\":\"956926\",\"connection_from\":\"10.42.1.189:38644\",\"session_id\":\"66282787.e99fe\",\"session_line_num\":\"1\",\"session_start_time\":\"2024-04-23 21:26:31 UTC\",\"transaction_id\":\"0\",\"error_severity\":\"FATAL\",\"sql_state_code\":\"28000\",\"message\":\"no PostgreSQL user name specified in startup packet\",\"backend_type\":\"not initialized\",\"query_id\":\"0\"}}"
  ],
  "namespace": [
    "cnpg-system"
  ]
dmzoneill commented 3 weeks ago

Hi folks, So I have reviewed the source on postgresql website for the latest releases, and they are patched in relation to the linked bug. That's said we are still trying to confirm the source of the issue you have indicated.

As for downgrading awx-operator to 2.15 to resolve this issue, I can't see any difference between 2.15 and 2.16 would cause this issue. I suspect we are missing part of the story here. https://gist.github.com/dmzoneill/497746f38c5786c96e8859f1131667af

What version of AWX did you upgrade from?

I'm using

bash-5.1$ rpm -qa | grep post
postgresql-private-libs-15.2-1.module_el9+264+92dde3f0.x86_64
postgresql-15.2-1.module_el9+264+92dde3f0.x86_64
postgresql-server-15.2-1.module_el9+264+92dde3f0.x86_64
postgresql-contrib-15.2-1.module_el9+264+92dde3f0.x86_64
bash-5.1$ pip3.11 list | grep cop
psycopg                     3.1.18
bash-5.1$ python3.11 --version
Python 3.11.7

If you could provide more specific versioning, that may help, thank you

Nyralei commented 3 weeks ago

@dmzoneill I think what @ilbarone87 actually means by downgrading to 2.15 version of awx-operator is that by downgrading operator awx image tag changes to 24.2.0 too. So the issue lies in awx 24.3.0 itself and not operator. I observe the same error - "double free or corruption (out)" when changing image_version to 24.3.0 from 24.2.0.

ilbarone87 commented 3 weeks ago

@dmzoneill what @Nyralei is saying is correct, the problem is on the awx image 24.2.0, that's why I posted the issue here and not in the awx-operator repo. The mention of the operator version was just because they are bundled together.

dmzoneill commented 3 weeks ago

Digging through the diff doesn't indicate a change this would cause this behaviour. If you can continue to provide more info, that would be great. thanks

git diff 24.2.0 24.3.0
mensler commented 3 weeks ago

We have the same problem with 23.9.0 and the docker-compose setup, re-installed yesterday.

My guess is that is has to do with the base image, where OpenSSL got updated to 3.2.1 :thinking:

kurokobo commented 3 weeks ago

This difference?

$ docker run --rm -it quay.io/ansible/awx:24.2.0 dnf list installed openssl*
Installed Packages
openssl.x86_64                             1:3.0.7-27.el9                        @baseos
openssl-libs.x86_64                        1:3.0.7-27.el9                        @System

$ docker run --rm -it quay.io/ansible/awx:24.3.0 dnf list installed openssl*
Installed Packages
openssl.x86_64                             1:3.2.1-1.el9                        @baseos
openssl-libs.x86_64                        1:3.2.1-1.el9                        @System
fosterseth commented 3 weeks ago

quay.io/fosterseth/awx:openssl307

here is an image based on awx v24.3.0 but with a downgraded openssl

if someone wants to test that image out, that would be useful feedback

Nyralei commented 3 weeks ago

@fosterseth It works, but now I encounter error with migrations mentioned in issue https://github.com/ansible/awx/issues/15137

fosterseth commented 3 weeks ago

@Nyralei thanks for quickly testing that.

I updated that image to pull in Alan's commit, if you want to re-pull and deploy again

Nyralei commented 2 weeks ago

@fosterseth Updating to 24.3.1 fixed the issue

ilbarone87 commented 2 weeks ago

Solved for me too. Will close the issue.