Open sylvain-de-fuster opened 6 months ago
@sylvain-de-fuster there was further work around Postgres done in recent AWX releases. Can you upgrade to the latest AWX release and see if that resolves the issue?
Hello,
Thanks for your answer.
I had already updated to 24.2.0. Regarding the Postgres work: did you mean releases after that version? At first sight, I don't see any Postgres-related entries in the changelogs for 24.3.0 or 24.3.1.
Anyway, I updated to the latest release (24.3.1 at the time of writing). My checks are below:
• Recurring error in the task container of the AWX task pod: [...] min_value in DecimalField should be Decimal type. [...]
Only that line; I found no other information about it and don't really know where to look. The error appears even when there is no activity on AWX. The same line also shows up in the web container logs of the web pod, but less regularly. It is not specifically related to my checks, but I noticed it, so FYI.
• Switchover tests
AWX now restarts the container in the web pod (it didn't on 24.2.0).
The "simple task playbook case" (sleep command on localhost) now fails (it didn't on 24.2.0).
See the awx_web_container logs.
• Fake crash
Same behavior as before.
Bug Summary
Our environment :
We are working on strengthening our dev platform. The final goal is an AWX platform that is more responsive, available, and manageable in the face of growing usage, maintenance needs, and potential incidents.
• First step
To avoid hitting issues already fixed in newer versions, we started by upgrading our infrastructure from 23.3.0 to 24.2.0 and reinstalling our external workers with receptor 1.4.5.
• Second step
Externalize our Postgres database. The externalization itself was pretty easy. We are using a Patroni cluster (v3.2.2) of two Postgres instances (v16.1), one leader and one streaming replica, with a VIP on top.
Our issues show up in our behavior tests: we observed bad behavior during switchover and failover.
After checking https://github.com/ansible/awx/issues/13505 and https://github.com/ansible/awx-operator/pull/1393, we applied the corresponding Postgres keepalive parameters. This helped a lot, but some issues remain.
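To make the effect of the keepalive parameters concrete, here is a minimal sketch of how the three TCP keepalive timers combine into a rough upper bound on how long an idle, dead connection can go unnoticed. The option names (`keepalives`, `keepalives_idle`, `keepalives_interval`, `keepalives_count`) are standard libpq connection parameters; the values, the host name `pg-vip.example.internal`, and the helper names are assumptions for illustration, not the settings used in this environment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Keepalive:
    """TCP keepalive timers, named after the libpq connection parameters."""
    idle: int      # keepalives_idle: seconds of inactivity before the first probe
    interval: int  # keepalives_interval: seconds between unanswered probes
    count: int     # keepalives_count: unanswered probes before the socket is closed

    def libpq_options(self) -> dict:
        # keepalives=1 turns client-side TCP keepalives on.
        return {
            "keepalives": 1,
            "keepalives_idle": self.idle,
            "keepalives_interval": self.interval,
            "keepalives_count": self.count,
        }

    def worst_case_detection_seconds(self) -> int:
        # Approximate bound for an idle connection: wait `idle` seconds,
        # then send `count` probes spaced `interval` seconds apart.
        return self.idle + self.interval * self.count


def to_conninfo(host: str, dbname: str, ka: Keepalive) -> str:
    """Build a libpq-style key=value connection string (hypothetical helper)."""
    parts = {"host": host, "dbname": dbname, **ka.libpq_options()}
    return " ".join(f"{key}={value}" for key, value in parts.items())


if __name__ == "__main__":
    ka = Keepalive(idle=5, interval=5, count=5)  # illustrative values only
    print(to_conninfo("pg-vip.example.internal", "awx", ka))
    print("dead idle connection detected after about",
          ka.worst_case_detection_seconds(), "seconds")
```

Note that keepalive probes only cover idle connections. A connection with unacknowledged data in flight when the leader disappears is governed by other TCP timeouts, which may be one reason keepalives alone do not cover every switchover or failover case.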
Our current tests (each sub-test was run against a healthy platform):
• Switchover
Without activity
Simple activity in progress (one sleep task on localhost)
"Complex" activity in progress (one job with multiple tasks)
• Crash simulation of the postgres leader
Without activity
Simple activity in progress (one sleep task on localhost)
"Complex" activity in progress (one job with multiple tasks)
The connection doesn't seem to be re-established automatically, and no container is restarted. A rollout restart of the task pod restores the ability to execute new jobs. However, the health checks launched earlier are still hanging (health_check_pending true). The only way I found to clear this hang is to reinstall the instance. If no health check had been run on a worker beforehand, a new health check on it works and completes correctly.
After checking the various logs, I can't find why new jobs can't be launched or why the health checks hang indefinitely.
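To track which workers are affected after each test, a small helper can list the instances whose health check is still pending. This is only a sketch: it assumes the instance list has already been fetched from the AWX API and parsed into a dict with a `results` list whose entries carry `hostname` and `health_check_pending` fields. The sample payload and host names are made up.

```python
from typing import List


def stuck_health_checks(payload: dict) -> List[str]:
    """Return hostnames of instances whose health check is still pending.

    `payload` is assumed to be the parsed JSON of an instance listing,
    shaped like {"results": [{"hostname": ..., "health_check_pending": ...}]}.
    """
    return sorted(
        instance["hostname"]
        for instance in payload.get("results", [])
        if instance.get("health_check_pending")
    )


if __name__ == "__main__":
    # Hypothetical sample data, not taken from the environment in this issue.
    sample = {
        "results": [
            {"hostname": "worker-b.example.internal", "health_check_pending": True},
            {"hostname": "worker-a.example.internal", "health_check_pending": True},
            {"hostname": "worker-c.example.internal", "health_check_pending": False},
        ]
    }
    print(stuck_health_checks(sample))
```

Running this before and after a switchover or crash test would show whether the set of hanging workers matches the set that had a health check in flight when the leader changed.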
AWX Operator version
2.7.0 and 2.15.0
AWX version
23.3.0 and 24.2.0
Kubernetes platform
kubernetes
Kubernetes/Platform version
k3s v1.25.4+k3s1
Modifications
no
Steps to reproduce
Expected results
Actual results