ansible / awx

AWX provides a web-based user interface, REST API, and task engine built on top of Ansible. It is one of the upstream projects for Red Hat Ansible Automation Platform.

awx-postgres-0 have CrashLoopBackOff STATUS. #9926

Open gymzang opened 3 years ago

gymzang commented 3 years ago
ISSUE TYPE
SUMMARY
ENVIRONMENT

Installed by following the guide below. https://github.com/ansible/awx/blob/devel/INSTALL.md

STEPS TO REPRODUCE

Hi. Yesterday I got an "Internal Server Error" from AWX. I checked, and awx-postgres-0 has a CrashLoopBackOff status.

EXPECTED RESULTS

The awx-postgres-0 pod should also be in the Running state.

ACTUAL RESULTS

$ minikube kubectl get pods
NAME                               READY   STATUS             RESTARTS   AGE
pod/awx-6f7bd969db-pcczn           4/4     Running            0          2m13s
pod/awx-operator-57bcb58f5-5lzw9   1/1     Running            0          6m58s
pod/awx-postgres-0                 0/1     CrashLoopBackOff   4          2m19s

ADDITIONAL INFORMATION

I got this log from awx-postgres-0:

initdb: error: directory "/var/lib/postgresql/data/pgdata" exists but is not empty

I think it's the same issue as in the link below: https://groups.google.com/g/awx-project/c/9j81DcyeWJY

Isn't this solved? Please tell me how to fix it!

shanemcd commented 3 years ago

Can you post the output of:

kubectl describe pod/awx-postgres-0
gymzang commented 3 years ago

Thanks for the reply. The output is as follows.

# sudo minikube kubectl describe pod/awx-postgres-0
Name:         awx-postgres-0
Namespace:    default
Priority:     0
Node:         gerran-awx-test.novalocal/192.168.0.19
Start Time:   Sat, 03 Apr 2021 11:00:32 +0900
Labels:       app=awx-postgres
              controller-revision-hash=awx-postgres-566c99dd44
              statefulset.kubernetes.io/pod-name=awx-postgres-0
Annotations:  <none>
Status:       Running
IP:           172.17.0.5
IPs:
  IP:           172.17.0.5
Controlled By:  StatefulSet/awx-postgres
Containers:
  postgres:
    Container ID:   docker://625cea3889c1bd2e08553e00c7a4514f0f8b65684e0f9f398adf9044bc13c43b
    Image:          postgres:12
    Image ID:       docker-pullable://docker.io/postgres@sha256:2f3f78532c9cc5435d1cf9c8f5fb1409f9abd43e5b728371e4b031b1eac84b9a
    Port:           5432/TCP
    Host Port:      0/TCP
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Mon, 19 Apr 2021 09:35:15 +0900
      Finished:     Mon, 19 Apr 2021 09:35:15 +0900
    Ready:          False
    Restart Count:  1673
    Environment:
      POSTGRES_DB:                <set to the key 'database' in secret 'awx-postgres-configuration'>  Optional: false
      POSTGRES_USER:              <set to the key 'username' in secret 'awx-postgres-configuration'>  Optional: false
      POSTGRES_PASSWORD:          <set to the key 'password' in secret 'awx-postgres-configuration'>  Optional: false
      PGDATA:                     /var/lib/postgresql/data/pgdata
      POSTGRES_INITDB_ARGS:       --auth-host=scram-sha-256
      POSTGRES_HOST_AUTH_METHOD:  scram-sha-256
    Mounts:
      /var/lib/postgresql/data from postgres (rw,path="data")
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-wjxlg (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  postgres:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  postgres-awx-postgres-0
    ReadOnly:   false
  default-token-wjxlg:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-wjxlg
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason   Age                       From     Message
  ----     ------   ----                      ----     -------
  Normal   Pulled   22m (x1669 over 5d22h)    kubelet  Container image "postgres:12" already present on machine
  Warning  BackOff  2m49s (x39460 over 5d22h) kubelet  Back-off restarting failed container

shanemcd commented 3 years ago

I'm not seeing anything obvious here. If this is a test environment, can you try again from the beginning after running:

$ minikube delete --purge
Achim-Hentschel commented 3 years ago

@shanemcd We are experiencing the same behaviour and I just tested your suggestion. Unfortunately the result is still the same. Logs are giving initdb: error: directory "/var/lib/postgresql/data/pgdata" exists but is not empty. Anything else I could test?

gymzang commented 3 years ago

@shanemcd When newly installed, there were no issues at first. The issue occurs after a few days.

gymzang commented 3 years ago

> @shanemcd We are experiencing the same behaviour and I just tested your suggestion. Unfortunately the result is still the same. Logs are giving initdb: error: directory "/var/lib/postgresql/data/pgdata" exists but is not empty. Anything else I could test?

@Achim-Hentschel Have you solved this?

Achim-Hentschel commented 3 years ago

@gymzang Unfortunately not solved by us. I went back to 17.1.0 yesterday, as 18.0.0 and 19.0.0 still seem very bleeding edge and not ready for production yet.

After I completely removed AWX 19.0.0 again (see below), I could successfully set up the system. But when I then tried to execute a task which just gathers hostnames from all hosts in our windows group, that task complained about missing dependencies (win_shell in our case). I then started digging into the concept of Execution Environments, which seems quite new in AWX (introduced in 18.0.0, maybe? I did not dig for the exact version). I also tried to set up my own EE using the awx-ee project at version 0.2.0, and failed with that too: AWX does not seem to be able to use local docker images (I named mine custom-awx-ee when building it with docker build -t custom-awx-ee . from the awx-ee repo). I then set up an EE in AWX 19 specifying exactly this image. When I executed the task, the awx-operator or awx container created a new pod in kubernetes, fine so far. But although I specified not to pull the image in the EE setup, kubernetes tried to pull it from an external repo, so working with local images does not work. Using the standard EE awx-ee 0.2.0 did not work either. Beyond that, awx-ee needs some documentation :)

We tried and failed a lot, and exactly that is what makes me think that AWX 18+ is not yet ready for production. It also feels quite rigid and hard to customize (especially if you are used to the old way: exec -it into the container and simply add a missing python or ansible-galaxy dependency). With the operator in place, I understand why this has been done, because it is very easy to lose such customizations. But for development this is a good way to first test things and then develop the final, working solution.

HaifengSun-Kira commented 3 years ago

I followed the latest install instruction and faced the same bug. Is there any solution yet?

0x4e44 commented 3 years ago

Got the same issue on my end. This error led to a final CreateContainerConfigError, with no restarts.

Might be nice to know: this issue occurred for me when the server was rebooted.

0x4e44 commented 3 years ago

> @shanemcd When newly installed, there were no issues at first. The issue occurs after a few days.

or just a reboot.

AleksejEgorov commented 3 years ago

This problem is still repeating on AWX 19.2.1 deployed by AWX-operator 0.11.0.

# kubectl get pods
NAME                            READY   STATUS             RESTARTS   AGE
awx-866f569c74-7jjd8            4/4     Running            0          5m12s
awx-operator-765db9c478-2ztgr   1/1     Running            0          6m15s
awx-postgres-0                  0/1     CrashLoopBackOff   5          5m21s

# kubectl describe pod awx-postgres-0
Name:         awx-postgres-0
Namespace:    default
Priority:     0
Node:         iqkv-vm-ans-02.epk.local/10.200.0.153
Start Time:   Tue, 22 Jun 2021 14:18:11 +0300
Labels:       app.kubernetes.io/component=database
              app.kubernetes.io/instance=postgres-awx
              app.kubernetes.io/managed-by=awx-operator
              app.kubernetes.io/name=postgres
              app.kubernetes.io/part-of=awx
              controller-revision-hash=awx-postgres-78d8b767c8
              statefulset.kubernetes.io/pod-name=awx-postgres-0
Annotations:  <none>
Status:       Running
IP:           172.17.0.4
IPs:
  IP:           172.17.0.4
Controlled By:  StatefulSet/awx-postgres
Containers:
  postgres:
    Container ID:   docker://856968fd07f67841fe2cf7df5caa482ff81c9ed87494c59f661d88a779823812
    Image:          postgres:12
    Image ID:       docker-pullable://postgres@sha256:1ad9a00724bdd8d8da9f2d8a782021a8503eff908c9413b5b34f22d518088f26
    Port:           5432/TCP
    Host Port:      0/TCP
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Tue, 22 Jun 2021 14:24:00 +0300
      Finished:     Tue, 22 Jun 2021 14:24:00 +0300
    Ready:          False
    Restart Count:  6
    Environment:
      POSTGRESQL_DATABASE:        <set to the key 'database' in secret 'awx-postgres-configuration'>  Optional: false
      POSTGRESQL_USER:            <set to the key 'username' in secret 'awx-postgres-configuration'>  Optional: false
      POSTGRESQL_PASSWORD:        <set to the key 'password' in secret 'awx-postgres-configuration'>  Optional: false
      POSTGRES_DB:                <set to the key 'database' in secret 'awx-postgres-configuration'>  Optional: false
      POSTGRES_USER:              <set to the key 'username' in secret 'awx-postgres-configuration'>  Optional: false
      POSTGRES_PASSWORD:          <set to the key 'password' in secret 'awx-postgres-configuration'>  Optional: false
      PGDATA:                     /var/lib/postgresql/data/pgdata
      POSTGRES_INITDB_ARGS:       --auth-host=scram-sha-256
      POSTGRES_HOST_AUTH_METHOD:  scram-sha-256
    Mounts:
      /var/lib/postgresql/data from postgres (rw,path="data")
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-cltj7 (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  postgres:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  postgres-awx-postgres-0
    ReadOnly:   false
  default-token-cltj7:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-cltj7
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason            Age                     From               Message
  ----     ------            ----                    ----               -------
  Warning  FailedScheduling  7m59s (x2 over 7m59s)   default-scheduler  0/1 nodes are available: 1 pod has unbound immediate PersistentVolumeClaims.
  Normal   Scheduled         7m56s                   default-scheduler  Successfully assigned default/awx-postgres-0 to iqkv-vm-ans-02.epk.local
  Normal   Pulled            6m20s (x5 over 7m56s)   kubelet            Container image "postgres:12" already present on machine
  Normal   Created           6m20s (x5 over 7m56s)   kubelet            Created container postgres
  Normal   Started           6m20s (x5 over 7m56s)   kubelet            Started container postgres
  Warning  BackOff           2m55s (x25 over 7m54s)  kubelet            Back-off restarting failed container

# kubectl logs -f pods/awx-postgres-0
The files belonging to this database system will be owned by user "postgres".
This user must also own the server process.

The database cluster will be initialized with locale "en_US.utf8".
The default database encoding has accordingly been set to "UTF8".
The default text search configuration will be set to "english".

Data page checksums are disabled.

initdb: error: directory "/var/lib/postgresql/data/pgdata" exists but is not empty
If you want to create a new database system, either remove or empty
the directory "/var/lib/postgresql/data/pgdata" or run initdb
with an argument other than "/var/lib/postgresql/data/pgdata".
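When initdb says the directory "exists but is not empty", it can help to see what is actually left on the volume while the pod is crash-looping. A hedged sketch of one way to do that: a throwaway pod that mounts the same PersistentVolumeClaim (the claim name is taken from the describe output above; the pod name "pgdata-inspector" and the file name are made up for this example):

```shell
# Generate a throwaway pod manifest that mounts the Postgres PVC read-write
# at /data, so the leftover pgdata contents can be listed with kubectl exec.
# "pgdata-inspector" is a made-up name; the claimName matches this thread.
cat <<'EOF' > pgdata-inspector.yml
apiVersion: v1
kind: Pod
metadata:
  name: pgdata-inspector
spec:
  containers:
    - name: shell
      image: busybox
      command: ["sleep", "3600"]
      volumeMounts:
        - name: postgres
          mountPath: /data
  volumes:
    - name: postgres
      persistentVolumeClaim:
        claimName: postgres-awx-postgres-0
EOF
```

After kubectl apply -f pgdata-inspector.yml, something like kubectl exec pgdata-inspector -- ls -la /data/pgdata should show what initdb is stumbling over (and whether the cluster files were partially cleaned away).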
clementbey commented 3 years ago

Same with a fresh install of AWX operator 0.13.0 and AWX 19.3.0. No solution?

skerbater commented 3 years ago

Same here... Same with a fresh install of AWX operator 0.13.0 and AWX 19.3.0. No solution?

Bitos33 commented 3 years ago

Hello,

Is there any update on this issue? I am trying to set up a new AWX platform, and for some reason the pods stop working after a couple of days. When restarting minikube, the postgresql pod goes into the CrashLoopBackOff state because it tries to init the database again instead of using the existing data. The only solution is to purge everything and set it up again; same issue with the 0.12 and 0.13 AWX operator. There might be something I missed, but this is so frustrating...

AleksejEgorov commented 3 years ago

Looks like the root of the problem is that the postgres data on the host lives in a temp directory. Maybe it would be better to change the folder through the deployment, but I only excluded the default hostpath-provisioner directory from deletion:

cat <<EOF >/usr/lib/tmpfiles.d/minikube.conf
# Exclude minikube hostpath provisioner

x /tmp/hostpath-provisioner/default
X /tmp/hostpath-provisioner/default/*
EOF

Hope it helps.
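For context on the two line types, from tmpfiles.d(5): x excludes the path (and, for a directory, its contents) from age-based cleaning, while X excludes only the matched paths themselves, not their contents. A sketch of installing the drop-in; it writes to a local directory by default so it can be tried without root, the real target being /etc/tmpfiles.d (local overrides) or /usr/lib/tmpfiles.d:

```shell
# Install the exclusion drop-in. CONF_DIR defaults to a local directory for a
# dry run; on a real host, set CONF_DIR=/etc/tmpfiles.d and run as root.
CONF_DIR="${CONF_DIR:-./tmpfiles.d}"
mkdir -p "$CONF_DIR"
cat <<'EOF' > "$CONF_DIR/minikube.conf"
# Exclude minikube hostpath provisioner from systemd's /tmp cleanup.
# x = ignore the path (and directory contents) during cleaning,
# X = ignore the matched paths themselves, but not their contents.
x /tmp/hostpath-provisioner/default
X /tmp/hostpath-provisioner/default/*
EOF
```

systemd-tmpfiles picks the file up on the next cleanup run; no daemon restart is needed.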

Bitos33 commented 3 years ago

Thank you for the proposition @AleksejEgorov. I will try this out, as I lost the postgresql pod once again today.

Lejooohn commented 2 years ago

Same issue here. After a couple of days I restarted the VM, and the Postgres pod entered the "CrashLoopBackOff" state with the same errors as explained previously:

initdb: error: directory "/var/lib/postgresql/data/pgdata" exists but is not empty
If you want to create a new database system, either remove or empty
the directory "/var/lib/postgresql/data/pgdata" or run initdb
with an argument other than "/var/lib/postgresql/data/pgdata".

Lejooohn commented 2 years ago

Any solution, @shanemcd? Even after I used minikube delete --purge, I got the same error when I tried to deploy again...

Jonathan-Caruana commented 2 years ago

Hi,

This error also occurs with the latest versions of awx-operator (0.15.0) and awx (19.5.0).

After a couple of days, if you restart your awx server, the postgres pod enters the "CrashLoopBackOff" state:

The files belonging to this database system will be owned by user "postgres".
This user must also own the server process.

The database cluster will be initialized with locale "en_US.utf8".
The default database encoding has accordingly been set to "UTF8".
The default text search configuration will be set to "english".

Data page checksums are disabled.

initdb: error: directory "/var/lib/postgresql/data/pgdata" exists but is not empty
If you want to create a new database system, either remove or empty
the directory "/var/lib/postgresql/data/pgdata" or run initdb
with an argument other than "/var/lib/postgresql/data/pgdata".

No changes had been made previously.

Here is the describe output :

Name:         postgres-0
Namespace:    test
Priority:     0
Node:         XXXXX
Start Time:   Thu, 09 Dec 2021 10:40:32 +0100
Labels:       app.kubernetes.io/component=database
              app.kubernetes.io/instance=postgres-integration
              app.kubernetes.io/managed-by=awx-operator
              app.kubernetes.io/name=postgres
              app.kubernetes.io/part-of=integration
              controller-revision-hash=integration-postgres-5fbc5cf854
              statefulset.kubernetes.io/pod-name=integration-postgres-0
Annotations:  <none>
Status:       Running
IP:           172.17.0.5
IPs:
  IP:           172.17.0.5
Controlled By:  StatefulSet/integration-postgres
Containers:
  postgres:
    Container ID:   docker://2b5b03d387d2e525edae09aa84e2ff30923e16ab1b18c6bd5fcd3873dc0777b0
    Image:          postgres:12
    Image ID:       docker-pullable://postgres@sha256:0854202db0b3378c46909bab43a85b01dc1b92cc44520480e47dd4fbc22714ee
    Port:           5432/TCP
    Host Port:      0/TCP
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Mon, 20 Dec 2021 15:07:05 +0100
      Finished:     Mon, 20 Dec 2021 15:07:05 +0100
    Ready:          False
    Restart Count:  48
    Environment:
      POSTGRESQL_DATABASE:        <set to the key 'database' in secret 'test-postgres-configuration'>  Optional: false
      POSTGRESQL_USER:            <set to the key 'username' in secret 'test-postgres-configuration'>  Optional: false
      POSTGRESQL_PASSWORD:        <set to the key 'password' in secret 'test-postgres-configuration'>  Optional: false
      POSTGRES_DB:                <set to the key 'database' in secret 'test-postgres-configuration'>  Optional: false
      POSTGRES_USER:              <set to the key 'username' in secret 'test-postgres-configuration'>  Optional: false
      POSTGRES_PASSWORD:          <set to the key 'password' in secret 'test-postgres-configuration'>  Optional: false
      PGDATA:                     /var/lib/postgresql/data/pgdata
      POSTGRES_INITDB_ARGS:       --auth-host=scram-sha-256
      POSTGRES_HOST_AUTH_METHOD:  scram-sha-256
    Mounts:
      /var/lib/postgresql/data from postgres (rw,path="data")
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-2vtvz (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  postgres:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  postgres-test-postgres-0
    ReadOnly:   false
  kube-api-access-2vtvz:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason   Age                      From     Message
  ----     ------   ----                     ----     -------
  Normal   Pulled   7m17s (x43 over 3h32m)   kubelet  Container image "postgres:12" already present on machine
  Warning  BackOff  2m17s (x908 over 3h17m)  kubelet  Back-off restarting failed container

I can't think about putting this product in production if this behaviour occurs frequently (and as I saw, many users report this bug).

Any help will be appreciated.

Bitos33 commented 2 years ago

As mentioned by @AleksejEgorov, the trick is to disable auto-cleaning of the temporary directory where the data files are stored. I have had no more crashes since I did this. I agree this is only a workaround and is not acceptable for production. The real question is: is it possible to configure the path for the database pod and use a safe place?
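On configuring the path: the awx-operator spec exposes a postgres_storage_class field, so the managed Postgres PVC can be bound by a provisioner that does not live under /tmp. A hedged sketch ("awx-demo" and "local-path" are placeholder names; a matching StorageClass must already exist in the cluster):

```shell
# Generate an AWX custom resource that requests a specific StorageClass for
# the operator-managed Postgres PVC, keeping data off the /tmp-backed
# hostpath provisioner. "awx-demo" and "local-path" are placeholders.
cat <<'EOF' > awx-demo.yml
apiVersion: awx.ansible.com/v1beta1
kind: AWX
metadata:
  name: awx-demo
spec:
  service_type: nodeport
  postgres_storage_class: local-path
EOF
```

Applied with kubectl apply -f awx-demo.yml once the operator is installed; the operator then creates the Postgres StatefulSet against that StorageClass.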

robinduerhager commented 2 years ago

The workaround of @AleksejEgorov didn't work out for me; my PostgreSQL container crashed today. In https://github.com/docker-library/postgres/issues/263 it is stated that this behavior occurs if the PGDATA environment variable is not set, so Postgres tries to initialize a new database. When the env variable is set, Postgres will skip this step.

However, a docker inspect <postgresql container ID> showed that the environment variable is already set correctly (in my deployment at least).

My system: awx-operator 0.16.0, AWX 19.5.1, Rocky Linux 8.5 with minikube and kubectl.

Jonathan-Caruana commented 2 years ago

@robinduerhager

Have you deployed your pods in the default namespace? If not (like me, I use the "integration" namespace), I had to modify the example provided by @AleksejEgorov like this:

cat /usr/lib/tmpfiles.d/minikube.conf
# Exclude minikube hostpath provisioner

x /tmp/hostpath-provisioner/default
X /tmp/hostpath-provisioner/default/*
x /tmp/hostpath-provisioner/integration
X /tmp/hostpath-provisioner/integration/*

Regards,

robinduerhager commented 2 years ago

Thank you for the hint @Jonathan-Caruana, I didn't know about this. I will test it out immediately :)!

Jonathan-Caruana commented 2 years ago

Hi team,

Any news concerning this issue? Will it be fixed in the next releases?

Regards,

mickael-decastro commented 2 years ago

Hi,

Any news about this issue ?

Best Regards,

Jonathan-Caruana commented 2 years ago

Hi @mickael-decastro

I think this issue is not resolved yet, but on my side, to avoid it, I have connected the awx pod to an external PostgreSQL (on the same server as well).

Hope it helps.
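For anyone trying the same route: the operator supports an unmanaged (external) database through a configuration secret, referenced by postgres_configuration_secret in the AWX spec. A hedged sketch; every hostname, credential, and the secret name below is a placeholder:

```shell
# Generate a secret describing an existing PostgreSQL server. With
# type: unmanaged, the operator skips deploying its own postgres pod.
# All values here are placeholders for illustration only.
cat <<'EOF' > external-postgres.yml
apiVersion: v1
kind: Secret
metadata:
  name: awx-external-postgres-configuration
stringData:
  host: postgres.example.com
  port: "5432"
  database: awx
  username: awx
  password: changeme
  sslmode: prefer
  type: unmanaged
EOF
```

The secret is applied with kubectl and then named in the AWX resource as postgres_configuration_secret: awx-external-postgres-configuration.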

gymzang commented 1 year ago

Hi guys. I am commenting because people keep asking me how to solve it. I just downgraded the version to 17.1.0, and I've been using it for a year without issues.

https://github.com/ansible/awx/releases Of course, the current latest version is 21.9.0, and I see bug fixes and feature upgrades there. However, 17.1.0 is serving us well without any problems or missing features. Regards,