ansible / awx-ee

An Ansible execution environment for AWX project
https://quay.io/ansible/awx-ee

Latest awx-ee image doesn't seem to want to start? #101

Closed: liquidspikes closed this 2 years ago

liquidspikes commented 2 years ago

Hello,

I have been playing a bit with awx-operator (0.16.1) and awx-web (19.5.1) and recently went through the process of deleting my pods and pulling the awx-ee:latest image (which appears to have been updated about 8 hours ago; manifest: ba162e341631).

Unfortunately, I get the following error when the awx-ee container attempts to start:

kubectl logs awx-7c7f97bf75-dpq4t -c awx-ee -n awx
panic: qtls.ClientHelloInfo doesn't match

goroutine 1 [running]:
github.com/marten-seemann/qtls-go1-15.init.0()
        /root/go/pkg/mod/github.com/marten-seemann/qtls-go1-15@v0.1.0/unsafe.go:20 +0x132

This causes the AWX pod to fail in a CrashLoopBackOff state.

Events:
  Type     Reason     Age                  From               Message
  ----     ------     ----                 ----               -------
  Normal   Created    36m                  kubelet            Created container awx-task
  Normal   Started    36m                  kubelet            Started container awx-task
  Normal   Created    36m                  kubelet            Created container awx-web
  Normal   Pulling    36m (x2 over 36m)    kubelet            Pulling image "quay.io/ansible/awx-ee:latest"
  Normal   Created    36m (x2 over 36m)    kubelet            Created container awx-ee
  Normal   Started    36m (x2 over 36m)    kubelet            Started container awx-ee
  Normal   Pulled     36m                  kubelet            Successfully pulled image "quay.io/ansible/awx-ee:latest" in 315.368081ms
  Warning  BackOff    60s (x162 over 36m)  kubelet            Back-off restarting failed container

kubectl get pods -n awx
NAME                                               READY   STATUS             RESTARTS         AGE
awx-operator-controller-manager-6c96d9b446-tczbn   2/2     Running            4 (102m ago)     24h
awx-7c7f97bf75-dpq4t                               3/4     CrashLoopBackOff   23 (3m42s ago)   97m

I am running a bare-metal Kubernetes cluster on Ubuntu 20.04 LTS.

Please let me know if there is anything else I can do to assist. :)

Thanks! :)

liquidspikes commented 2 years ago

I modified the pod spec to point to the previous awx-ee:0.6.0 image and it seems to work great. I believe the latest image just got hosed! Bad upload?

garyhodgson commented 2 years ago

I'm seeing the same problem. In case others have the same difficulty as I did: it is the "control_plane_ee_image" spec param that has to be given explicitly, not the image under ee_images, e.g.

spec:
  image_pull_policy: Always
  control_plane_ee_image: quay.io/ansible/awx-ee:0.6.0
  ee_images:
    - name: my-awx-ee
      image: ...
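The same pin can be applied without editing YAML by merge-patching the AWX custom resource; a sketch, assuming the CR is named awx in the awx namespace (substitute your own names):

```shell
# Merge-patch the AWX spec to pin the control plane EE image to 0.6.0.
# The CR name "awx" and namespace "awx" are assumptions; adjust to your deployment.
PATCH='{"spec": {"control_plane_ee_image": "quay.io/ansible/awx-ee:0.6.0"}}'
kubectl patch awx awx -n awx --type=merge -p "$PATCH"
```

The operator should then reconcile the AWX deployment with the pinned image on its next loop.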

markstoel commented 2 years ago

@garyhodgson, @liquidspikes Which version of the operator are you using? Because with the latest version and awx-ee:0.6.0, it will not start any job pods.

garyhodgson commented 2 years ago

@markstoel quay.io/ansible/awx-operator:0.16.1 - but I only got to the stage of starting AWX - I haven't yet attempted to start a job. Once I do I will reply here.

garyhodgson commented 2 years ago

@markstoel I believe I am seeing the same problem. The Source Control Update job is hanging with status Pending, and no job pod is started.

mario-oberwalder commented 2 years ago

I believe we have the same problem on Openshift 4.8.22

kdelee commented 2 years ago

Looks like it is busted because of receptor:

podman run --rm quay.io/ansible/awx-ee:latest receptor --version

panic: qtls.ClientHelloInfo doesn't match

goroutine 1 [running]:
github.com/marten-seemann/qtls-go1-15.init.0()
        /root/go/pkg/mod/github.com/marten-seemann/qtls-go1-15@v0.1.0/unsafe.go:20 +0x132

People need to pin to 0.6.0 until receptor is fixed... going over there to see what's up.

AlanCoding commented 2 years ago

I believe we may be seeing the same problem with AAP controller VM installs running container group jobs. Watching this.

shanemcd commented 2 years ago

New image has been pushed up which will hopefully stop the bleeding. Will wait for folks to confirm before closing this.
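One way to confirm the fix before redeploying is the same check that surfaced the breakage: run the receptor binary from a fresh pull of the image. A sketch (podman could equally be docker):

```shell
# Pull the refreshed :latest tag and confirm receptor starts cleanly
# instead of panicking with "qtls.ClientHelloInfo doesn't match".
IMAGE=quay.io/ansible/awx-ee:latest
podman pull "$IMAGE"
podman run --rm "$IMAGE" receptor --version
```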

chofstede commented 2 years ago

New version does not start at all. Error in Kubernetes:

Error: cannot find volume "awx-receptor-config" to mount into container "awx-ee"

kdelee commented 2 years ago

@chofstede it's working for me with this as my AWX CR:

apiVersion: awx.ansible.com/v1beta1
kind: AWX
metadata:
  name: awx-prod
  namespace: awx
spec:
  admin_email: admin@localhost
  admin_user: admin
  create_preload_data: true
  garbage_collect_secrets: false
  image: quay.io/ansible/awx
  image_pull_policy: Always
  image_version: latest
  loadbalancer_port: 80
  loadbalancer_protocol: http
  nodeport_port: 30080
  projects_persistence: false
  projects_storage_access_mode: ReadWriteMany
  projects_storage_size: 8Gi
  replicas: 1
  route_tls_termination_mechanism: Edge
  service_type: nodeport
  task_privileged: false

That was a fresh deployment. My previous deployment, which had been crash looping while trying to start awx-ee, fixed itself without my restarting the pod, because the crash loop re-pulled the new :latest tag and the container was able to start.

liquidspikes commented 2 years ago

> @garyhodgson, @liquidspikes Which version of the operator are you using? Because with the latest version and awx-ee:0.6.0, it will not start any job pods.

You are correct. I just saw all the pods start and awx-web come up successfully, and got a little ahead of myself.

> New image has been pushed up which will hopefully stop the bleeding. Will wait for folks to confirm before closing this.

Deleted my pods and then pulled awx-ee:latest, and it appears to be up. Thank you @shanemcd and @kdelee!
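For anyone following along, forcing the re-pull looks roughly like this; the pod name below is the one from this thread, so substitute yours. With image_pull_policy: Always, the replacement pod fetches the fixed :latest on creation.

```shell
# Delete the crashing pod; its Deployment recreates it, and with
# imagePullPolicy: Always the new pod re-pulls awx-ee:latest.
POD=awx-7c7f97bf75-dpq4t   # pod name from this deployment; substitute your own
kubectl delete pod "$POD" -n awx

# Watch until the replacement pod reports 4/4 Ready.
kubectl get pods -n awx -w
```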

alex-kalinowski commented 2 years ago

I am unable to start any jobs with either the updated latest image or pinning to 0.6.0.

I'm deployed in k8s using the awx-operator and noticed the task container is unable to run run_dispatcher, with the following error:

2022-02-03 17:06:27,633 WARNING  [-] awx.main.dispatch.periodic periodic beat started
Instance Group already registered controlplane
Instance Group already registered default
2022-02-03 17:06:27,676 DEBUG    [-] awx.main.dispatch scaling up worker pid:9403
2022-02-03 17:06:27,683 DEBUG    [-] awx.main.dispatch scaling up worker pid:9405
2022-02-03 17:06:27,689 DEBUG    [-] awx.main.dispatch scaling up worker pid:9406
2022-02-03 17:06:27,695 DEBUG    [-] awx.main.dispatch scaling up worker pid:9407
2022-02-03 17:06:27,698 INFO     [-] awx.main.dispatch Running worker dispatcher listening to queues ['tower_broadcast_all', 'awx-7bf495dd7c-cgkdl']
2022-02-03 17:06:27,707 DEBUG    [-] awx.main.tasks Syncing Schedules
2022-02-03 17:06:28,224 DEBUG    [-] awx.main.tasks.system Waited 0.0011806488037109375 seconds to obtain lock name: cluster_policy_lock
2022-02-03 17:06:28,237 DEBUG    [-] awx.main.tasks.system Total instances: 9, available for policy: 9
2022-02-03 17:06:28,240 DEBUG    [-] awx.main.tasks.system Policy percentage, adding Instances [41, 42, 43, 44, 45, 46, 47, 48, 49] to Group controlplane
2022-02-03 17:06:28,240 DEBUG    [-] awx.main.tasks.system Cluster policy no-op finished in 0.0156710147857666 seconds
2022-02-03 17:06:28,244 DEBUG    [-] awx.main.tasks.system Cluster node heartbeat task.
Traceback (most recent call last):
  File "/usr/local/bin/awx-manage", line 9, in <module>
    load_entry_point('awx', 'console_scripts', 'awx-manage')()
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/__init__.py", line 171, in manage
    execute_from_command_line(sys.argv)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/core/management/__init__.py", line 381, in execute_from_command_line
    utility.execute()
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/core/management/__init__.py", line 375, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/core/management/base.py", line 323, in run_from_argv
    self.execute(*args, **cmd_options)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/core/management/base.py", line 364, in execute
    output = self.handle(*args, **options)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/management/commands/run_dispatcher.py", line 62, in handle
    consumer.run()
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/dispatch/worker/base.py", line 149, in run
    self.worker.on_start()
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/dispatch/worker/task.py", line 128, in on_start
    dispatch_startup()
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks/system.py", line 104, in dispatch_startup
    cluster_node_heartbeat()
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks/system.py", line 492, in cluster_node_heartbeat
    inspect_execution_nodes(instance_list)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks/system.py", line 439, in inspect_execution_nodes
    if not any(cmd['WorkType'] == 'ansible-runner' for cmd in ad['WorkCommands'] or []):
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks/system.py", line 439, in <genexpr>
    if not any(cmd['WorkType'] == 'ansible-runner' for cmd in ad['WorkCommands'] or []):
TypeError: string indices must be integers

Does AWX need an upgrade to be compatible with this?

chofstede commented 2 years ago

After deleting the pod and letting it re-deploy one more time, I can now confirm that AWX is working again with awx-ee:latest.