ansible / awx

AWX provides a web-based user interface, REST API, and task engine built on top of Ansible. It is one of the upstream projects for Red Hat Ansible Automation Platform.

[BUG?] ERROR: work type did not expect a signature when running health check in AWX with work-kubernetes in Receptor #14849

Open diademiemi opened 8 months ago

diademiemi commented 8 months ago

Bug Summary

When using the work-kubernetes type as described in the documentation, we get the following error when checking the health of the node from AWX.

ERROR 2024/02/07 00:40:42 : work type did not expect a signature

Does AWX not support the work-kubernetes type yet and the health check is not reporting a readable error for this? The error is quite vague and I'm not sure what the issue is.

Our goal here is to run Receptor in a Kubernetes cluster so we can host execution and/or hop nodes in Kubernetes. I'm not certain whether this is an issue in AWX or in Receptor.

AWX version

23.7.0

Installation method

kubernetes

Modifications

no

Ansible version

No response

Operating system

Ubuntu 22.04

Web browser

No response

Steps to reproduce

The following receptor config is used:

receptor.conf

```yaml
---
- node:
    id: 192.168.21.54

- work-verification:
    publickey: /etc/receptor/work_public_key.pem

- log-level: debug

- control-service:
    service: control
    filename: /tmp/receptor.sock
    permissions: 0660
    tls: tls_server

- tls-server:
    name: tls_server
    cert: /etc/receptor/tls/receptor.crt
    key: /etc/receptor/tls/receptor.key
    clientcas: /etc/receptor/tls/ca/mesh-CA.crt
    requireclientcert: true
    mintls13: False

- tls-client:
    name: tls_client
    cert: /etc/receptor/tls/receptor.crt
    key: /etc/receptor/tls/receptor.key
    rootcas: /etc/receptor/tls/ca/mesh-CA.crt
    insecureskipverify: false
    mintls13: False

- tcp-listener:
    port: 27199
    tls: tls_server

- work-kubernetes:
    worktype: kubeit
    authmethod: kubeconfig
    allowruntimeauth: true
    allowruntimepod: true
    allowruntimeparams: true
    verifysignature: true
```

After starting Receptor and checking the health of the instance, I get the error.

Expected results

The AWX health check should succeed, and AWX should use Receptor to run workloads on the Kubernetes cluster with the kubeit worktype.

If this is not a supported use case yet, I would expect a clearer error message. This one seems quite arbitrary to me and confused us for days.

Actual results

We get an error:

ERROR 2024/02/07 00:40:42 : work type did not expect a signature

This does not seem relevant to what we are trying to achieve. I looked through the code to see what causes it, and it seems to be related to the health check not using the correct work type (more details below).

Additional information

It seems this error occurs because the workType is given as ansible-runner instead of kubeit. I'm not too familiar with the code at work here, but I added some debug statements in Receptor:

func (c *workceptorCommand) processSignature(workType, signature string, connIsUnix, signWork bool) error {
    shouldVerifySignature := c.w.ShouldVerifySignature(workType, signWork)
    fmt.Print("shouldVerifySignature: ", shouldVerifySignature)
    fmt.Print("workType: ", workType)
    fmt.Print("connIsUnix: ", connIsUnix)

    if !shouldVerifySignature && signature != "" {
        return fmt.Errorf("work type did not expect a signature")
    }
    if shouldVerifySignature && !connIsUnix {
        err := c.w.VerifySignature(signature)
        if err != nil {
            return err
        }
    }

    return nil
}

shouldVerifySignature: false
workType: ansible-runner
connIsUnix: false

And in ShouldVerifySignature:

func (w *Workceptor) ShouldVerifySignature(workType string, signWork bool) bool {
    // if work unit is remote, just get the signWork boolean from the
    // remote extra data field
    if workType == "remote" {
        return signWork
    }
    w.workTypesLock.RLock()
    fmt.Print("w: ", w, " workTypes: ", w.workTypes, "\n")

    wt, ok := w.workTypes[workType]
    w.workTypesLock.RUnlock()
    fmt.Print("w: ", w, " wt: ", wt, " ok: ", ok, "\n")

    if ok && wt.verifySignature {
        return true
    }

    return false
}

workTypes: map[kubeit:0xc000379e80 remote:0xc000379620]
w: &{0xc0002adb30 0x495fe0 0xc0001ef880 /tmp/receptor/192.168.21.54 0xc0000a74d0 map[kubeit:0xc000379e80 remote:0xc000379620] 0xc0000a74e8 map[]  5m0s /etc/receptor/work_public_key.pem}
wt: <nil>
ok: false

Am I correct that Receptor considers kubeit and remote the only valid workTypes here, while AWX is sending ansible-runner for the health check?
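The lookup behaviour shown in the debug output above can be reproduced in isolation. The following is a minimal sketch, not Receptor's actual code: `registry` and `workType` are hypothetical stand-ins for `Workceptor` and its internal work type record, and the real signature verification step is left out. It only illustrates that a work type missing from the map is treated the same as one with signature verification disabled, so a signed request for it fails with this error.

```go
package main

import (
	"errors"
	"fmt"
)

// workType is a hypothetical stand-in for Receptor's internal work type
// record; only the field relevant to this issue is modeled.
type workType struct {
	verifySignature bool
}

// registry is a simplified, hypothetical stand-in for Workceptor.
type registry struct {
	workTypes map[string]*workType
}

// shouldVerifySignature mirrors the logic quoted in the issue: a work type
// that is not registered yields false, exactly like one that has
// verifysignature turned off.
func (r *registry) shouldVerifySignature(name string, signWork bool) bool {
	if name == "remote" {
		return signWork
	}
	wt, ok := r.workTypes[name]
	return ok && wt.verifySignature
}

// processSignature keeps only the first check from the quoted function.
// Actual signature verification is omitted in this sketch.
func (r *registry) processSignature(name, signature string, signWork bool) error {
	if !r.shouldVerifySignature(name, signWork) && signature != "" {
		return errors.New("work type did not expect a signature")
	}
	return nil
}

func main() {
	// Same registered work types as in the debug output: kubeit and remote.
	r := &registry{workTypes: map[string]*workType{
		"kubeit": {verifySignature: true},
		"remote": {},
	}}

	// A signed request for an unregistered work type is rejected.
	fmt.Println(r.processSignature("ansible-runner", "signed-token", false))
	// The same request for the registered, verifying work type passes this check.
	fmt.Println(r.processSignature("kubeit", "signed-token", false))
}
```

In this sketch the first call returns the error from the issue and the second returns nil, which matches the reading that the message is really a symptom of an unknown work type.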

kurokobo commented 8 months ago

@diademiemi Hi,

> Our goal here is to run Receptor in a Kubernetes cluster so we can host execution and/or hop nodes in Kubernetes.

The current AWX implementation assumes that execution nodes are hosts where Ansible Runner runs locally and Podman is installed. So in the first place it is hard to run execution nodes in a Kubernetes cluster: if we select execution nodes for a job template, AWX sends a request to Ansible Runner on the execution node, which runs the execution environment by creating a container with Podman rather than on Kubernetes.

Alternatively, I recommend the following to achieve a similar goal: define a Container Group with credentials for the remote Kubernetes cluster. This lets AWX run the EE on the remote Kubernetes cluster: https://ansible.readthedocs.io/projects/awx/en/latest/administration/containers_instance_groups.html#create-a-container-group

Running a hop node on a Kubernetes cluster is not so hard, since a hop node is never used to invoke any commands. Neither Podman nor Ansible Runner is required. In addition, an "in-cluster hop node" feature called AWXMeshIngress will be implemented in the next release: https://github.com/ansible/awx/pull/14640

Here are my answers to your questions, for your technical interest:

If you are interested in more detail, my blog article may help (sorry, it is in Japanese, so please use a translator): https://blog.kurokobo.com/archives/4847 Or ask further questions on the forum: https://forum.ansible.com/

kurokobo commented 8 months ago

It would be appropriate to improve the error message, perhaps in an enhancement request on the Receptor side.
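As a rough illustration of what a clearer message could look like, here is a minimal sketch. It is not a patch against Receptor: `checkSignaturePolicy` and the `known` map (work type name mapped to its verifysignature setting) are hypothetical names invented for this example. The idea is only to report an unregistered work type separately from a registered one that does not expect a signature.

```go
package main

import "fmt"

// checkSignaturePolicy is a hypothetical helper, not part of Receptor.
// known maps each registered work type name to its verifysignature setting.
func checkSignaturePolicy(known map[string]bool, workType, signature string) error {
	verify, registered := known[workType]
	switch {
	case !registered:
		// The case hit in this issue: the requested work type is not configured.
		return fmt.Errorf("unknown work type %q: not defined in this node's Receptor configuration", workType)
	case !verify && signature != "":
		// The work type exists but was configured without signature verification.
		return fmt.Errorf("work type %q has verifysignature disabled but the request carried a signature", workType)
	}
	return nil
}

func main() {
	known := map[string]bool{"kubeit": true}
	fmt.Println(checkSignaturePolicy(known, "ansible-runner", "signed-token"))
}
```

With a message along these lines, the health check failure would point directly at the missing ansible-runner work type instead of at the signature.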

fosterseth commented 8 months ago

as @kurokobo mentioned, container groups are designed to achieve running jobs on remote k8s clusters

AWX expects an execution node to have a work-command called ansible-runner for health checks.

When running jobs, AWX uses this same work command as well. So even if you have a proper kubeit work-kubernetes setup in the config, AWX is not going to use it, sadly. Getting that working would require some changes in AWX.

Is there a use case for this that container groups don't cover?

diademiemi commented 8 months ago

Thank you for the detailed response! I understand a lot better now what this is doing under the hood.

I'll be checking out the AWXMeshIngress and Container Groups features today and tomorrow, and I'll get back to you on whether this covers our use case.