argoproj / argo-workflows

Workflow Engine for Kubernetes
https://argo-workflows.readthedocs.io/
Apache License 2.0

Server `CrashLoopBackOff` while calling `log` API without `container`: `panic: runtime error: invalid memory address or nil pointer dereference` #13585

Open cdtzabra opened 2 months ago

cdtzabra commented 2 months ago

Pre-requisites

What happened? What did you expect to happen?

While trying to fetch a workflow's logs with the API endpoint `api/v1/workflows/{namespace}/{name}/log` using curl, I received a `502 Bad Gateway` response. I checked the pod status and found that the server pods were in `CrashLoopBackOff`, followed by a restart.

 curl -H "Authorization: $ARGO_TOKEN" -X GET http://workflows.example.com/api/v1/workflows/argo-workflows/workflow-controlm-3oj4k/log
<html>
<head><title>502 Bad Gateway</title></head>
<body>
<center><h1>502 Bad Gateway</h1></center>
<hr><center>nginx</center>
</body>
</html>
k get po

NAME                                                  READY   STATUS             RESTARTS     AGE
argo-workflows-server-587c6d6f58-f8w2j                0/1     CrashLoopBackOff   3 (7s ago)   8d
argo-workflows-server-587c6d6f58-nfjbf                0/1     CrashLoopBackOff   3 (8s ago)   8d
argo-workflows-workflow-controller-5fd7d444fc-cfxmz   1/1     Running            0            8d
workflow-controlm-3oj4k-build-4246951965              0/3     Completed          0            113m
workflow-controlm-3oj4k-create-job-1156902314         0/1     Completed          0            112m

A few moments later, after the restart (RESTARTS went from 3 to 4):

k get po
NAME                                                  READY   STATUS      RESTARTS        AGE
argo-workflows-server-587c6d6f58-f8w2j                1/1     Running     4 (2m19s ago)   8d
argo-workflows-server-587c6d6f58-nfjbf                1/1     Running     4 (2m20s ago)   8d
argo-workflows-workflow-controller-5fd7d444fc-cfxmz   1/1     Running     0               8d
workflow-controlm-3oj4k-build-4246951965              0/3     Completed   0               115m
workflow-controlm-3oj4k-create-job-1156902314         0/1     Completed   0               114m

Looking at the previous logs, we can clearly see that the pod panics with `panic: runtime error: invalid memory address or nil pointer dereference` because the `container` parameter is missing from the request.
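For comparison, the underlying message is the standard Kubernetes error for multi-container pods; querying the pod's log subresource directly without a container name returns the same thing (a sketch, using the pod name from above):

kubectl get --raw "/api/v1/namespaces/argo-workflows/pods/workflow-controlm-3oj4k-build-4246951965/log"
# Error from server (BadRequest): a container name must be specified for pod workflow-controlm-3oj4k-build-4246951965, choose one of: [init wait prerequisistes main]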

time="2024-09-10T13:12:34.206Z" level=info duration=13.270867ms method=GET path=/api/v1/workflows/argo-workflows/workflow-controlm-3oj4k size=18650 status=0
time="2024-09-10T13:12:39.588Z" level=info duration="158.928µs" method=GET path=index.html size=487 status=0
time="2024-09-10T13:12:52.924Z" level=error msg="a container name must be specified for pod workflow-controlm-3oj4k-build-4246951965, choose one of: [init wait prerequisistes main]" namespace=argo-workflows podName=workflow-controlm-3oj4k-build-4246951965 workflow=workflow-controlm-3oj4k
time="2024-09-10T13:12:52.925Z" level=error msg="a container name must be specified for pod workflow-controlm-3oj4k-notify-862466002, choose one of: [init wait main]" namespace=argo-workflows podName=workflow-controlm-3oj4k-notify-862466002 workflow=workflow-controlm-3oj4k
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x48 pc=0x1e8c8a4]

When I specify the container with `?logOptions.container=main`, it works fine.
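For reference, the working call is the same request with the query parameter appended (same placeholder host and token as above):

curl -H "Authorization: $ARGO_TOKEN" "http://workflows.example.com/api/v1/workflows/argo-workflows/workflow-controlm-3oj4k/log?logOptions.container=main"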

However, I think the argo-workflows-server pod should not `CrashLoopBackOff` for this reason. It would be very impactful in production if simple bad API calls could crash the whole argo-workflows server.

Version(s)

v3.5.10

Paste a minimal workflow that reproduces the issue. We must be able to run the workflow; don't submit a workflow that uses private images.

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: graph-
  labels:
    workflows.argoproj.io/test: "true"
  annotations:
    workflows.argoproj.io/description: |
      This workflow demonstrates running a graph of tasks within containers in a single pod.
    workflows.argoproj.io/version: ">= 3.1.0"
spec:
  entrypoint: main
  templates:
    - name: main
      containerSet:
        containers:
          - name: a
            image: argoproj/argosay:v2
          - name: b
            image: argoproj/argosay:v2
            dependencies: ["a"]
          - name: c
            image: argoproj/argosay:v2
            dependencies: ["a"]
          - name: main
            image: argoproj/argosay:v2
            dependencies: ["b", "c"]

Logs from the workflow controller

kubectl logs -n argo-workflows deploy/argo-workflows-workflow-controller | grep ${workflow}
Found 2 pods, using pod/argo-workflows-workflow-controller-5fd7d444fc-dk66f

kubectl logs -n argo-workflows argo-workflows-workflow-controller-xxx-cfxmz  | grep ${workflow}

kubectl logs -n argo-workflows argo-workflows-workflow-controller-xxx-dk66f  | grep ${workflow}

Logs from your workflow's wait container

kubectl logs -n argo-workflows -c wait -l workflows.argoproj.io/workflow=${workflow},workflow.argoproj.io/phase!=Succeeded
error: container wait is not valid for pod workflow-controlm-3oj4k-create-job-1156902314
agilgur5 commented 2 months ago

Looking at the previous logs, we can clearly see that the pod panics with panic: runtime error: invalid memory address or nil pointer dereference because the container parameter is missing from the request.

There should be a stack trace following that log pointing to the exact line of the nil pointer

It would be very impactful in production if simple bad API calls could crash the whole argo-workflows server

It normally doesn't; I suspect there's a missing `recover` statement somewhere
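To illustrate the distinction, a minimal Go sketch (not the actual server code): a panic inside a spawned goroutine terminates the whole process unless that goroutine defers its own recover; a recover in the caller does not help.

package main

import "fmt"

func main() {
	done := make(chan struct{})
	go func() {
		defer close(done)
		// Without this deferred recover, the nil dereference below would
		// take down the entire process, not just this goroutine.
		defer func() {
			if r := recover(); r != nil {
				fmt.Println("recovered:", r)
			}
		}()
		var p *struct{ n int }
		_ = p.n // nil pointer dereference -> panic
	}()
	<-done
	fmt.Println("process still alive")
}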

cdtzabra commented 2 months ago

There should be a stack trace following that log pointing to the exact line of the nil pointer

Unfortunately, there were no further logs concerning the error.

isubasinghe commented 2 months ago

Can you run https://pkg.go.dev/cmd/addr2line with the program counter as an argument along with the binary?
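For example (a sketch; the binary path is illustrative, and the pc value is taken from the panic above — cmd/addr2line reads hex addresses from standard input):

echo 0x1e8c8a4 | go tool addr2line /path/to/argo
# prints the function name and file:line for that program counter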

djanjic commented 2 months ago

Hello, can I work on this one?

tooptoop4 commented 2 weeks ago

@cdtzabra can you run with the debug log level turned on? I also raised https://github.com/argoproj/argo-workflows/pull/13866 if you want to give that a whirl

cdtzabra commented 2 weeks ago

@cdtzabra can you run with the debug log level turned on? I also raised #13866 if you want to give that a whirl

Hi,

Yes, I can. Next week

cdtzabra commented 1 week ago

@tooptoop4

I've just tested. I first tried in the lab and could not reproduce the problem despite several attempts. I removed the resource requests/limits that had been added, and still no CrashLoopBackOff.

So I switched to the pre-production argo-workflows (where I'd encountered the problem), tested our real WorkflowTemplate, and the CrashLoopBackOff was there.

Re-running the tests with the basic WorkflowTemplates used in the lab, there's no problem.

I put the preprod pods in debug mode and redid the crash test, but there's no additional info in the logs.

Note: for the basic templates, the msg="a container name must be specified for pod workflow-xxx" error is still present in the server log, BUT there is no crash.

The server pods were configured with debug logging as follows:

    spec:
      containers:
      - args:
        - server
        - --configmap=argo-workflows-workflow-controller-configmap
        - --auth-mode=client
        - --auth-mode=sso
        - --secure=false
        - --namespaced
        - --loglevel
        - debug
        - --gloglevel
        - "0"
        - --log-format
        - text

workflow server pod1 log

time="2024-11-18T10:16:33.744Z" level=info msg="finished unary call with code OK" grpc.code=OK grpc.method=GetWorkflowTemplate grpc.service=workflowtemplate.WorkflowTemplateService grpc.start_time="2024-11-18T10:16:33Z" grpc.time_ms=13.259 span.kind=server system=grpc
time="2024-11-18T10:16:33.747Z" level=info duration=17.33487ms method=GET path=/api/v1/workflow-templates/argo-workflows/workflow-controlm size=6690 status=0
time="2024-11-18T10:16:48.263Z" level=info duration="112.732µs" method=GET path=index.html size=487 status=0
time="2024-11-18T10:17:08.263Z" level=info duration="181.525µs" method=GET path=index.html size=487 status=0
time="2024-11-18T10:17:28.263Z" level=info duration="146.051µs" method=GET path=index.html size=487 status=0
time="2024-11-18T10:17:48.264Z" level=info duration="139.866µs" method=GET path=index.html size=487 status=0
time="2024-11-18T10:18:08.264Z" level=info duration="347.373µs" method=GET path=index.html size=487 status=0
time="2024-11-18T10:18:28.263Z" level=info duration="315.198µs" method=GET path=index.html size=487 status=0
time="2024-11-18T10:18:41.558Z" level=debug msg="List options" namespace=argo-workflows options="{{ } workflows.argoproj.io/workflow=workflow-controlm-44vpo  false false   <nil> 0 }" workflow=workflow-controlm-44vpo
time="2024-11-18T10:18:41.558Z" level=debug msg="Log options" namespace=argo-workflows options="&PodLogOptions{Container:,Follow:false,Previous:false,SinceSeconds:nil,SinceTime:<nil>,Timestamps:false,TailLines:nil,LimitBytes:nil,InsecureSkipTLSVerifyBackend:false,}" workflow=workflow-controlm-44vpo
time="2024-11-18T10:18:41.560Z" level=debug msg="Listing workflow pods" namespace=argo-workflows workflow=workflow-controlm-44vpo
time="2024-11-18T10:18:41.570Z" level=debug msg="Ensuring pod logs stream" alreadyStreaming=false namespace=argo-workflows podName=workflow-controlm-44vpo-build-1554548131 podPhase=Succeeded workflow=workflow-controlm-44vpo
time="2024-11-18T10:18:41.570Z" level=debug msg="Ensuring pod logs stream" alreadyStreaming=false namespace=argo-workflows podName=workflow-controlm-44vpo-create-job-3142162940 podPhase=Succeeded workflow=workflow-controlm-44vpo
time="2024-11-18T10:18:41.570Z" level=debug msg="Ensuring pod logs stream" alreadyStreaming=false namespace=argo-workflows podName=workflow-controlm-44vpo-notify-204188948 podPhase=Succeeded workflow=workflow-controlm-44vpo
time="2024-11-18T10:18:41.570Z" level=debug msg="Not starting watches" namespace=argo-workflows workflow=workflow-controlm-44vpo
time="2024-11-18T10:18:41.570Z" level=debug msg="Waiting for work-group" namespace=argo-workflows workflow=workflow-controlm-44vpo
time="2024-11-18T10:18:41.570Z" level=debug msg="Sorting entries" namespace=argo-workflows workflow=workflow-controlm-44vpo
time="2024-11-18T10:18:41.570Z" level=debug msg="Streaming pod logs" namespace=argo-workflows podName=workflow-controlm-44vpo-create-job-3142162940 workflow=workflow-controlm-44vpo
time="2024-11-18T10:18:41.570Z" level=debug msg="Streaming pod logs" namespace=argo-workflows podName=workflow-controlm-44vpo-build-1554548131 workflow=workflow-controlm-44vpo
time="2024-11-18T10:18:41.570Z" level=debug msg="Streaming pod logs" namespace=argo-workflows podName=workflow-controlm-44vpo-notify-204188948 workflow=workflow-controlm-44vpo
time="2024-11-18T10:18:41.575Z" level=error msg="a container name must be specified for pod workflow-controlm-44vpo-notify-204188948, choose one of: [init wait main]" namespace=argo-workflows podName=workflow-controlm-44vpo-notify-204188948 workflow=workflow-controlm-44vpo
time="2024-11-18T10:18:41.575Z" level=debug msg="Pod logs stream done" namespace=argo-workflows podName=workflow-controlm-44vpo-notify-204188948 workflow=workflow-controlm-44vpo
time="2024-11-18T10:18:41.576Z" level=error msg="a container name must be specified for pod workflow-controlm-44vpo-build-1554548131, choose one of: [init wait prerequisistes main]" namespace=argo-workflows podName=workflow-controlm-44vpo-build-1554548131 workflow=workflow-controlm-44vpo
time="2024-11-18T10:18:41.576Z" level=debug msg="Pod logs stream done" namespace=argo-workflows podName=workflow-controlm-44vpo-build-1554548131 workflow=workflow-controlm-44vpo
time="2024-11-18T10:18:41.594Z" level=debug msg="Pod logs stream done" namespace=argo-workflows podName=workflow-controlm-44vpo-create-job-3142162940 workflow=workflow-controlm-44vpo
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x48 pc=0x1e8c8a4]

goroutine 373 [running]:
github.com/argoproj/argo-workflows/v3/util/logs.WorkflowLogs.func1.1({0xc0007879e0, 0x2d})
        /go/src/github.com/argoproj/argo-workflows/util/logs/workflow-logger.go:167 +0x524
created by github.com/argoproj/argo-workflows/v3/util/logs.WorkflowLogs.func1 in goroutine 224
        /go/src/github.com/argoproj/argo-workflows/util/logs/workflow-logger.go:126 +0x5e5

workflow server pod2 log

time="2024-11-18T10:18:41.650Z" level=debug msg="Not starting watches" namespace=argo-workflows workflow=workflow-controlm-44vpo
time="2024-11-18T10:18:41.650Z" level=debug msg="Waiting for work-group" namespace=argo-workflows workflow=workflow-controlm-44vpo
time="2024-11-18T10:18:41.650Z" level=debug msg="Sorting entries" namespace=argo-workflows workflow=workflow-controlm-44vpo
time="2024-11-18T10:18:41.650Z" level=debug msg="Streaming pod logs" namespace=argo-workflows podName=workflow-controlm-44vpo-notify-204188948 workflow=workflow-controlm-44vpo
time="2024-11-18T10:18:41.650Z" level=debug msg="Streaming pod logs" namespace=argo-workflows podName=workflow-controlm-44vpo-create-job-3142162940 workflow=workflow-controlm-44vpo
time="2024-11-18T10:18:41.650Z" level=debug msg="Streaming pod logs" namespace=argo-workflows podName=workflow-controlm-44vpo-build-1554548131 workflow=workflow-controlm-44vpo
time="2024-11-18T10:18:41.653Z" level=error msg="a container name must be specified for pod workflow-controlm-44vpo-notify-204188948, choose one of: [init wait main]" namespace=argo-workflows podName=workflow-controlm-44vpo-notify-204188948 workflow=workflow-controlm-44vpo
time="2024-11-18T10:18:41.653Z" level=debug msg="Pod logs stream done" namespace=argo-workflows podName=workflow-controlm-44vpo-notify-204188948 workflow=workflow-controlm-44vpo
time="2024-11-18T10:18:41.654Z" level=error msg="a container name must be specified for pod workflow-controlm-44vpo-build-1554548131, choose one of: [init wait prerequisistes main]" namespace=argo-workflows podName=workflow-controlm-44vpo-build-1554548131 workflow=workflow-controlm-44vpo
time="2024-11-18T10:18:41.654Z" level=debug msg="Pod logs stream done" namespace=argo-workflows podName=workflow-controlm-44vpo-build-1554548131 workflow=workflow-controlm-44vpo
time="2024-11-18T10:18:41.669Z" level=debug msg="Pod logs stream done" namespace=argo-workflows podName=workflow-controlm-44vpo-create-job-3142162940 workflow=workflow-controlm-44vpo
time="2024-11-18T10:18:41.670Z" level=debug msg="Work-group done" namespace=argo-workflows workflow=workflow-controlm-44vpo
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x48 pc=0x1e8c8a4]

goroutine 287 [running]:
github.com/argoproj/argo-workflows/v3/util/logs.WorkflowLogs.func1.1({0xc00054a5d0, 0x2d})
        /go/src/github.com/argoproj/argo-workflows/util/logs/workflow-logger.go:167 +0x524
created by github.com/argoproj/argo-workflows/v3/util/logs.WorkflowLogs.func1 in goroutine 294
        /go/src/github.com/argoproj/argo-workflows/util/logs/workflow-logger.go:126 +0x5e5

So I wonder whether the problem is related to the template types used. The WorkflowTemplate that causes the workflow-server to crash also uses `artifact` and `resource` templates (the latter to create a Kubernetes Job), among other things:

apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: workflow-controlm
  namespace: argo-workflows
spec:
  ttlStrategy:
    # keep completed workflows for 1d
    secondsAfterCompletion: 86400
  workflowMetadata:
    annotations:
      workflows.argoproj.io/title: "{{workflow.parameters.DISPLAY_NAME}}"
    labels:
      workflows.argoproj.io/title: "{{workflow.parameters.DISPLAY_NAME}}"
  entrypoint: main-template
  onExit: exit-handler
  serviceAccountName: workflow-pods-sa
  arguments:
    parameters:
      - name: PROJECT_URL
        value: ""
      - name: PROJECT_BRANCH
        value: ""
      - name: CONTROLM_JOB
        value: ""
      - name: CONTROLM_TASK
        value: ""
      - name: CONTROLM_CTMS
        value: ""
      - name: TEMPLATE_NAME
        value: ""
      - name: BUILD_URL
        value: ""
      - name: DISPLAY_NAME
        value: "to-overload"
  volumes:
    - name: workspace
      emptyDir: {}
  templates:
    - name: main-template
      steps:
        - - name: build
            template: build
            arguments:
              parameters:
                - name: PROJECT_URL
                  value: "{{workflow.parameters.PROJECT_URL}}"
                - name: PROJECT_BRANCH
                  value: "{{workflow.parameters.PROJECT_BRANCH}}"
                - name: CONTROLM_TASK
                  value: "{{workflow.parameters.CONTROLM_TASK}}"
                - name: TEMPLATE_NAME
                  value: "{{workflow.parameters.TEMPLATE_NAME}}"

        - - name: create-job
            template: create-job
            arguments:
              artifacts:
              - name: kubejob
                raw:
                  data: |
                    {{workflow.outputs.parameters.KUBEJOB}}

    # Steps template definition: Build
    - name: build
      inputs:
        parameters:
          - name: PROJECT_URL
          - name: PROJECT_BRANCH
          - name: CONTROLM_TASK
          - name: TEMPLATE_NAME
      outputs:
        parameters:
          - name: namespace
            valueFrom:
              path: /workspace/namespace.txt
            globalName: NAMESPACE
          - name: kube_job
            valueFrom:
              path: /workspace/kube-job.yaml
            globalName: KUBEJOB
          - name: backofflimit
            valueFrom:
              path: /workspace/backofflimit.txt
            globalName: backoffLimit           
      containerSet:
        volumeMounts:
          - name: workspace
            mountPath: /workspace
        containers:
          - name: prerequisistes
            image: myregistry/tools:latest
            envFrom:
              - secretRef:
                  name: gitlab-a1963-read
            command:
              - sh
              - '-c'
              - |
                set -o errexit
                cd /workspace

                gitUrl="{{inputs.parameters.PROJECT_URL}}"
                gitBranch="{{inputs.parameters.PROJECT_BRANCH}}"
                codeApp=$(basename $(dirname {{inputs.parameters.PROJECT_URL}}))
                echo "${codeApp}-${gitBranch}" > namespace.txt

                git clone -b ${gitBranch} ${gitUrl} chart

          - name: main
            dependencies:
              - prerequisistes
            image: myregistry/tools:latest
            command:
              - sh
              - '-c'
              - |
                set -o errexit
                cd /workspace/

                printf "\n### Templating the Helm Chart ###\n"
                taskPath="chart/{{inputs.parameters.CONTROLM_TASK}}"
                templateName="{{inputs.parameters.TEMPLATE_NAME}}"
                outputDir="./myfolder"

                helm template ${taskPath}/${templateName}/ -f ${taskPath}/app-values.yaml -f ${taskPath}/infra-values.yaml --output-dir ${outputDir} --dependency-update

                # Store data to export them as global variables
                find ${outputDir}  -type f -iname "job*.yaml" | xargs cat > kube-job.yaml
                yq .spec.backoffLimit kube-job.yaml > backofflimit.txt

                namespace=$(cat namespace.txt)
                get_ns=$(kubectl get -o name --ignore-not-found ns ${namespace})
                if [ -z $get_ns ]; then kubectl create ns ${namespace}; fi

                printf "\n### Creating configmaps - Secrets - AVP ###\n"
                find ${outputDir}  -type f -iname "configmap-*.yaml" -o -iname "secret*.yaml" | while read -r line; do 
                cat ${line} | kubectl -n ${namespace} apply -f -
                done

    # Steps template definition: Argo App
    - name: create-job
      inputs:
        artifacts:
        - name: kubejob
          path: /artifacts/kubejob
      resource:
        action: create
        successCondition: status.succeeded > 0
        failureCondition: status.failed > {{workflow.outputs.parameters.backoffLimit}}
        flags: ["--namespace={{workflow.outputs.parameters.NAMESPACE}}"]
        manifest: |
          {{workflow.outputs.parameters.KUBEJOB}}
      outputs:
        parameters:
        - name: job-status
          valueFrom:
            jsonPath: '{.status}'

    # Exit handler templates
    # After the completion of the entrypoint template
    - name: exit-handler
      steps:
      - - name: Success-Notifier
          template: notify
          arguments:
            parameters:
              - name: exitCode
                value: 0
          when: "{{workflow.status}} == Succeeded && {{workflow.parameters.PROJECT_BRANCH}} != dev"
        - name: Failure-Notifier
          template: notify
          arguments:
            parameters:
              - name: exitCode
                value: 1
          when: "{{workflow.status}} != Succeeded && {{workflow.parameters.PROJECT_BRANCH}} != dev"

    # notify template definition
    - name: notify
      inputs:
        parameters:
          - name: exitCode
      container:
        image: alpine/curl
        envFrom:
          - secretRef:
              name: controlm
        command:
          - sh
          - '-c'
          - |
            exitCode="{{inputs.parameters.exitCode}}"
            job="{{workflow.parameters.CONTROLM_JOB}}"
            task="{{workflow.parameters.CONTROLM_TASK}}"
            ctms="{{workflow.parameters.CONTROLM_CTMS}}"
            message="xxxx"
            # POST exitCode HERE
tooptoop4 commented 1 week ago

@cdtzabra did u build the image locally with the https://github.com/argoproj/argo-workflows/pull/13866 fix in it?

cdtzabra commented 1 week ago

@tooptoop4 No, but I will do it...