argoproj / argo-workflows

Workflow Engine for Kubernetes
https://argo-workflows.readthedocs.io/
Apache License 2.0

Workflows remain if artifactgctask for different workflow from same cronworkflow fails #12621

Open bh-tt opened 8 months ago

bh-tt commented 8 months ago


What happened/what did you expect to happen?

A cronworkflow running every 15 minutes had a single workflow that failed to delete its artifacts about 1 week ago. Since then, all other workflows made by the same cronworkflow are still present, despite those having different artifact keys (set as workflow UID/workflow-name) and the argowf controller attempting to delete them. The other workflows are being deleted (they have a metadata.deletionTimestamp) but their finalizer is not removed.

We have set spec.artifactGC.forceFinalizerRemoval: true.

I expect only the workflow that could not delete its artifacts to remain, not all future workflows made by the same cronworkflow.
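
The stuck state described above (metadata.deletionTimestamp is set, but the finalizer is still present) can be checked for mechanically. The following is a minimal sketch, not part of the original report, that filters the parsed output of `kubectl get workflows -o json`. The finalizer name `workflows.argoproj.io/artifact-gc`, the helper name `stuck_workflows`, and the sample data are assumptions for illustration; compare the finalizer name with what a stuck workflow actually shows.

```python
# Assumed name of the artifact GC finalizer; verify it against a stuck workflow.
ARTIFACT_GC_FINALIZER = "workflows.argoproj.io/artifact-gc"


def stuck_workflows(workflow_list, finalizer=ARTIFACT_GC_FINALIZER):
    """Return names of workflows that are being deleted but still hold the finalizer.

    `workflow_list` is the parsed JSON of `kubectl get workflows -o json`.
    """
    stuck = []
    for item in workflow_list.get("items", []):
        meta = item.get("metadata", {})
        being_deleted = meta.get("deletionTimestamp") is not None
        # `finalizers` may be absent or null, so fall back to an empty list.
        has_finalizer = finalizer in (meta.get("finalizers") or [])
        if being_deleted and has_finalizer:
            stuck.append(meta.get("name"))
    return stuck


# Hypothetical sample data: only the first workflow is stuck.
sample = {
    "items": [
        {"metadata": {"name": "example-wf-1",
                      "deletionTimestamp": "2024-02-05T08:00:00Z",
                      "finalizers": [ARTIFACT_GC_FINALIZER]}},
        {"metadata": {"name": "example-wf-2",
                      "finalizers": [ARTIFACT_GC_FINALIZER]}},
    ]
}
print(stuck_workflows(sample))
```

In practice the input would come from `json.load` on the kubectl output rather than from an inline dictionary.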

Version

v3.5.1

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

This is the workflowartifactgctask:
---
apiVersion: argoproj.io/v1alpha1
kind: WorkflowArtifactGCTask
metadata:
  creationTimestamp: "2024-01-29T10:34:25Z"
  generation: 1
  labels:
    workflows.argoproj.io/artifact-gc-pod: "402248588"
  name: <name>-1706524200-artgc-wfdel-283262784-0
  namespace: <ns>
  ownerReferences:
  - apiVersion: argoproj.io/v1alpha1
    blockOwnerDeletion: true
    controller: true
    kind: Workflow
    name: <name>-1706524200
    uid: 67472304-d23a-410b-a789-fd8a82329ca7
  resourceVersion: "712684968"
  uid: c233ae6a-c47d-4b62-b7a7-39770cff4075
spec:
  artifactsByNode:
    <name>-1706524200-1733768575:
      archiveLocation:
        archiveLogs: false
        s3:
          accessKeySecret:
            key: AWS_ACCESS_KEY_ID
            name: argowf-artifacts
          bucket: argowf-artifacts-041ebc00-a532-4469-accc-8b74fe76e0a3
          endpoint: rook-ceph-rgw-ceph-obj-hdd.rook-ceph.svc.cluster.local
          insecure: true
          key: '{{workflow.namespace}}/{{workflow.name}}/{{pod.name}}'
          secretKeySecret:
            key: AWS_SECRET_ACCESS_KEY
            name: argowf-artifacts
      artifacts:
        remotefiles:
          name: remotefiles
          path: /tmp/remotefiles.json
          s3:
            key: 67472304-d23a-410b-a789-fd8a82329ca7/<name>
    <name>-1706524200-2516494996:
      archiveLocation:
        archiveLogs: false
        s3:
          accessKeySecret:
            key: AWS_ACCESS_KEY_ID
            name: argowf-artifacts
          bucket: argowf-artifacts-041ebc00-a532-4469-accc-8b74fe76e0a3
          endpoint: rook-ceph-rgw-ceph-obj-hdd.rook-ceph.svc.cluster.local
          insecure: true
          key: '{{workflow.namespace}}/{{workflow.name}}/{{pod.name}}'
          secretKeySecret:
            key: AWS_SECRET_ACCESS_KEY
            name: argowf-artifacts
      artifacts:
        remotefiles:
          name: remotefiles
          path: /tmp/remotefiles.json
          s3:
            key: 67472304-d23a-410b-a789-fd8a82329ca7/<name>
            key: 67472304-d23a-410b-a789-fd8a82329ca7/<name>-remotefiles
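
To check that each workflow's GC task only references keys under its own workflow UID, the artifact keys can be pulled out of a task like the one above. This is a rough sketch that assumes the task has already been parsed into a Python dictionary (for example from `kubectl get workflowartifactgctask -o json`); the function name `artifact_keys_by_node` and the sample values are made up for illustration and are not taken from the report.

```python
def artifact_keys_by_node(task):
    """Map each node ID in spec.artifactsByNode to the S3 keys listed for it."""
    result = {}
    by_node = task.get("spec", {}).get("artifactsByNode") or {}
    for node_id, entry in by_node.items():
        keys = []
        for artifact in (entry.get("artifacts") or {}).values():
            s3 = artifact.get("s3") or {}
            if "key" in s3:
                keys.append(s3["key"])
        result[node_id] = sorted(keys)
    return result


# Hypothetical task with the same shape as the manifest above.
example_task = {
    "spec": {
        "artifactsByNode": {
            "example-wf-1-111": {
                "artifacts": {
                    "remotefiles": {
                        "name": "remotefiles",
                        "path": "/tmp/remotefiles.json",
                        "s3": {"key": "example-uid/example-wf"},
                    }
                }
            },
            "example-wf-1-222": {"artifacts": {}},
        }
    }
}
print(artifact_keys_by_node(example_task))
```

Comparing the key prefixes with the UID in ownerReferences shows whether a task could touch another workflow's artifacts.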

Logs from the workflow controller

kubectl logs -n argo deploy/workflow-controller | grep ${workflow}

bh@devrd0 ~ (⎈|secnet-ams17:dev-it) $ kubectl logs -n argowf --context secnet-ams17 deploy/argo-workflows-controller | grep parsecitiworkflow-javacronworkflow-1706524200
Found 2 pods, using pod/argo-workflows-controller-6456c74555-m946z
time="2024-02-05T08:26:39.157Z" level=info msg="Processing workflow" namespace=ops-parsers workflow=parsecitiworkflow-javacronworkflow-1706524200
time="2024-02-05T08:30:00.142Z" level=error msg="was unable to obtain node for parsecitiworkflow-javacronworkflow-1706524200-1912353617"
time="2024-02-05T08:30:00.150Z" level=error msg="was unable to obtain node for parsecitiworkflow-javacronworkflow-1706524200-2046427474"
time="2024-02-05T08:30:00.159Z" level=error msg="was unable to obtain node for parsecitiworkflow-javacronworkflow-1706524200-368518479"
time="2024-02-05T08:30:00.174Z" level=error msg="was unable to obtain node for parsecitiworkflow-javacronworkflow-1706524200-2337909954"
time="2024-02-05T08:30:00.182Z" level=error msg="was unable to obtain node for parsecitiworkflow-javacronworkflow-1706524200-2203836097"
time="2024-02-05T08:30:00.188Z" level=error msg="was unable to obtain node for parsecitiworkflow-javacronworkflow-1706524200-3881745092"
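
When many workflows are stuck, it can help to count these errors per workflow instead of reading the log line by line. The sketch below extracts the node IDs from "was unable to obtain node for" messages and groups them by workflow. It assumes that a node ID is the workflow name followed by a hyphen and a numeric suffix, as in the lines above; the function names are hypothetical.

```python
import re
from collections import Counter

# Matches the controller error shown in the log excerpt above.
NODE_ERROR = re.compile(
    r'level=error msg="was unable to obtain node for (?P<node>[^"]+)"'
)


def missing_nodes(log_text):
    """Return the node IDs named in 'was unable to obtain node for' errors."""
    return [m.group("node") for m in NODE_ERROR.finditer(log_text)]


def errors_per_workflow(log_text):
    """Count errors per workflow, assuming node ID = '<workflow>-<suffix>'."""
    return Counter(node.rsplit("-", 1)[0] for node in missing_nodes(log_text))


# Two lines copied from the log excerpt above.
sample_log = (
    'time="2024-02-05T08:30:00.142Z" level=error msg="was unable to obtain node '
    'for parsecitiworkflow-javacronworkflow-1706524200-1912353617"\n'
    'time="2024-02-05T08:30:00.150Z" level=error msg="was unable to obtain node '
    'for parsecitiworkflow-javacronworkflow-1706524200-2046427474"\n'
)
print(errors_per_workflow(sample_log))
```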

Logs from your workflow's wait container

kubectl logs -n argo -c wait -l workflows.argoproj.io/workflow=${workflow},workflows.argoproj.io/phase!=Succeeded

Sorry, those pods were deleted a while ago.

Garett-MacGowan commented 8 months ago

I could take a look at this some time over the next few days.

Garett-MacGowan commented 8 months ago

@bh-tt

To clarify, the subsequent run artifacts are being garbage collected, right? Is the problem only that the finalizer is getting stuck for subsequent runs?

Garett-MacGowan commented 8 months ago

Can you please provide kubectl describe (redacted if necessary) of the first failure & one of the subsequent failures?

bh-tt commented 7 months ago

Sorry @Garett-MacGowan, somehow the GitHub notification emails for your responses got lost. At this point I no longer have a failing example to describe, but if we encounter this again we will add it to this issue.

'To clarify, the subsequent run artifacts are being garbage collected, right? Is the problem only that the finalizer is getting stuck for subsequent runs?'

I have not actually checked whether the other artifacts were still present, but given the number of stuck workflows I would have expected our S3 bucket to be full if that were the case. The problem seems to be that the finalizer is stuck for subsequent runs.