argoproj / argo-workflows

Workflow Engine for Kubernetes
https://argo-workflows.readthedocs.io/
Apache License 2.0

Retrying loading of S3 artifacts breaks if the artifact is a directory #10988

Open nilsalex opened 1 year ago

nilsalex commented 1 year ago


What happened/what you expected to happen?

I run workflows using a very unstable MinIO instance as the artifact repository. Under load, S3 requests against this MinIO tend to lose their connections, which leads to retries in the init/wait containers.

I noticed that for directory artifacts, these retries always produce a non-transient error:

artifact myArtifact failed to load: failed to get file: fileName is a directory.

The root cause seems to be that after the first attempt, the directory myArtifact.tmp has already been created. During the retry, the executor first tries to download the artifact as a file:

https://github.com/argoproj/argo-workflows/blob/0adba4b3db288e9222814055937588ad0c601d85/workflow/artifacts/s3/s3.go#L85-L94

This call then fails inside the minio-go client:

https://github.com/minio/minio-go/blob/0be3a44757352b6e617ef00eb47829bce29baab1/api-get-object-file.go#L51-L57
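For illustration, here is a minimal Go sketch of the failure mode (the path is made up; this is not the executor's code). minio-go's FGetObject stats the destination path before downloading and refuses to write to it if it is a directory, which is exactly the state the interrupted first attempt leaves behind:

package main

import (
	"fmt"
	"os"
)

func main() {
	// Hypothetical leftover from the first, interrupted download attempt:
	// GetDirectory already created the temporary target directory.
	tmpPath := "/tmp/my_artifact.tmp"
	if err := os.MkdirAll(tmpPath, 0o755); err != nil {
		panic(err)
	}

	// On retry, the executor calls GetFile first, which delegates to
	// minio-go's FGetObject. FGetObject stats the destination and bails
	// out if it is a directory; the same check is reproduced here.
	if st, err := os.Stat(tmpPath); err == nil && st.IsDir() {
		fmt.Println("fileName is a directory.") // surfaces as the non-transient error
	}
}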

Version

v3.4.6

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

This workflow would run into the issue if a retry occurred:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: directory-artifact-example-
spec:
  entrypoint: generate-and-consume-artifact
  templates:
  - name: generate-and-consume-artifact
    steps:
    - - name: generate-artifact
        template: generate-artifact
    - - name: consume-artifact
        template: consume-artifact
        arguments:
          artifacts:
          - name: my-artifact
            from: "{{steps.generate-artifact.outputs.artifacts.my-artifact}}"

  - name: generate-artifact
    container:
      image: alpine:latest
      command: [sh, -c]
      args: ["mkdir -p /my_artifact; touch /my_artifact/file1; touch /my_artifact/file2"]
    outputs:
      artifacts:
      - name: my-artifact
        path: /my_artifact
        archive:
          none: {}
        s3:
          key: 'directory-artifact-example/my-artifact/'

  - name: consume-artifact
    inputs:
      artifacts:
      - name: my-artifact
        path: /my_artifact
    container:
      image: alpine:latest
      command: [sh, -c]
      args: ["ls -la /my_artifact"]

To simulate the state in which, after a first unsuccessful download, the directory my_artifact.tmp has already been created, I changed the workflow as follows:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: directory-artifact-example-
spec:
  entrypoint: generate-and-consume-artifact
  volumeClaimTemplates:
    - metadata:
        name: volume
      spec:
        accessModes: [ "ReadWriteOnce" ]
        resources:
          requests:
            storage: 1Gi
  templates:
  - name: generate-and-consume-artifact
    steps:
    - - name: generate-artifact
        template: generate-artifact
    - - name: prepare-empty-directory
        template: prepare-empty-directory
    - - name: consume-artifact
        template: consume-artifact
        arguments:
          artifacts:
          - name: my-artifact
            from: "{{steps.generate-artifact.outputs.artifacts.my-artifact}}"

  - name: generate-artifact
    container:
      image: alpine:latest
      command: [sh, -c]
      args: ["mkdir -p /my_artifact; touch /my_artifact/file1; touch /my_artifact/file2"]
    outputs:
      artifacts:
      - name: my-artifact
        path: /my_artifact
        archive:
          none: {}
        s3:
          key: 'directory-artifact-example/my-artifact/'

  - name: prepare-empty-directory
    container:
      image: alpine:latest
      command: [sh, -c]
      args: ["mkdir /mount/my_artifact.tmp"]
      volumeMounts:
        - name: volume
          mountPath: /mount

  - name: consume-artifact
    inputs:
      artifacts:
      - name: my-artifact
        path: /mount/my_artifact
    container:
      image: alpine:latest
      command: [sh, -c]
      args: ["ls -la /mount/my_artifact"]
      volumeMounts:
        - name: volume
          mountPath: /mount

Logs from the workflow controller

I cannot easily get logs, but I hope they are not really relevant for this problem.

Logs from your workflow's wait container

These logs are from the init container. You can see the transient error, followed by the non-transient error from the retry.

{"level":"info","msg":"Starting Workflow Executor","time":"2023-04-25T15:13:44.931Z","version":"v3.4.6"}
{"Duration":1000000000,"Factor":1.6,"Jitter":0.5,"Steps":5,"level":"info","msg":"Using executor retry strategy","time":"2023-04-25T15:13:44.936Z"}
{"level":"info","msg":"Loading script source to /argo/staging/script","time":"2023-04-25T15:13:45.744Z"}
{"level":"info","msg":"Start loading input artifacts...","time":"2023-04-25T15:13:45.744Z"}
{"level":"info","msg":"Downloading artifact: my-artifact","time":"2023-04-25T15:13:45.744Z"}
{"level":"info","msg":"Specified artifact path /mount/my_artifact overlaps with volume mount at /mount/. Extracting to volume mount","time":"2023-04-25T15:13:45.744Z"}
{"level":"info","msg":"S3 Load path: /mainctrfs/mount/my_artifact.tmp, key: my-artifact/","time":"2023-04-25T15:13:45.744Z"}
{"endpoint":"minio:9000","level":"info","msg":"Creating minio client using static credentials","time":"2023-04-25T15:13:45.744Z"}
{"bucket":"artifact-bucket","endpoint":"minio:9000","key":"my-artifact/","level":"info","msg":"Getting file from s3","path":"/mainctrfs/mount/my_artifact.tmp","time":"2023-04-25T15:13:45.744Z"}
{"bucket":"artifact-bucket","endpoint":"minio:9000","key":"my-artifact/","level":"info","msg":"Getting directory from s3","path":"/mainctrfs/mount/my_artifact.tmp","time":"2023-04-25T15:13:45.862Z"}
{"bucket":"artifact-bucket","endpoint":"minio:9000","key":"my-artifact/","level":"info","msg":"Listing directory from s3","time":"2023-04-25T15:13:45.862Z"}
{"level":"info","msg":"Transient error: read tcp 100.65.5.246:34096-\u003e100.66.20.114:9000: read: connection reset by peer","time":"2023-04-25T15:13:55.943Z"}
{"level":"info","msg":"S3 Load path: /mainctrfs/mount/my_artifact.tmp, key: my-artifact/","time":"2023-04-25T15:13:57.393Z"}
{"endpoint":"minio:9000","level":"info","msg":"Creating minio client using static credentials","time":"2023-04-25T15:13:57.393Z"}
{"bucket":"artifact-bucket","endpoint":"minio:9000","key":"my-artifact/","level":"info","msg":"Getting file from s3","path":"/mainctrfs/mount/my_artifact.tmp","time":"2023-04-25T15:13:57.393Z"}
{"level":"warning","msg":"Non-transient error: fileName is a directory.","time":"2023-04-25T15:13:57.394Z"}
{"artifactName":"my-artifact","duration":11649843351,"error":"failed to get file: fileName is a directory.","key":"my-artifact/","level":"info","msg":"Load artifact","time":"2023-04-25T15:13:57.394Z"}
{"level":"error","msg":"executor error: artifact my-artifact failed to load: failed to get file: fileName is a directory.","time":"2023-04-25T15:13:57.394Z"}
{"level":"info","msg":"Alloc=8911 TotalAlloc=14291 Sys=23762 NumGC=3 Goroutines=5","time":"2023-04-25T15:13:57.395Z"}
{"level":"fatal","msg":"artifact my-artifact failed to load: failed to get file: fileName is a directory.","time":"2023-04-25T15:13:57.395Z"}
stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If this is a mentoring request, please provide an update here. Thank you for your contributions.

tooptoop4 commented 3 weeks ago

related to https://github.com/argoproj/argo-workflows/issues/9908