argoproj / argo-workflows

Workflow Engine for Kubernetes
https://argo-workflows.readthedocs.io/
Apache License 2.0
14.98k stars 3.19k forks source link

artifactGC not working with IRSA #9746

Open tooptoop4 opened 2 years ago

tooptoop4 commented 2 years ago

Pre-requisites

What happened/what you expected to happen?

what happened? artifactgc step failed, object still on s3 and workflow did not get archived

what expected? artifactgc step succeed, object removed off s3 and workflow gets archived

Version

v3.4.1

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflows that uses private images.

              spec:
                serviceAccountName: argo-workflows-server
                artifactGC:
                  strategy: OnWorkflowCompletion
                  serviceAccountName: argo-workflows-workflow-controller
                  podMetadata:
                    annotations:
                      eks.amazonaws.com/role-arn: "arn:aws:iam::redact:role/redact..approle"

logs from the artgc pod:

{"log":"time=\"2022-10-04T10:16:46.328Z\" level=info msg=\"S3 Delete artifact: key: redact/main.log\"\n","stream":"stderr","time":"2022-10-04T10:16:46.32863739Z"}
{"log":"time=\"2022-10-04T10:16:46.328Z\" level=info msg=\"Creating minio client using AWS SDK credentials\"\n","stream":"stderr","time":"2022-10-04T10:16:46.328647872Z"}
{"log":"time=\"2022-10-04T10:16:46.573Z\" level=warning msg=\"Non-transient error: NoCredentialProviders: no valid providers in chain. Deprecated.\\n\\tFor verbose messaging see aws.Config.CredentialsChainVerboseErrors\"\n","stream":"stderr","time":"2022-10-04T10:16:46.573355156Z"}

note: for other AWS things to work i set volume/volumemounts/env variables as in https://github.com/argoproj/argo-workflows/discussions/7461 but artifactgc pod doesn't seem to accept that?

Logs from the workflow controller

kubectl logs -n argo deploy/workflow-controller | grep ${workflow}

Logs from in your workflow's wait container

kubectl logs -n argo -c wait -l workflows.argoproj.io/workflow=${workflow},workflow.argoproj.io/phase!=Succeeded

terrytangyuan commented 2 years ago

Can you check if the env/volume has been added to the pod?

sarabala1979 commented 2 years ago

@tooptoop4 Can you try with a simple POD resource that can access the AWS using IRSA?

juliev0 commented 2 years ago

I don't know a lot about the various ways of configuring IRSA but what what we tried to provide was two methods of access in S3:

  1. annotations - your eks.amazonaws.com/role-arn should have been transferred to the Artifact GC Pod
  2. you can configure ServiceAccount for Artifact GC in the Workflow spec
juliev0 commented 2 years ago

Could you do something like what's shown here?

sarabala1979 commented 2 years ago

This is enhancement. MVP version supports metadata. we can enhance to support env too. @tooptoop4 Do you like to submit PR for this enhancement?

juliev0 commented 2 years ago

For anybody working on this, the idea would be to take the ArtifactGC struct - which can be used both on the WorkflowSpec level and the Artifact level - and add the capability to specify environment variables there.

The way ArtifactGC Pods get created currently, for each unique GC strategy, we need to create one unique Pod per unique set of annotations and labels defined in ArtifactGC struct.

So, the functionality in artifact_gc.go is taking all the artifacts defined and grouping them by annotations/labels. What we need to do is add environment variables to this.

tooptoop4 commented 2 years ago

I guess need volumeMounts/volumes too

juliev0 commented 2 years ago

I guess need volumeMounts/volumes too

yes, you're right

AnshulAngaria commented 1 year ago

Hi @terrytangyuan, @sarabala1979, @juliev0, I would like to work on this issue.

juliev0 commented 1 year ago

Hi @terrytangyuan, @sarabala1979, @juliev0, I would like to work on this issue.

thanks!

juliev0 commented 1 year ago

By the way, maybe you already found it but the current logic for grouping artifacts by pod access is here.

juliev0 commented 1 year ago

Unfortunately, there is this issue related to the ArtifactGC test. It seems to be an issue of artifacts often not getting saved to minio. I believe if each Workflow is run manually one at a time it doesn't happen.

juliev0 commented 1 year ago

Unfortunately, there is this issue related to the ArtifactGC test. It seems to be an issue of artifacts often not getting saved to minio. I believe if each Workflow is run manually one at a time it doesn't happen.

@AnshulAngaria I'm putting together a PR in which if the Workflow fails, I don't verify the artifacts. Hopefully, this should take care of it.

juliev0 commented 1 year ago

Something else: If you start up Argo Workflows by running locally, it creates a deployment for minio and port forwards ports 9000 and 9001 to the local machine, so you should be able to open up localhost:9001 in a browser. However, with the latest minio Docker image being pulled, I wasn't able to log in. I think something changed. I found I had to use an old version here (minio/minio:RELEASE.2022-08-13T21-54-44Z) to be able to successfully log in to localhost:9001. I will try to investigate further and/or submit a PR to fix the image tag.

AnshulAngaria commented 1 year ago

@juliev0 I am able to login to minio

juliev0 commented 1 year ago

@juliev0 I am able to login to minio

That's interesting! Thanks for checking. Maybe I'll ask around to see if it's happening to anyone else then.

juliev0 commented 1 year ago

Unfortunately, there is this issue related to the ArtifactGC test. It seems to be an issue of artifacts often not getting saved to minio. I believe if each Workflow is run manually one at a time it doesn't happen.

@AnshulAngaria I'm putting together a PR in which if the Workflow fails, I don't verify the artifacts. Hopefully, this should take care of it.

I have a PR for this here.

juliev0 commented 1 year ago

I have a PR for this here.

PR is now merged

juliev0 commented 1 year ago

@juliev0 I am able to login to minio

I did notice that by default the ImagePullPolicy is not set here, and I believe by default it's "IfNotPresent", so if it was on your machine already you may not have grabbed the latest and greatest. So, I'm not sure if you were actually running the newest one or not. I am asking a colleague to try it out on his end.

AnshulAngaria commented 1 year ago

For anybody working on this, the idea would be to take the ArtifactGC struct - which can be used both on the WorkflowSpec level and the Artifact level - and add the capability to specify environment variables there.

The way ArtifactGC Pods get created currently, for each unique GC strategy, we need to create one unique Pod per unique set of annotations and labels defined in ArtifactGC struct.

So, the functionality in artifact_gc.go is taking all the artifacts defined and grouping them by annotations/labels. What we need to do is add environment variables to this.

Is there anywhere else I need to make change to add support for a new variable in ArtifactGC struct. After adding a new variable and running make codegen, I am not able to reference it if I add an ArtifactGC at artifact level. However, I am able to reference it if I add ArtifactGC at workflow level.

juliev0 commented 1 year ago

@AnshulAngaria Hmm, do you mean you're not able to compile if you try to reference the variable? (and I assume it's in capitals) Or do you mean if you try to test with a Workflow it's not parsing the yaml successfully?