argoproj / argo-workflows

Workflow Engine for Kubernetes
https://argo-workflows.readthedocs.io/
Apache License 2.0
14.88k stars 3.17k forks source link

v3.3: Retrieving pod artifacts using Web UI from S3 is slow #8257

Closed chriswilley closed 2 years ago

chriswilley commented 2 years ago

Summary

What happened/what you expected to happen?

After a pod completes, when we click on any input or output artifact (such as the main container log) in the Web UI it takes about a full minute to pull from S3. Previously on v3.1.12 it was very quick. Nothing about the network or EKS configurations have changed.

What version are you running? v3.3.0


Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.

alexec commented 2 years ago

How slow is “slow”?

chriswilley commented 2 years ago

I just timed pulling a 2KB log file and it took 2m 2s to appear.

alexec commented 2 years ago

Have you tested in v3.2?

a3lk4i commented 2 years ago

I've got the same issue and it's only in v3.3.0 and v3.3.1. v3.2.* had no issue.

alexec commented 2 years ago

Let me get this right+

a3lk4i commented 2 years ago

yes, that's right.

alexec commented 2 years ago

Candidates for blame:

alexec commented 2 years ago

Does it always take 2m2s? Are you using browser?

chriswilley commented 2 years ago

I'd have to do more timings, but anecdotally they all take about the same time.

chriswilley commented 2 years ago

And yes to browser. (Google Chrome)

a3lk4i commented 2 years ago

Looks like it. These are the logs from the wait container where the artifacts (container logs in this case) are uploaded

time="2022-04-04T15:58:19.454Z" level=info msg="Starting deadline monitor"
time="2022-04-04T16:01:03.480Z" level=info msg="Main container completed"
time="2022-04-04T16:01:03.480Z" level=info msg="No Script output reference in workflow. Capturing script output ignored"
time="2022-04-04T16:01:03.480Z" level=info msg="Saving logs"
time="2022-04-04T16:01:03.481Z" level=info msg="S3 Save path: /tmp/argo/outputs/logs/main.log, key: <redacted>/main.log"
time="2022-04-04T16:01:03.481Z" level=info msg="Creating minio client using IAM role"
time="2022-04-04T16:01:03.481Z" level=info msg="Saving file to s3" bucket=<redacted> endpoint=s3.amazonaws.com key=<redacted>/main.log path=/tmp/argo/outputs/logs/main.log
time="2022-04-04T16:03:07.717Z" level=info msg="not deleting local artifact" localArtPath=/tmp/argo/outputs/logs/main.log
time="2022-04-04T16:03:07.717Z" level=info msg="Successfully saved file: /tmp/argo/outputs/logs/main.log"
time="2022-04-04T16:03:07.717Z" level=info msg="No output parameters"
time="2022-04-04T16:03:07.717Z" level=info msg="No output artifacts"
time="2022-04-04T16:03:07.736Z" level=info msg="Create workflowtaskresults 201"
time="2022-04-04T16:03:07.737Z" level=info msg="Killing sidecars []"
time="2022-04-04T16:03:07.737Z" level=info msg="Alloc=5688 TotalAlloc=14812 Sys=23506 NumGC=7 Goroutines=10"

The main.log file is about 300KB here

alexec commented 2 years ago

If they take the exact same time, then that hints at some problem with Argo Server in one place.

alexec commented 2 years ago

Can you attach argo-server logs.

a3lk4i commented 2 years ago

During the pod run

time="2022-04-04T15:58:18.145Z" level=info msg="using the default service account for user" subject=CgQzMDYzEgllZy1naXRodWI
time="2022-04-04T16:00:27.204Z" level=info msg="using the default service account for user" subject=CgQzMDYzEgllZy1naXRodWI
time="2022-04-04T16:00:27.212Z" level=info msg="finished unary call with code OK" grpc.code=OK grpc.method=GetWorkflow grpc.service=workflow.WorkflowService grpc.start_time="2022-04-04T16:00:27Z" grpc.time_ms=10.506 span.kind=server system=grpc
time="2022-04-04T16:00:27.547Z" level=info msg="using the default service account for user" subject=CgQzMDYzEgllZy1naXRodWI
time="2022-04-04T16:00:39.579Z" level=info msg="finished streaming call with code OK" grpc.code=OK grpc.method=WorkflowLogs grpc.service=workflow.WorkflowService grpc.start_time="2022-04-04T16:00:27Z" grpc.time_ms=12034.497 span.kind=server system=grpc
time="2022-04-04T16:02:15.428Z" level=info msg="finished streaming call with code OK" grpc.code=OK grpc.method=WatchWorkflows grpc.service=workflow.WorkflowService grpc.start_time="2022-04-04T15:58:18Z" grpc.time_ms=237286.6 span.kind=server system=grpc
time="2022-04-04T16:03:18.092Z" level=info msg="Alloc=12073 TotalAlloc=5976661 Sys=29265 NumGC=7960 Goroutines=93"
time="2022-04-04T16:08:18.092Z" level=info msg="Alloc=11273 TotalAlloc=5978609 Sys=29265 NumGC=7963 Goroutines=93"
time="2022-04-04T16:11:22.522Z" level=info msg="finished streaming call with code OK" grpc.code=OK grpc.method=WatchWorkflows grpc.service=workflow.WorkflowService grpc.start_time="2022-04-04T15:57:36Z" grpc.time_ms=825619.3 span.kind=server system=grpc
time="2022-04-04T15:58:18.145Z" level=info msg="using the default service account for user" subject=CgQzMDYzEgllZy1naXRodWI
time="2022-04-04T16:00:27.204Z" level=info msg="using the default service account for user" subject=CgQzMDYzEgllZy1naXRodWI
time="2022-04-04T16:00:27.212Z" level=info msg="finished unary call with code OK" grpc.code=OK grpc.method=GetWorkflow grpc.service=workflow.WorkflowService grpc.start_time="2022-04-04T16:00:27Z" grpc.time_ms=10.506 span.kind=server system=grpc
time="2022-04-04T16:00:27.547Z" level=info msg="using the default service account for user" subject=CgQzMDYzEgllZy1naXRodWI
time="2022-04-04T16:00:39.579Z" level=info msg="finished streaming call with code OK" grpc.code=OK grpc.method=WorkflowLogs grpc.service=workflow.WorkflowService grpc.start_time="2022-04-04T16:00:27Z" grpc.time_ms=12034.497 span.kind=server system=grpc
time="2022-04-04T16:02:15.428Z" level=info msg="finished streaming call with code OK" grpc.code=OK grpc.method=WatchWorkflows grpc.service=workflow.WorkflowService grpc.start_time="2022-04-04T15:58:18Z" grpc.time_ms=237286.6 span.kind=server system=grpc
time="2022-04-04T16:03:18.092Z" level=info msg="Alloc=12073 TotalAlloc=5976661 Sys=29265 NumGC=7960 Goroutines=93"
time="2022-04-04T16:08:18.092Z" level=info msg="Alloc=11273 TotalAlloc=5978609 Sys=29265 NumGC=7963 Goroutines=93"
time="2022-04-04T16:11:22.522Z" level=info msg="finished streaming call with code OK" grpc.code=OK grpc.method=WatchWorkflows grpc.service=workflow.WorkflowService grpc.start_time="2022-04-04T15:57:36Z" grpc.time_ms=825619.3 span.kind=server system=grpc

When getting logs from UI

time="2022-04-04T16:59:10.573Z" level=info msg="using the default service account for user" subject=CgQzMDYzEgllZy1naXRodWI
time="2022-04-04T16:59:10.582Z" level=info msg="finished unary call with code OK" grpc.code=OK grpc.method=ListArchivedWorkflowLabelKeys grpc.service=workflowarchive.ArchivedWorkflowService grpc.start_time="2022-04-04T16:59:10Z" grpc.time_ms=14.643 span.kind=server system=grpc
time="2022-04-04T16:59:10.589Z" level=info msg="finished unary call with code OK" grpc.code=OK grpc.method=ListArchivedWorkflows grpc.service=workflowarchive.ArchivedWorkflowService grpc.start_time="2022-04-04T16:59:10Z" grpc.time_ms=18.937 span.kind=server system=grpc
time="2022-04-04T16:59:12.847Z" level=info msg="using the default service account for user" subject=CgQzMDYzEgllZy1naXRodWI
time="2022-04-04T16:59:12.854Z" level=info msg="finished unary call with code OK" grpc.code=OK grpc.method=GetArchivedWorkflow grpc.service=workflowarchive.ArchivedWorkflowService grpc.start_time="2022-04-04T16:59:12Z" grpc.time_ms=9.938 span.kind=server system=grpc
time="2022-04-04T17:00:07.981Z" level=info msg="using the default service account for user" subject=CgQzMDYzEgllZy1naXRodWI
time="2022-04-04T17:00:07.986Z" level=info msg="Download artifact" artifactName=main-logs isInput=false nodeId=<redacted> uid=d8fea3cc-b687-4373-a20a-0a9259157e31
time="2022-04-04T17:00:07.986Z" level=info msg="S3 Load path: /tmp/artifact3631983470, key: <redacted>/main.log"
time="2022-04-04T17:00:07.986Z" level=info msg="Creating minio client using IAM role"
time="2022-04-04T17:00:07.987Z" level=info msg="Getting file from s3" bucket=<redacted> endpoint=s3.amazonaws.com key=<redacted>/main.log path=/tmp/artifact3631983470
time="2022-04-04T17:01:07.986Z" level=info msg="using the default service account for user" subject=CgQzMDYzEgllZy1naXRodWI
time="2022-04-04T17:01:07.990Z" level=info msg="Download artifact" artifactName=main-logs isInput=false nodeId=<redacted> uid=d8fea3cc-b687-4373-a20a-0a9259157e31
time="2022-04-04T17:01:07.990Z" level=info msg="S3 Load path: /tmp/artifact2233602876, key: <redacted>/main.log"
time="2022-04-04T17:01:07.990Z" level=info msg="Creating minio client using IAM role"
time="2022-04-04T17:01:07.990Z" level=info msg="Getting file from s3" bucket=<redacted> endpoint=s3.amazonaws.com key=<redacted>/main.log path=/tmp/artifact2233602876
2022/04/04 17:02:11 http: superfluous response.WriteHeader call from github.com/argoproj/argo-workflows/v3/server/artifacts.(*ArtifactServer).returnArtifact (artifact_server.go:225)
2022/04/04 17:03:10 http: superfluous response.WriteHeader call from github.com/argoproj/argo-workflows/v3/server/artifacts.(*ArtifactServer).returnArtifact (artifact_server.go:225)

Maybe due to the timeouts set on my ingress, I never see the logs

alexec commented 2 years ago

Do you have the same problem when connecting directly, rather than via ingress or LB?

chriswilley commented 2 years ago

All my testing has been done when connecting through kubectl proxy direct to the cluster.

alexec commented 2 years ago

I think we need to do live debugging.

Could you please set-up 30m?

https://bit.ly/book-30m-with-argo-team

chriswilley commented 2 years ago

Zoom is not accepting my password

chriswilley commented 2 years ago

I believe I'm in the waiting room now

alexec commented 2 years ago

Hmm, I don't see your booking, maybe you got booked with someone else. Could you do 10:30am PT tomorrow?

chriswilley commented 2 years ago

Yes

alexec commented 2 years ago

Cool. Are you able to book that slot tomorrow?

chriswilley commented 2 years ago

It says there are no slots available

alexec commented 2 years ago

11am or 11:30am?

chriswilley commented 2 years ago

do you see a booking for me at 1pm EST Wednesday?

chriswilley commented 2 years ago

Are you still available now?

alexec commented 2 years ago

git log --oneline v3.2.9..v3.3.0 -- go.mod

f0fb0d56d chore(deps): bump github.com/minio/minio-go/v7 from 7.0.15 to 7.0.23 (#8003) aea6c3912 chore(deps): bump github.com/gorilla/websocket from 1.4.2 to 1.5.0 (#7991) 89d7cc39d chore(deps): bump github.com/tidwall/gjson from 1.13.0 to 1.14.0 (#7992) 6076524cd chore: Upgrade to Kubernetes v0.23. Fixes #6750 (#7832) 980f2feb7 chore(deps): bump github.com/gavv/httpexpect/v2 from 2.2.0 to 2.3.1 (#7979) 5e45cd95a chore(deps): bump github.com/antonmedv/expr from 1.8.9 to 1.9.0 (#7967) 79fc4a9be (tag: v3.3.0-rc6, tag: v3.3.0-rc5) chore(deps): bump upper.io/db.v3 (#7939) 848effce0 chore(deps): bump github.com/aliyun/aliyun-oss-go-sdk (#7919) 044389b55 chore(deps): bump github.com/aliyun/aliyun-oss-go-sdk (#7901) dc82f3f42 chore(deps): bump github.com/prometheus/client_golang (#7880) 69c5bc79f fix: Fix go-jose dep. Fixes #7814 (#7874) 1d7a17714 chore(deps): bump github.com/soheilhy/cmux from 0.1.4 to 0.1.5 (#7848) 27977070c chore(deps): bump github.com/go-openapi/spec from 0.20.2 to 0.20.4 (#7817) 74d1bbef7 chore(deps): bump cloud.google.com/go/storage from 1.19.0 to 1.20.0 (#7747) 1277f0579 chore(deps): bump gopkg.in/square/go-jose.v2 from 2.5.1 to 2.6.0 (#7740) 7e6f2c0d7 chore(deps): bump github.com/valyala/fasttemplate from 1.1.0 to 1.2.1 (#7727) 877d65697 chore(deps): bump cloud.google.com/go/storage from 1.10.0 to 1.19.0 (#7714) 0f4c48473 chore(deps): bump gopkg.in/go-playground/webhooks.v5 (#7704) 4664aeac4 chore(deps): bump google.golang.org/grpc from v1.38.0 to v1.41.1 (#7658) ecf2ceced chore(deps): bump github.com/grpc-ecosystem/go-grpc-middleware (#7679) 559153417 chore(deps): bump github.com/prometheus/common from 0.26.0 to 0.32.1 (#7660) d7bc12343 chore: Upgrade tidwall/gjson to improve security (#7237) a6ce659f8 feat!: Remove the hidden flag verify from argo submit (#7158) f9e554d26 fix: Relative submodules in git artifacts. Fixes #7141 (#7162) 8d7ca73b0 feat: Upgrade to Golang 1.17 (#7029) d815031dc chore: Upgrade argo event 1.15 (#6932) 606011b1c build: simplify using Golang 1.16 (#6893) c962bb189 fix: upgrade sprig to v3.2.2 (#6876)

alexec commented 2 years ago

rm

alexec commented 2 years ago

@chriswilley I've created an engineering build that adds additional informational logging. This should help diagnose where the problem is. It will take 1h for the image to be published. The tag will be :dev-log. Can you also test with :latest to make sure that this issue has not been accidentally fixed?

You only need to change the argo-server image tag. Leave other configration unchanged.

chriswilley commented 2 years ago

Will do. Thanks.

tooptoop4 commented 2 years ago

I think I found the cause minio-go 7.0.23 introduced a regression that is fixed in 7.0.24: https://github.com/minio/minio-go/pull/1626

alexec commented 2 years ago

8401

alexec commented 2 years ago

I've cherry-picked https://github.com/argoproj/argo-workflows/commit/a0148c1b32fef820a0cde5a5fed1975abedb7f82 to fix https://github.com/argoproj/argo-workflows/issues/8257.

alexec commented 2 years ago

Will be fixed in v3.3.2