argoproj / argo-workflows

Workflow Engine for Kubernetes
https://argo-workflows.readthedocs.io/
Apache License 2.0
15.06k stars 3.2k forks source link

3.6.0-rc1: Controller panics due to metrics round-tripper data race #13637

Closed meln5674 closed 1 month ago

meln5674 commented 1 month ago

Pre-requisites

What happened? What did you expect to happen?

These two roundtripper wrappers, instantiated here introduce a data race.

To demonstrate, apply this modification to the Makefile

--- a/Makefile
+++ b/Makefile
@@ -223,9 +223,9 @@ controller: dist/workflow-controller
 dist/workflow-controller: $(CONTROLLER_PKG_FILES) go.sum
 ifeq ($(shell uname -s),Darwin)
        # if local, then build fast: use CGO and dynamic-linking
-       go build -gcflags '${GCFLAGS}' -v -ldflags '${LDFLAGS}' -o $@ ./cmd/workflow-controller
+       go build -race -gcflags '${GCFLAGS}' -v -ldflags '${LDFLAGS}' -o $@ ./cmd/workflow-controller
 else
-       CGO_ENABLED=0 go build -gcflags '${GCFLAGS}' -v -ldflags '${LDFLAGS} -extldflags -static' -o $@ ./cmd/workflow-controller
+       go build -race -gcflags '${GCFLAGS}' -v -ldflags '${LDFLAGS} -extldflags -static' -o $@ ./cmd/workflow-controller
 endif

Then execute make test-functional E2E_ENV_FACTOR=2 (make sure to show all logs from make start)

Stack trace of the data race follow below.

This PR adds E2E tests that currently fail, but pass with those roundtripper wrappers commented out. When the race detector is not enabled, the workflow controller panics as follows intermittently during those tests when cleaning up multiple daemon pods in rapid succession:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x4edb27]

goroutine 1070 [running]:
io.ReadAtLeast({0x0, 0x0}, {0xc0015076d0, 0x4, 0x4}, 0x4)
    /usr/local/go/src/io/io.go:335 +0x67
io.ReadFull(...)
    /usr/local/go/src/io/io.go:354
encoding/binary.Read({0x0, 0x0}, {0x2c8e900, 0x41ebf00}, {0x22c3be0, 0xc000be5e24})
    /usr/local/go/src/encoding/binary/binary.go:244 +0xa5
github.com/moby/spdystream/spdy.(*Framer).ReadFrame(0xc001312fc0)
    /home/andrew/git/src/github.com/meln5674/argo-workflows/vendor/github.com/moby/spdystream/spdy/read.go:164 +0x49
github.com/moby/spdystream.(*idleAwareFramer).ReadFrame(0xc00092c140)
    /home/andrew/git/src/github.com/meln5674/argo-workflows/vendor/github.com/moby/spdystream/connection.go:167 +0x1e
github.com/moby/spdystream.(*Connection).Serve(0xc0008ba2c0, 0xc000727200)
    /home/andrew/git/src/github.com/meln5674/argo-workflows/vendor/github.com/moby/spdystream/connection.go:341 +0x265
created by k8s.io/apimachinery/pkg/util/httpstream/spdy.newConnection in goroutine 251
    /home/andrew/git/src/github.com/meln5674/argo-workflows/vendor/k8s.io/apimachinery/pkg/util/httpstream/spdy/connection.go:94 +0x127

Both the data race and panic still occur when reverting the changes in that PR other than the additional tests.

P.S. I observed several other data races when I enabled the race detector, I strongly suggest using race detector-enabled binaries in CI. There may be several more latent issues hiding.

Version(s)

ce7f9bfb9b45f009b3e85fabe5e6410de23c7c5f

Paste a minimal workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflows that uses private images.

N/A

Logs from the workflow controller

==================
WARNING: DATA RACE
Write at 0x00c000ce6008 by goroutine 222:
  k8s.io/apimachinery/pkg/util/httpstream/spdy.(*SpdyRoundTripper).RoundTrip()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/vendor/k8s.io/apimachinery/pkg/util/httpstream/spdy/roundtripper.go:356 +0x7a4
  github.com/argoproj/argo-workflows/v3/util/logs.k8sLogRoundTripper.RoundTrip()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/util/logs/log-k8s-requests.go:23 +0x81
  github.com/argoproj/argo-workflows/v3/util/logs.(*k8sLogRoundTripper).RoundTrip()
      <autogenerated>:1 +0x5a
  github.com/argoproj/argo-workflows/v3/workflow/metrics.metricsRoundTripper.RoundTrip()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/workflow/metrics/metrics_k8s_request.go:53 +0x91
  github.com/argoproj/argo-workflows/v3/workflow/metrics.(*metricsRoundTripper).RoundTrip()
      <autogenerated>:1 +0x97
  k8s.io/client-go/transport.(*userAgentRoundTripper).RoundTrip()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/vendor/k8s.io/client-go/transport/round_trippers.go:168 +0x4f2
  net/http.send()
      /usr/local/go/src/net/http/client.go:259 +0x9a6
  net/http.(*Client).send()
      /usr/local/go/src/net/http/client.go:180 +0x14e
  net/http.(*Client).do()
      /usr/local/go/src/net/http/client.go:725 +0x11b5
  net/http.(*Client).Do()
      /usr/local/go/src/net/http/client.go:590 +0x6a6
  k8s.io/client-go/rest.(*Request).request()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/vendor/k8s.io/client-go/rest/request.go:1023 +0x3ac
  k8s.io/client-go/rest.(*Request).Do()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/vendor/k8s.io/client-go/rest/request.go:1063 +0x119
  k8s.io/client-go/kubernetes/typed/core/v1.(*serviceAccounts).Get()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/vendor/k8s.io/client-go/kubernetes/typed/core/v1/serviceaccount.go:81 +0x25d
  github.com/google/go-containerregistry/pkg/authn/kubernetes.New()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/vendor/github.com/google/go-containerregistry/pkg/authn/kubernetes/keychain.go:75 +0x381
  github.com/google/go-containerregistry/pkg/authn/k8schain.New()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/vendor/github.com/google/go-containerregistry/pkg/authn/k8schain/k8schain.go:43 +0x9e
  github.com/argoproj/argo-workflows/v3/workflow/controller/entrypoint.(*containerRegistryIndex).Lookup()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/workflow/controller/entrypoint/container_registry_index.go:18 +0x2ab
  github.com/argoproj/argo-workflows/v3/workflow/controller/entrypoint.chainIndex.Lookup()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/workflow/controller/entrypoint/chain_index.go:12 +0x122
  github.com/argoproj/argo-workflows/v3/workflow/controller/entrypoint.(*chainIndex).Lookup()
      <autogenerated>:1 +0xe4
  github.com/argoproj/argo-workflows/v3/workflow/controller/entrypoint.(*cacheIndex).Lookup()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/workflow/controller/entrypoint/cache_index.go:21 +0x41a
  github.com/argoproj/argo-workflows/v3/workflow/controller.(*wfOperationCtx).createWorkflowPod()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/workflow/controller/workflowpod.go:395 +0x6105
  github.com/argoproj/argo-workflows/v3/workflow/controller.(*wfOperationCtx).executeContainer()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/workflow/controller/operator.go:2833 +0x506
  github.com/argoproj/argo-workflows/v3/workflow/controller.(*wfOperationCtx).executeTemplate()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/workflow/controller/operator.go:2171 +0x51c6
  github.com/argoproj/argo-workflows/v3/workflow/controller.(*wfOperationCtx).executeDAGTask()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/workflow/controller/dag.go:596 +0x34a4
  github.com/argoproj/argo-workflows/v3/workflow/controller.(*wfOperationCtx).executeDAGTask()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/workflow/controller/dag.go:521 +0x42fb
  github.com/argoproj/argo-workflows/v3/workflow/controller.(*wfOperationCtx).executeDAG()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/workflow/controller/dag.go:269 +0x8a8
  github.com/argoproj/argo-workflows/v3/workflow/controller.(*wfOperationCtx).executeTemplate()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/workflow/controller/operator.go:2181 +0x4dfd
  github.com/argoproj/argo-workflows/v3/workflow/controller.(*wfOperationCtx).operate()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/workflow/controller/operator.go:368 +0x2dab
  github.com/argoproj/argo-workflows/v3/workflow/controller.(*WorkflowController).processNextItem()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/workflow/controller/controller.go:891 +0xefe
  github.com/argoproj/argo-workflows/v3/workflow/controller.(*WorkflowController).runWorker()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/workflow/controller/controller.go:806 +0xec
  github.com/argoproj/argo-workflows/v3/workflow/controller.(*WorkflowController).runWorker-fm()
      <autogenerated>:1 +0x33
  k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:226 +0x41
  k8s.io/apimachinery/pkg/util/wait.BackoffUntil()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:227 +0xc4
  k8s.io/apimachinery/pkg/util/wait.JitterUntil()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:204 +0x10a
  k8s.io/apimachinery/pkg/util/wait.Until()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:161 +0x4e
  github.com/argoproj/argo-workflows/v3/workflow/controller.(*WorkflowController).Run.gowrap18()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/workflow/controller/controller.go:370 +0x17

Previous write at 0x00c000ce6008 by goroutine 262:
  k8s.io/apimachinery/pkg/util/httpstream/spdy.(*SpdyRoundTripper).RoundTrip()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/vendor/k8s.io/apimachinery/pkg/util/httpstream/spdy/roundtripper.go:356 +0x7a4
  github.com/argoproj/argo-workflows/v3/util/logs.k8sLogRoundTripper.RoundTrip()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/util/logs/log-k8s-requests.go:23 +0x81
  github.com/argoproj/argo-workflows/v3/util/logs.(*k8sLogRoundTripper).RoundTrip()
      <autogenerated>:1 +0x5a
  github.com/argoproj/argo-workflows/v3/workflow/metrics.metricsRoundTripper.RoundTrip()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/workflow/metrics/metrics_k8s_request.go:53 +0x91
  github.com/argoproj/argo-workflows/v3/workflow/metrics.(*metricsRoundTripper).RoundTrip()
      <autogenerated>:1 +0x97
  k8s.io/client-go/transport.(*userAgentRoundTripper).RoundTrip()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/vendor/k8s.io/client-go/transport/round_trippers.go:168 +0x4f2
  net/http.send()
      /usr/local/go/src/net/http/client.go:259 +0x9a6
  net/http.(*Client).send()
      /usr/local/go/src/net/http/client.go:180 +0x14e
  net/http.(*Client).do()
      /usr/local/go/src/net/http/client.go:725 +0x11b5
  net/http.(*Client).Do()
      /usr/local/go/src/net/http/client.go:590 +0x6a6
  k8s.io/client-go/rest.(*Request).request()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/vendor/k8s.io/client-go/rest/request.go:1023 +0x3ac
  k8s.io/client-go/rest.(*Request).Do()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/vendor/k8s.io/client-go/rest/request.go:1063 +0x119
  k8s.io/client-go/kubernetes/typed/core/v1.(*events).CreateWithEventNamespace()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/vendor/k8s.io/client-go/kubernetes/typed/core/v1/event_expansion.go:59 +0x251
  k8s.io/client-go/kubernetes/typed/core/v1.(*EventSinkImpl).Create()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/vendor/k8s.io/client-go/kubernetes/typed/core/v1/event_expansion.go:160 +0x47
  k8s.io/client-go/tools/record.recordEvent()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/vendor/k8s.io/client-go/tools/record/event.go:340 +0x1e5
  k8s.io/client-go/tools/record.(*eventBroadcasterImpl).recordToSink()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/vendor/k8s.io/client-go/tools/record/event.go:302 +0x2a9
  k8s.io/client-go/tools/record.(*eventBroadcasterImpl).StartRecordingToSink.func1()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/vendor/k8s.io/client-go/tools/record/event.go:279 +0x64
  k8s.io/client-go/tools/record.(*eventBroadcasterImpl).StartEventWatcher.func1()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/vendor/k8s.io/client-go/tools/record/event.go:414 +0x1a2

Goroutine 222 (running) created at:
  github.com/argoproj/argo-workflows/v3/workflow/controller.(*WorkflowController).Run()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/workflow/controller/controller.go:370 +0x29d1
  main.NewRootCommand.func1.1.gowrap1()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/cmd/workflow-controller/main.go:152 +0xaa

Goroutine 262 (running) created at:
  k8s.io/client-go/tools/record.(*eventBroadcasterImpl).StartEventWatcher()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/vendor/k8s.io/client-go/tools/record/event.go:400 +0x1b0
  k8s.io/client-go/tools/record.(*eventBroadcasterImpl).StartRecordingToSink()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/vendor/k8s.io/client-go/tools/record/event.go:277 +0x1f9
  github.com/argoproj/argo-workflows/v3/workflow/events.(*eventRecorderManager).Get()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/workflow/events/event_recorder_manager.go:71 +0x28a
  github.com/argoproj/argo-workflows/v3/workflow/controller.newWorkflowOperationCtx()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/workflow/controller/operator.go:166 +0x451
  github.com/argoproj/argo-workflows/v3/workflow/controller.(*WorkflowController).processNextItem()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/workflow/controller/controller.go:864 +0x9d6
  github.com/argoproj/argo-workflows/v3/workflow/controller.(*WorkflowController).runWorker()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/workflow/controller/controller.go:806 +0xec
  github.com/argoproj/argo-workflows/v3/workflow/controller.(*WorkflowController).runWorker-fm()
      <autogenerated>:1 +0x33
  k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:226 +0x41
  k8s.io/apimachinery/pkg/util/wait.BackoffUntil()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:227 +0xc4
  k8s.io/apimachinery/pkg/util/wait.JitterUntil()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:204 +0x10a
  k8s.io/apimachinery/pkg/util/wait.Until()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:161 +0x4e
  github.com/argoproj/argo-workflows/v3/workflow/controller.(*WorkflowController).Run.gowrap18()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/workflow/controller/controller.go:370 +0x17
==================

Logs from in your workflow's wait container

N/A
agilgur5 commented 1 month ago

P.S. I observed several other data races when I enabled the race detector, I strongly suggest using race detector-enabled binaries in CI. There may be several more latent issues hiding.

👀 I didn't know you could build -race, thought it was only test -race

cc @isubasinghe as we've talked about races many times before (e.g. https://github.com/argoproj/argo-workflows/issues/10807#issuecomment-2090959022) and @Joibel regarding metrics

meln5674 commented 1 month ago

👀 I didn't know you could build -race, thought it was only test -race

Yep, build -cover too, so you can get coverage data from e2e as well unit tests.

meln5674 commented 1 month ago

Upon further testing, it is specifically the metrics RoundTripper that causes the panic, but the logs one alone produces this other data race.

==================
WARNING: DATA RACE
Read at 0x00c0006a8ee8 by goroutine 177:
  github.com/argoproj/argo-workflows/v3/workflow/metrics.(*Metrics).NewUnfinishedWorkSecondsMetric()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/workflow/metrics/work_queue.go:268 +0x193
  k8s.io/client-go/util/workqueue.(*queueMetricsFactory).newQueueMetrics()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/vendor/k8s.io/client-go/util/workqueue/metrics.go:240 +0x19d
  k8s.io/client-go/util/workqueue.newQueueWithConfig()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/vendor/k8s.io/client-go/util/workqueue/queue.go:88 +0x153
  k8s.io/client-go/util/workqueue.NewWithConfig()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/vendor/k8s.io/client-go/util/workqueue/queue.go:59 +0x157
  k8s.io/client-go/util/workqueue.NewDelayingQueueWithConfig()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/vendor/k8s.io/client-go/util/workqueue/delaying_queue.go:67 +0x9f
  k8s.io/client-go/util/workqueue.NewRateLimitingQueueWithConfig()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/vendor/k8s.io/client-go/util/workqueue/rate_limiting_queue.go:71 +0x15c
  k8s.io/client-go/util/workqueue.NewNamedRateLimitingQueue()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/vendor/k8s.io/client-go/util/workqueue/rate_limiting_queue.go:87 +0xf0
  github.com/argoproj/argo-workflows/v3/workflow/metrics.(*Metrics).RateLimiterWithBusyWorkers()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/workflow/metrics/work_queue.go:133 +0x16a
  github.com/argoproj/argo-workflows/v3/workflow/cron.NewCronController()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/workflow/cron/controller.go:84 +0x4bb
  github.com/argoproj/argo-workflows/v3/workflow/controller.(*WorkflowController).runCronController()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/workflow/controller/controller.go:270 +0x3dd
  github.com/argoproj/argo-workflows/v3/workflow/controller.(*WorkflowController).Run.gowrap16()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/workflow/controller/controller.go:365 +0x5d

Previous write at 0x00c0006a8ee8 by goroutine 176:
  github.com/argoproj/argo-workflows/v3/workflow/metrics.(*Metrics).NewUnfinishedWorkSecondsMetric()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/workflow/metrics/work_queue.go:268 +0x271
  k8s.io/client-go/util/workqueue.(*queueMetricsFactory).newQueueMetrics()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/vendor/k8s.io/client-go/util/workqueue/metrics.go:240 +0x19d
  k8s.io/client-go/util/workqueue.newQueueWithConfig()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/vendor/k8s.io/client-go/util/workqueue/queue.go:88 +0x153
  k8s.io/client-go/util/workqueue.NewWithConfig()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/vendor/k8s.io/client-go/util/workqueue/queue.go:59 +0x157
  k8s.io/client-go/util/workqueue.NewDelayingQueueWithConfig()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/vendor/k8s.io/client-go/util/workqueue/delaying_queue.go:67 +0x9f
  k8s.io/client-go/util/workqueue.NewRateLimitingQueueWithConfig()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/vendor/k8s.io/client-go/util/workqueue/rate_limiting_queue.go:71 +0x15c
  k8s.io/client-go/util/workqueue.NewNamedRateLimitingQueue()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/vendor/k8s.io/client-go/util/workqueue/rate_limiting_queue.go:87 +0xf0
  github.com/argoproj/argo-workflows/v3/workflow/metrics.(*Metrics).RateLimiterWithBusyWorkers()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/workflow/metrics/work_queue.go:133 +0x16a
  github.com/argoproj/argo-workflows/v3/workflow/gccontroller.NewController()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/workflow/gccontroller/gc_controller.go:52 +0x67a
  github.com/argoproj/argo-workflows/v3/workflow/controller.(*WorkflowController).runGCcontroller()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/workflow/controller/controller.go:260 +0x1f4
  github.com/argoproj/argo-workflows/v3/workflow/controller.(*WorkflowController).Run.gowrap15()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/workflow/controller/controller.go:364 +0x5d

Goroutine 177 (running) created at:
  github.com/argoproj/argo-workflows/v3/workflow/controller.(*WorkflowController).Run()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/workflow/controller/controller.go:365 +0x281d
  main.NewRootCommand.func1.1.gowrap1()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/cmd/workflow-controller/main.go:151 +0xaa

Goroutine 176 (running) created at:
  github.com/argoproj/argo-workflows/v3/workflow/controller.(*WorkflowController).Run()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/workflow/controller/controller.go:364 +0x272e
  main.NewRootCommand.func1.1.gowrap1()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/cmd/workflow-controller/main.go:151 +0xaa
==================
==================
WARNING: DATA RACE
Read at 0x00c000418510 by goroutine 177:
  runtime.growslice()
      /usr/local/go/src/runtime/slice.go:177 +0x0
  github.com/argoproj/argo-workflows/v3/workflow/metrics.(*Metrics).NewUnfinishedWorkSecondsMetric()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/workflow/metrics/work_queue.go:268 +0x1c9
  k8s.io/client-go/util/workqueue.(*queueMetricsFactory).newQueueMetrics()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/vendor/k8s.io/client-go/util/workqueue/metrics.go:240 +0x19d
  k8s.io/client-go/util/workqueue.newQueueWithConfig()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/vendor/k8s.io/client-go/util/workqueue/queue.go:88 +0x153
  k8s.io/client-go/util/workqueue.NewWithConfig()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/vendor/k8s.io/client-go/util/workqueue/queue.go:59 +0x157
  k8s.io/client-go/util/workqueue.NewDelayingQueueWithConfig()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/vendor/k8s.io/client-go/util/workqueue/delaying_queue.go:67 +0x9f
  k8s.io/client-go/util/workqueue.NewRateLimitingQueueWithConfig()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/vendor/k8s.io/client-go/util/workqueue/rate_limiting_queue.go:71 +0x15c
  k8s.io/client-go/util/workqueue.NewNamedRateLimitingQueue()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/vendor/k8s.io/client-go/util/workqueue/rate_limiting_queue.go:87 +0xf0
  github.com/argoproj/argo-workflows/v3/workflow/metrics.(*Metrics).RateLimiterWithBusyWorkers()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/workflow/metrics/work_queue.go:133 +0x16a
  github.com/argoproj/argo-workflows/v3/workflow/cron.NewCronController()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/workflow/cron/controller.go:84 +0x4bb
  github.com/argoproj/argo-workflows/v3/workflow/controller.(*WorkflowController).runCronController()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/workflow/controller/controller.go:270 +0x3dd
  github.com/argoproj/argo-workflows/v3/workflow/controller.(*WorkflowController).Run.gowrap16()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/workflow/controller/controller.go:365 +0x5d

Previous write at 0x00c000418510 by goroutine 176:
  github.com/argoproj/argo-workflows/v3/workflow/metrics.(*Metrics).NewUnfinishedWorkSecondsMetric()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/workflow/metrics/work_queue.go:268 +0x20c
  k8s.io/client-go/util/workqueue.(*queueMetricsFactory).newQueueMetrics()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/vendor/k8s.io/client-go/util/workqueue/metrics.go:240 +0x19d
  k8s.io/client-go/util/workqueue.newQueueWithConfig()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/vendor/k8s.io/client-go/util/workqueue/queue.go:88 +0x153
  k8s.io/client-go/util/workqueue.NewWithConfig()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/vendor/k8s.io/client-go/util/workqueue/queue.go:59 +0x157
  k8s.io/client-go/util/workqueue.NewDelayingQueueWithConfig()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/vendor/k8s.io/client-go/util/workqueue/delaying_queue.go:67 +0x9f
  k8s.io/client-go/util/workqueue.NewRateLimitingQueueWithConfig()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/vendor/k8s.io/client-go/util/workqueue/rate_limiting_queue.go:71 +0x15c
  k8s.io/client-go/util/workqueue.NewNamedRateLimitingQueue()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/vendor/k8s.io/client-go/util/workqueue/rate_limiting_queue.go:87 +0xf0
  github.com/argoproj/argo-workflows/v3/workflow/metrics.(*Metrics).RateLimiterWithBusyWorkers()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/workflow/metrics/work_queue.go:133 +0x16a
  github.com/argoproj/argo-workflows/v3/workflow/gccontroller.NewController()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/workflow/gccontroller/gc_controller.go:52 +0x67a
  github.com/argoproj/argo-workflows/v3/workflow/controller.(*WorkflowController).runGCcontroller()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/workflow/controller/controller.go:260 +0x1f4
  github.com/argoproj/argo-workflows/v3/workflow/controller.(*WorkflowController).Run.gowrap15()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/workflow/controller/controller.go:364 +0x5d

Goroutine 177 (running) created at:
  github.com/argoproj/argo-workflows/v3/workflow/controller.(*WorkflowController).Run()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/workflow/controller/controller.go:365 +0x281d
  main.NewRootCommand.func1.1.gowrap1()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/cmd/workflow-controller/main.go:151 +0xaa

Goroutine 176 (running) created at:
  github.com/argoproj/argo-workflows/v3/workflow/controller.(*WorkflowController).Run()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/workflow/controller/controller.go:364 +0x272e
  main.NewRootCommand.func1.1.gowrap1()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/cmd/workflow-controller/main.go:151 +0xaa
==================
==================
WARNING: DATA RACE
Read at 0x00c0006a8f68 by goroutine 177:
  github.com/argoproj/argo-workflows/v3/workflow/metrics.(*Metrics).NewLongestRunningProcessorSecondsMetric()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/workflow/metrics/work_queue.go:280 +0x193
  k8s.io/client-go/util/workqueue.(*queueMetricsFactory).newQueueMetrics()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/vendor/k8s.io/client-go/util/workqueue/metrics.go:241 +0x1c7
  k8s.io/client-go/util/workqueue.newQueueWithConfig()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/vendor/k8s.io/client-go/util/workqueue/queue.go:88 +0x153
  k8s.io/client-go/util/workqueue.NewWithConfig()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/vendor/k8s.io/client-go/util/workqueue/queue.go:59 +0x157
  k8s.io/client-go/util/workqueue.NewDelayingQueueWithConfig()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/vendor/k8s.io/client-go/util/workqueue/delaying_queue.go:67 +0x9f
  k8s.io/client-go/util/workqueue.NewRateLimitingQueueWithConfig()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/vendor/k8s.io/client-go/util/workqueue/rate_limiting_queue.go:71 +0x15c
  k8s.io/client-go/util/workqueue.NewNamedRateLimitingQueue()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/vendor/k8s.io/client-go/util/workqueue/rate_limiting_queue.go:87 +0xf0
  github.com/argoproj/argo-workflows/v3/workflow/metrics.(*Metrics).RateLimiterWithBusyWorkers()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/workflow/metrics/work_queue.go:133 +0x16a
  github.com/argoproj/argo-workflows/v3/workflow/cron.NewCronController()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/workflow/cron/controller.go:84 +0x4bb
  github.com/argoproj/argo-workflows/v3/workflow/controller.(*WorkflowController).runCronController()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/workflow/controller/controller.go:270 +0x3dd
  github.com/argoproj/argo-workflows/v3/workflow/controller.(*WorkflowController).Run.gowrap16()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/workflow/controller/controller.go:365 +0x5d

Previous write at 0x00c0006a8f68 by goroutine 176:
  github.com/argoproj/argo-workflows/v3/workflow/metrics.(*Metrics).NewLongestRunningProcessorSecondsMetric()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/workflow/metrics/work_queue.go:280 +0x271
  k8s.io/client-go/util/workqueue.(*queueMetricsFactory).newQueueMetrics()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/vendor/k8s.io/client-go/util/workqueue/metrics.go:241 +0x1c7
  k8s.io/client-go/util/workqueue.newQueueWithConfig()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/vendor/k8s.io/client-go/util/workqueue/queue.go:88 +0x153
  k8s.io/client-go/util/workqueue.NewWithConfig()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/vendor/k8s.io/client-go/util/workqueue/queue.go:59 +0x157
  k8s.io/client-go/util/workqueue.NewDelayingQueueWithConfig()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/vendor/k8s.io/client-go/util/workqueue/delaying_queue.go:67 +0x9f
  k8s.io/client-go/util/workqueue.NewRateLimitingQueueWithConfig()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/vendor/k8s.io/client-go/util/workqueue/rate_limiting_queue.go:71 +0x15c
  k8s.io/client-go/util/workqueue.NewNamedRateLimitingQueue()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/vendor/k8s.io/client-go/util/workqueue/rate_limiting_queue.go:87 +0xf0
  github.com/argoproj/argo-workflows/v3/workflow/metrics.(*Metrics).RateLimiterWithBusyWorkers()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/workflow/metrics/work_queue.go:133 +0x16a
  github.com/argoproj/argo-workflows/v3/workflow/gccontroller.NewController()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/workflow/gccontroller/gc_controller.go:52 +0x67a
  github.com/argoproj/argo-workflows/v3/workflow/controller.(*WorkflowController).runGCcontroller()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/workflow/controller/controller.go:260 +0x1f4
  github.com/argoproj/argo-workflows/v3/workflow/controller.(*WorkflowController).Run.gowrap15()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/workflow/controller/controller.go:364 +0x5d

Goroutine 177 (running) created at:
  github.com/argoproj/argo-workflows/v3/workflow/controller.(*WorkflowController).Run()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/workflow/controller/controller.go:365 +0x281d
  main.NewRootCommand.func1.1.gowrap1()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/cmd/workflow-controller/main.go:151 +0xaa

Goroutine 176 (running) created at:
  github.com/argoproj/argo-workflows/v3/workflow/controller.(*WorkflowController).Run()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/workflow/controller/controller.go:364 +0x272e
  main.NewRootCommand.func1.1.gowrap1()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/cmd/workflow-controller/main.go:151 +0xaa
==================
==================
WARNING: DATA RACE
Read at 0x00c0004185d0 by goroutine 177:
  runtime.growslice()
      /usr/local/go/src/runtime/slice.go:177 +0x0
  github.com/argoproj/argo-workflows/v3/workflow/metrics.(*Metrics).NewLongestRunningProcessorSecondsMetric()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/workflow/metrics/work_queue.go:280 +0x1c9
  k8s.io/client-go/util/workqueue.(*queueMetricsFactory).newQueueMetrics()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/vendor/k8s.io/client-go/util/workqueue/metrics.go:241 +0x1c7
  k8s.io/client-go/util/workqueue.newQueueWithConfig()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/vendor/k8s.io/client-go/util/workqueue/queue.go:88 +0x153
  k8s.io/client-go/util/workqueue.NewWithConfig()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/vendor/k8s.io/client-go/util/workqueue/queue.go:59 +0x157
  k8s.io/client-go/util/workqueue.NewDelayingQueueWithConfig()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/vendor/k8s.io/client-go/util/workqueue/delaying_queue.go:67 +0x9f
  k8s.io/client-go/util/workqueue.NewRateLimitingQueueWithConfig()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/vendor/k8s.io/client-go/util/workqueue/rate_limiting_queue.go:71 +0x15c
  k8s.io/client-go/util/workqueue.NewNamedRateLimitingQueue()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/vendor/k8s.io/client-go/util/workqueue/rate_limiting_queue.go:87 +0xf0
  github.com/argoproj/argo-workflows/v3/workflow/metrics.(*Metrics).RateLimiterWithBusyWorkers()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/workflow/metrics/work_queue.go:133 +0x16a
  github.com/argoproj/argo-workflows/v3/workflow/cron.NewCronController()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/workflow/cron/controller.go:84 +0x4bb
  github.com/argoproj/argo-workflows/v3/workflow/controller.(*WorkflowController).runCronController()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/workflow/controller/controller.go:270 +0x3dd
  github.com/argoproj/argo-workflows/v3/workflow/controller.(*WorkflowController).Run.gowrap16()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/workflow/controller/controller.go:365 +0x5d

Previous write at 0x00c0004185d0 by goroutine 176:
  github.com/argoproj/argo-workflows/v3/workflow/metrics.(*Metrics).NewLongestRunningProcessorSecondsMetric()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/workflow/metrics/work_queue.go:280 +0x20c
  k8s.io/client-go/util/workqueue.(*queueMetricsFactory).newQueueMetrics()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/vendor/k8s.io/client-go/util/workqueue/metrics.go:241 +0x1c7
  k8s.io/client-go/util/workqueue.newQueueWithConfig()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/vendor/k8s.io/client-go/util/workqueue/queue.go:88 +0x153
  k8s.io/client-go/util/workqueue.NewWithConfig()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/vendor/k8s.io/client-go/util/workqueue/queue.go:59 +0x157
  k8s.io/client-go/util/workqueue.NewDelayingQueueWithConfig()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/vendor/k8s.io/client-go/util/workqueue/delaying_queue.go:67 +0x9f
  k8s.io/client-go/util/workqueue.NewRateLimitingQueueWithConfig()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/vendor/k8s.io/client-go/util/workqueue/rate_limiting_queue.go:71 +0x15c
  k8s.io/client-go/util/workqueue.NewNamedRateLimitingQueue()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/vendor/k8s.io/client-go/util/workqueue/rate_limiting_queue.go:87 +0xf0
  github.com/argoproj/argo-workflows/v3/workflow/metrics.(*Metrics).RateLimiterWithBusyWorkers()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/workflow/metrics/work_queue.go:133 +0x16a
  github.com/argoproj/argo-workflows/v3/workflow/gccontroller.NewController()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/workflow/gccontroller/gc_controller.go:52 +0x67a
  github.com/argoproj/argo-workflows/v3/workflow/controller.(*WorkflowController).runGCcontroller()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/workflow/controller/controller.go:260 +0x1f4
  github.com/argoproj/argo-workflows/v3/workflow/controller.(*WorkflowController).Run.gowrap15()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/workflow/controller/controller.go:364 +0x5d

Goroutine 177 (running) created at:
  github.com/argoproj/argo-workflows/v3/workflow/controller.(*WorkflowController).Run()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/workflow/controller/controller.go:365 +0x281d
  main.NewRootCommand.func1.1.gowrap1()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/cmd/workflow-controller/main.go:151 +0xaa

Goroutine 176 (running) created at:
  github.com/argoproj/argo-workflows/v3/workflow/controller.(*WorkflowController).Run()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/workflow/controller/controller.go:364 +0x272e
  main.NewRootCommand.func1.1.gowrap1()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/cmd/workflow-controller/main.go:151 +0xaa
==================

Similarly, in addition to the data race in the original post, running with just the metrics RoundTripper produces additional data races

==================
WARNING: DATA RACE
Read at 0x00c000f1e600 by goroutine 197:
  crypto/tls.(*Config).Clone()
      /usr/local/go/src/crypto/tls/common.go:897 +0x95c
  crypto/tls.dial()
      /usr/local/go/src/crypto/tls/tls.go:153 +0x35b
  crypto/tls.(*Dialer).DialContext()
      /usr/local/go/src/crypto/tls/tls.go:219 +0x10a
  k8s.io/apimachinery/pkg/util/httpstream/spdy.(*SpdyRoundTripper).dialWithoutProxy()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/vendor/k8s.io/apimachinery/pkg/util/httpstream/spdy/roundtripper.go:321 +0x304
  k8s.io/apimachinery/pkg/util/httpstream/spdy.(*SpdyRoundTripper).dial()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/vendor/k8s.io/apimachinery/pkg/util/httpstream/spdy/roundtripper.go:177 +0x2b4
  k8s.io/apimachinery/pkg/util/httpstream/spdy.(*SpdyRoundTripper).Dial()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/vendor/k8s.io/apimachinery/pkg/util/httpstream/spdy/roundtripper.go:154 +0x133
  k8s.io/apimachinery/pkg/util/httpstream/spdy.(*SpdyRoundTripper).RoundTrip()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/vendor/k8s.io/apimachinery/pkg/util/httpstream/spdy/roundtripper.go:343 +0x5b2
  github.com/argoproj/argo-workflows/v3/workflow/metrics.metricsRoundTripper.RoundTrip()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/workflow/metrics/metrics_k8s_request.go:53 +0x91
  github.com/argoproj/argo-workflows/v3/workflow/metrics.(*metricsRoundTripper).RoundTrip()
      <autogenerated>:1 +0x97
  k8s.io/client-go/transport.(*userAgentRoundTripper).RoundTrip()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/vendor/k8s.io/client-go/transport/round_trippers.go:168 +0x4f2
  net/http.send()
      /usr/local/go/src/net/http/client.go:259 +0x9a6
  net/http.(*Client).send()
      /usr/local/go/src/net/http/client.go:180 +0x14e
  net/http.(*Client).do()
      /usr/local/go/src/net/http/client.go:725 +0x11b5
  net/http.(*Client).Do()
      /usr/local/go/src/net/http/client.go:590 +0x6a6
  k8s.io/client-go/rest.(*Request).request()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/vendor/k8s.io/client-go/rest/request.go:1023 +0x3ac
  k8s.io/client-go/rest.(*Request).Do()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/vendor/k8s.io/client-go/rest/request.go:1063 +0x119
  github.com/argoproj/argo-workflows/v3/pkg/client/clientset/versioned/typed/workflow/v1alpha1.(*workflows).Update()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/pkg/client/clientset/versioned/typed/workflow/v1alpha1/workflow.go:117 +0x2a7
  github.com/argoproj/argo-workflows/v3/workflow/controller.(*wfOperationCtx).persistUpdates()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/workflow/controller/operator.go:761 +0xaaf
  github.com/argoproj/argo-workflows/v3/workflow/controller.(*wfOperationCtx).operate.func1()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/workflow/controller/operator.go:191 +0x46
  runtime.deferreturn()
      /usr/local/go/src/runtime/panic.go:605 +0x5d
  github.com/argoproj/argo-workflows/v3/workflow/controller.(*WorkflowController).processNextItem()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/workflow/controller/controller.go:891 +0xefe
  github.com/argoproj/argo-workflows/v3/workflow/controller.(*WorkflowController).runWorker()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/workflow/controller/controller.go:806 +0xec
  github.com/argoproj/argo-workflows/v3/workflow/controller.(*WorkflowController).runWorker-fm()
      <autogenerated>:1 +0x33
  k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:226 +0x41
  k8s.io/apimachinery/pkg/util/wait.BackoffUntil()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:227 +0xc4
  k8s.io/apimachinery/pkg/util/wait.JitterUntil()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:204 +0x10a
  k8s.io/apimachinery/pkg/util/wait.Until()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:161 +0x4e
  github.com/argoproj/argo-workflows/v3/workflow/controller.(*WorkflowController).Run.gowrap18()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/workflow/controller/controller.go:370 +0x17

Previous write at 0x00c000f1e600 by goroutine 172:
  k8s.io/client-go/transport.TLSConfigFor()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/vendor/k8s.io/client-go/transport/transport.go:88 +0x24b
  k8s.io/client-go/rest.TLSConfigFor()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/vendor/k8s.io/client-go/rest/transport.go:57 +0x51
  k8s.io/client-go/transport/spdy.RoundTripperFor()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/vendor/k8s.io/client-go/transport/spdy/spdy.go:38 +0x3b
  k8s.io/client-go/tools/remotecommand.NewSPDYExecutor()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/vendor/k8s.io/client-go/tools/remotecommand/spdy.go:46 +0x39
  github.com/argoproj/argo-workflows/v3/workflow/common.ExecPodContainer()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/workflow/common/util.go:93 +0xf84
  github.com/argoproj/argo-workflows/v3/workflow/signal.ExecPodContainerAndGetOutput()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/workflow/signal/signal.go:49 +0xe6
  github.com/argoproj/argo-workflows/v3/workflow/signal.SignalContainer()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/workflow/signal/signal.go:45 +0x764
  github.com/argoproj/argo-workflows/v3/workflow/controller.(*WorkflowController).signalContainers()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/workflow/controller/controller.go:675 +0x1d5
  github.com/argoproj/argo-workflows/v3/workflow/controller.(*WorkflowController).processNextPodCleanupItem.func2()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/workflow/controller/controller.go:564 +0x6dd
  github.com/argoproj/argo-workflows/v3/workflow/controller.(*WorkflowController).processNextPodCleanupItem()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/workflow/controller/controller.go:593 +0x3cf
  github.com/argoproj/argo-workflows/v3/workflow/controller.(*WorkflowController).runPodCleanup()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/workflow/controller/controller.go:539 +0x47
  github.com/argoproj/argo-workflows/v3/workflow/controller.(*WorkflowController).runPodCleanup-fm()
      <autogenerated>:1 +0x21
  k8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:259 +0x46
  k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:226 +0x41
  k8s.io/apimachinery/pkg/util/wait.BackoffUntil()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:227 +0xc4
  k8s.io/apimachinery/pkg/util/wait.JitterUntil()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:204 +0x10a
  k8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:259 +0xd2
  k8s.io/apimachinery/pkg/util/wait.UntilWithContext()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:170 +0x5c
  github.com/argoproj/argo-workflows/v3/workflow/controller.(*WorkflowController).Run.gowrap12()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/workflow/controller/controller.go:359 +0x17

Goroutine 197 (running) created at:
  github.com/argoproj/argo-workflows/v3/workflow/controller.(*WorkflowController).Run()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/workflow/controller/controller.go:370 +0x29d1
  main.NewRootCommand.func1.1.gowrap1()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/cmd/workflow-controller/main.go:151 +0xaa

Goroutine 172 (running) created at:
  github.com/argoproj/argo-workflows/v3/workflow/controller.(*WorkflowController).Run()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/workflow/controller/controller.go:359 +0x2316
  main.NewRootCommand.func1.1.gowrap1()
      /home/andrew/git/src/github.com/meln5674/argo-workflows/cmd/workflow-controller/main.go:151 +0xaa
==================
meln5674 commented 1 month ago

Smoking gun has been identified, will make PR as soon as tests finish running.

isubasinghe commented 1 month ago

cc @isubasinghe as we've talked about races many times before (e.g. #10807 (comment)) and @Joibel regarding metrics

Yeah I am not really surprised by the data races, golang doesn't give enough to catch these at compile time. We should enable -race in CI though, I did actually assume this was already done.

agilgur5 commented 1 month ago

I did actually assume this was already done.

I think we have little to no pre-existing Go SMEs, with a lot of us learning as we go. I know you and Alan have been learning some of the tooling and idioms and might have the most Go knowledge these days

The race detector just gives warnings though, right? Or can we make it error out in CI? Because otherwise we'll still miss the warnings in the logs unless someone parses through them regularly

isubasinghe commented 1 month ago

@agilgur5 it does exit with a non-zero exit code and could be detected in the CI, but I presume its not in CI because of how long it takes? It's basically a fuzzer not a static analysis tool and needs to brute force the search space effectively.

I would personally still have it in CI maybe running on the releases only or on a weekly cadence on main at least.

meln5674 commented 1 month ago

One thing I noticed in running the e2e tests locally is that, by default, the "kit" setup doesn't show logs for the operator, nor does it fail if either the server or controller crash, both of which seem counter to the point of an e2e test. There really ought to be some part of the CI setup that dumps the logs to stdout/err at the end disables restarts and causes the test to fail because there's nothing to actually act upon those tests, unless there's a test that specifically interrogates behavior after a restart.

I think we have little to no pre-existing Go SMEs, with a lot of us learning as we go. I know you and Alan have been learning some of the tooling and idioms and might have the most Go knowledge these days

While it may be a little pompous to describe myself as a Go SME, I have been writing in it professionally for about 6 years at this point, and we leverage Argo in quite a few places at my current employer, so if that experience is of use to you, it would be to my direct benefit to offer it beyond just this one bug, especially considering this particular issue ended up blocking my other PR whose functionality we are eagerly awaiting.

but I presume its not in CI because of how long it takes? It's basically a fuzzer not a static analysis tool and needs to brute force the search space effectively.

The performance hit to the app itself is relatively small on the scale of an automated test, the main issue is that it substantially increases compile times, and also requires a complete separate compile step (including recompiling all dependencies with the detector) compared to the "release" binaries, but I agree that a scheduled fuzz run plus as a gate before any numbered release is the best approach for the immediate future.

agilgur5 commented 1 month ago

but I presume its not in CI because of how long it takes?

That was one reason why I never added it myself, though the original reason is probably that folks didn't know about it. In https://github.com/argoproj/argo-workflows/issues/10807#issuecomment-2090959022, I mentioned running it nightly or weekly. The race detector docs say "memory usage may increase by 5-10x and execution time by 2-20x.". The lower end of that would be feasible per PR.

I would personally still have it in CI maybe running on the releases only or on a weekly cadence on main at least.

I'd suggest nightly to make it easier to narrow down regressions. Not that we'd necessarily check it nightly, but we could back through each nightly run's history at least. I'll write up an issue for this.

It's basically a fuzzer not a static analysis tool and needs to brute force the search space effectively.

That's what I thought before too, but apparently it's not per the docs. I can't seem to find the docs I read earlier that made me think that. It's actually an instrumentation tool, similar to code coverage, and detects races via instrumentation. As such though, it only detects races that happen during run-time and not all theoretical races, so only a subset of conditions that we would be testing for, and we definitely miss some races in tests.

For reference for other folks in this thread, Isitha and I have previously discussed rewriting the core of Argo in a language with a more intelligent compiler, such as in Rust or Pony, which will detect at compile time a lot of issues we've had. "The core" is basically DAG processing / state machine, i.e. #13502. Notably, that actually wouldn't cover the case of this issue, as this is in the networking / metrics stack, which makes heavy usage of k8s libraries that have first-class Go support.

There really ought to be some part of the CI setup that dumps the logs to stdout/err at the end

Dumping the logs does actually already happen on CI

nor does it fail if either the server or controller crash

A test case itself should fail if there's a crash, so this should still result in a test failure.

While it may be a little pompous to describe myself as a Go SME [...] so if that experience is of use to you, it would be to my direct benefit to offer it beyond just this one bug

You and Mason, another recent contributor, both seem to have substantially more Go experience than me at least (Argo is effectively the largest Go project I've ever worked on, and probably the one I've wrote the most Go code for as well, since I tend to be architect and/or reviewer and/or debugger far more often than writer in my day job these days), so I think that's a fair statement to make to me at least as it is self-evident 😅 And yea all contributions are welcome and we are definitely in need of more contributors too! See also the Sustainability Effort. We'd be happy for more help, especially from someone who knows the Go ecosystem well! 🙂

the main issue is that it substantially increases compile times, and also requires a complete separate compile step (including recompiling all dependencies with the detector)

Now that I've re-read the race detector docs a few times over, it sounds like go test -race would be enough for 99% of our tests, as nearly all of them run via go test. So that could simplify the CI part a good bit if we don't need a separate build step.

But if we're writing a scheduled variant instead of a per PR job, it would require a whole new GHA workflow anyway, so a different compile step wouldn't actually be that much more effort either

agilgur5 commented 1 month ago

I would personally still have it in CI maybe running on the releases only or on a weekly cadence on main at least.

I'd suggest nightly to make it easier to narrow down regressions. Not that we'd necessarily check it nightly, but we could back through each nightly run's history at least. I'll write up an issue for this.

Wrote up https://github.com/argoproj/argo-workflows/issues/13661