DataDog / chaos-controller

:monkey: :fire: Datadog Failure Injection System for Kubernetes
Apache License 2.0

User Request: Release Dynamic Targeting behind a feature flag in controller #575

Open baluishere opened 2 years ago

baluishere commented 2 years ago

Note: While chaos-controller is open to the public and we consider all suggestions for improvement, we prioritize feature development that is immediately applicable to chaos engineering initiatives within Datadog. We encourage users to contribute ideas to the repository directly in the form of pull requests!

Is your feature request related to a problem? Please describe.

After deploying controller version 7.2.0, which has dynamic targeting enabled by default, we observed that the controller enters a restart loop (back-off restart) while running experiments (error attached at the bottom). We upgraded from controller version 6.1.0 to 7.2.0. However, when we turn off dynamic targeting by adding staticTargeting: true to the disruption definition, the controller works as expected. We are in the process of evaluating the dynamic targeting feature, but since it is enabled by default in new versions, there is a risk of teams running it without knowing its complete impact. It would be great if dynamic targeting could be released behind a config option in the controller, so that we can block it until teams are confident and/or we iron out the issues we see in our cluster due to dynamic targeting. This would also let us use newer versions of the controller while we sort out dynamic targeting.
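
For reference, a minimal sketch of the workaround, assuming the chaos.datadoghq.com/v1beta1 Disruption API and the staticTargeting field named in this thread; the metadata, selector, and disruption kind are placeholders, not taken from our actual experiments:

```yaml
apiVersion: chaos.datadoghq.com/v1beta1
kind: Disruption
metadata:
  name: example-disruption
  namespace: chaos-engineering
spec:
  selector:
    app: demo            # placeholder label selector
  count: 1
  staticTargeting: true  # opt this experiment out of dynamic targeting
  # ...the actual disruption kind (network, CPU, etc.) would follow here
```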

Describe the solution you'd like

Release dynamic targeting behind a feature flag: a configuration option in configmap.yaml to enable/disable dynamic targeting from the controller side.
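
Purely as an illustration of the request (the key names and nesting below are hypothetical, not an existing controller option), something along these lines in the controller ConfigMap:

```yaml
# configmap.yaml (hypothetical sketch): a controller-level default that
# would force static targeting on all disruptions until teams explicitly
# opt in to dynamic targeting.
controller:
  defaultDisruptionFeatures:
    staticTargeting: true
```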

Describe alternatives you've considered

Open to any other ideas that would enable/disable dynamic targeting from the controller.

Errors seen when running experiments with dynamic targeting:

```
{"level":"info","ts":1660818040268.9556,"caller":"chaos-controller/main.go:267","message":"loading configuration file","config":"/etc/chaos-controller/config.yaml"}
I0818 10:20:41.321192       1 request.go:665] Waited for 1.031103301s due to client-side throttling, not priority and fairness, request: GET:https://172.20.0.1:443/apis/pkg.crossplane.io/v1?timeout=32s
{"level":"info","ts":1660818045975.2166,"caller":"eventbroadcaster/notifiersink.go:40","message":"notifier noop enabled"}
{"level":"info","ts":1660818045978.2427,"caller":"chaos-controller/main.go:424","message":"restarting chaos-controller"}
I0818 10:20:45.978442       1 leaderelection.go:248] attempting to acquire leader lease chaos-engineering-framework/75ec2fa4.datadoghq.com...
I0818 10:21:02.864515       1 leaderelection.go:258] successfully acquired lease chaos-engineering-framework/75ec2fa4.datadoghq.com
I0818 10:21:04.017267       1 request.go:665] Waited for 1.046628643s due to client-side throttling, not priority and fairness, request: GET:https://172.20.0.1:443/apis/athena.aws.crossplane.io/v1alpha1?timeout=32s
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x1444dba]

goroutine 691 [running]:
github.com/DataDog/chaos-controller/controllers.(*DisruptionReconciler).manageInstanceSelectorCache(0xc000824000, 0xc0004e0240)
    /go/src/github.com/gsQ9JVMR/0/DataDog/chaos-controller/controllers/cache_handler.go:514 +0x63a
github.com/DataDog/chaos-controller/controllers.(*DisruptionReconciler).Reconcile(0xc000824000, {0x1b730d8?, 0xc0010c1a70?}, {{{0xc000a4de60?, 0x1b?}, {0xc000a4de20?, 0x20?}}})
    /go/src/github.com/gsQ9JVMR/0/DataDog/chaos-controller/controllers/disruption_controller.go:124 +0x4c5
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile(0xc00082d040, {0x1b730d8, 0xc0010c19b0}, {{{0xc000a4de60?, 0x17c79c0?}, {0xc000a4de20?, 0xc000716d40?}}})
    /go/src/github.com/gsQ9JVMR/0/DataDog/chaos-controller/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:114 +0x222
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc00082d040, {0x1b73030, 0xc000698580}, {0x175aa00?, 0xc00064d940?})
    /go/src/github.com/gsQ9JVMR/0/DataDog/chaos-controller/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:311 +0x2e9
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc00082d040, {0x1b73030, 0xc000698580})
    /go/src/github.com/gsQ9JVMR/0/DataDog/chaos-controller/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:266 +0x1d9
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2()
    /go/src/github.com/gsQ9JVMR/0/DataDog/chaos-controller/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227 +0x85
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2
    /go/src/github.com/gsQ9JVMR/0/DataDog/chaos-controller/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:223 +0x309
```
nathantournant commented 2 years ago

Hi, the bug was found: an inconsistent call to a config pointer was at fault. We have always set that pointer internally, but you probably (and rightfully!) didn't. Here's the bugfix PR; the release should come shortly. We're discussing the flag possibility internally and will get back to you on that.
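
For readers hitting the same trace: a minimal, hypothetical Go sketch of this class of bug, under the assumption described above (a config pointer that internal deployments always set, but external ones may not). None of these names are the actual chaos-controller types:

```go
package main

import "fmt"

// TargetSelectorConfig stands in for a piece of controller configuration
// that internal deployments always populate, but that external deployments
// may leave unset. All names here are hypothetical.
type TargetSelectorConfig struct {
	DynamicTargeting bool
}

// Reconciler stands in for the disruption reconciler.
type Reconciler struct {
	SelectorConfig *TargetSelectorConfig // may be nil in external deployments
}

// manageSelectorCache mirrors the failure mode: dereferencing
// r.SelectorConfig unconditionally panics with a nil pointer dereference
// when the config was never set. The fix is to guard (or default) the
// pointer instead of assuming it exists.
func (r *Reconciler) manageSelectorCache() error {
	if r.SelectorConfig == nil {
		return fmt.Errorf("selector config not set, skipping cache refresh")
	}
	if r.SelectorConfig.DynamicTargeting {
		// refresh the target cache here
	}
	return nil
}

func main() {
	r := &Reconciler{} // SelectorConfig left nil, as in an external deployment
	if err := r.manageSelectorCache(); err != nil {
		fmt.Println("handled:", err) // no panic, error handled gracefully
	}
}
```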

expFlower commented 2 years ago

Amazing, thanks for the prompt feedback @nathantournant! Once the bugfix is merged, we will look to validate it in our environment.

nathantournant commented 2 years ago

This isn't closed; we're working on allowing the default value of StaticTargeting to be set in the controller config, but that is taking a little longer than expected, as the clean fix isn't as straightforward as we'd hoped. Dynamic targeting won't crash as it did in the mentioned bug, though.

expFlower commented 1 year ago

Hi, @baluishere, who was working on this, has left the project. Reviewing this ticket and the latest version of the controller, I can say the issue here is no longer a problem, and we're happy for this to be closed from our side. Thanks!