fluxninja / aperture

Rate limiting, caching, and request prioritization for modern workloads
https://docs.fluxninja.com
Apache License 2.0
642 stars 25 forks source link

Controller trying to interact with a compacted etcd key revision #2477

Closed kklimonda-fn closed 1 year ago

kklimonda-fn commented 1 year ago

We have etcd auto compaction enabled, and today one of our test controllers stopped reporting policy. When we've looked into its logs, the following was found:

{"level":"error","service":"aperture-controller","error":"etcdserver: mvcc: required revision has been compacted","time":"2023-08-07T07:21:21Z","caller":"etcd/watcher/watcher.go79","message":"Etcd watch channel was canceled"}
{"level":"error","service":"aperture-controller","error":"etcdserver: mvcc: required revision has been compacted","time":"2023-08-07T07:21:21Z","caller":"etcd/watcher/watcher.go79","message":"Etcd watch channel was canceled"}
{"level":"error","service":"aperture-controller","error":"etcdserver: mvcc: required revision has been compacted","time":"2023-08-07T07:21:21Z","caller":"etcd/watcher/watcher.go79","message":"Etcd watch channel was canceled"}
[Fx] HOOK OnStop        github.com/fluxninja/aperture/v2/pkg/policies/controlplane.(*PolicyFactory).trackPolicy.func2() executing (caller: github.com/fluxninja/aperture/v2/pkg/policies/controlplane.(*PolicyFactory).trackPolicy)
[Fx] HOOK OnStop        github.com/fluxninja/aperture/v2/pkg/policies/controlplane.(*PolicyFactory).trackPolicy.func2() called by github.com/fluxninja/aperture/v2/pkg/policies/controlplane.(*PolicyFactory).trackPolicy ran successfully in 2.08µs
[Fx] HOOK OnStop        github.com/fluxninja/aperture/v2/pkg/policies/controlplane/runtime.(*Circuit).setup.func2() executing (caller: github.com/fluxninja/aperture/v2/pkg/policies/controlplane/runtime.(*Circuit).setup)
[Fx] HOOK OnStop        github.com/fluxninja/aperture/v2/pkg/policies/controlplane/runtime.(*Circuit).setup.func2() called by github.com/fluxninja/aperture/v2/pkg/policies/controlplane/runtime.(*Circuit).setup ran successfully in 112.336µs
[Fx] HOOK OnStop        github.com/fluxninja/aperture/v2/pkg/policies/controlplane.(*Policy).setupDynamicConfig.func2() executing (caller: github.com/fluxninja/aperture/v2/pkg/policies/controlplane.(*Policy).setupDynamicConfig)
[Fx] HOOK OnStop        github.com/fluxninja/aperture/v2/pkg/policies/controlplane.(*Policy).setupDynamicConfig.func2() called by github.com/fluxninja/aperture/v2/pkg/policies/controlplane.(*Policy).setupDynamicConfig ran successfully in 2.227µs
[Fx] HOOK OnStop        github.com/fluxninja/aperture/v2/pkg/policies/controlplane.(*Policy).setupCircuitJob.func2() executing (caller: github.com/fluxninja/aperture/v2/pkg/policies/controlplane.(*Policy).setupCircuitJob)
[Fx] HOOK OnStop        github.com/fluxninja/aperture/v2/pkg/policies/controlplane.(*Policy).setupCircuitJob.func2() called by github.com/fluxninja/aperture/v2/pkg/policies/controlplane.(*Policy).setupCircuitJob ran successfully in 11.556µs
[Fx] HOOK OnStop        github.com/fluxninja/aperture/v2/pkg/policies/controlplane/components/flowcontrol/rate-limiter.(*rateLimiterSync).setupSync.func2() executing (caller: github.com/fluxninja/aperture/v2/pkg/policies/controlplane/components/flowcontrol/rate-limiter.(*rateLimiterSync).setupSync)
{"level":"info","service":"aperture-controller","root":"system","system":"key","key":"system","event":"Event<EventType: Remove | service1-demo-app>","time":"2023-08-07T07:21:21Z","caller":"pkg/notifiers/fx-driver.go48","message":"key removed"}
{"level":"error","service":"aperture-controller","error":"etcdserver: mvcc: required revision has been compacted","time":"2023-08-07T07:21:21Z","caller":"etcd/watcher/watcher.go79","message":"Etcd watch channel was canceled"}
[Fx] HOOK OnStop        github.com/fluxninja/aperture/v2/pkg/policies/controlplane/components/flowcontrol/rate-limiter.(*rateLimiterSync).setupSync.func2() called by github.com/fluxninja/aperture/v2/pkg/policies/controlplane/components/flowcontrol/rate-limiter.(*rateLimiterSync).setupSync ran successfully in 298.092928ms
[Fx] HOOK OnStop        github.com/fluxninja/aperture/v2/pkg/policies/controlplane/components/flowcontrol/load-scheduler.(*Actuator).setupWriter.func2() executing (caller: github.com/fluxninja/aperture/v2/pkg/policies/controlplane/components/flowcontrol/load-scheduler.(*Actuator).setupWriter)
[Fx] HOOK OnStop        github.com/fluxninja/aperture/v2/pkg/policies/controlplane/components/flowcontrol/load-scheduler.(*Actuator).setupWriter.func2() called by github.com/fluxninja/aperture/v2/pkg/policies/controlplane/components/flowcontrol/load-scheduler.(*Actuator).setupWriter ran successfully in 9.842276ms
[Fx] HOOK OnStop        github.com/fluxninja/aperture/v2/pkg/policies/controlplane/components/query/promql.ModuleForPolicyApp.func1.2() executing (caller: github.com/fluxninja/aperture/v2/pkg/policies/controlplane/components/query/promql.ModuleForPolicyApp.func1)
[Fx] HOOK OnStop        github.com/fluxninja/aperture/v2/pkg/policies/controlplane/components/query/promql.ModuleForPolicyApp.func1.2() called by github.com/fluxninja/aperture/v2/pkg/policies/controlplane/components/query/promql.ModuleForPolicyApp.func1 ran successfully in 320.866µs
[Fx] HOOK OnStop        github.com/fluxninja/aperture/v2/pkg/policies/controlplane/components/flowcontrol/load-scheduler.(*loadSchedulerConfigSync).doSync.func2() executing (caller: github.com/fluxninja/aperture/v2/pkg/policies/controlplane/components/flowcontrol/load-scheduler.(*loadSchedulerConfigSync).doSync)
[Fx] HOOK OnStop        github.com/fluxninja/aperture/v2/pkg/policies/controlplane/components/flowcontrol/load-scheduler.(*loadSchedulerConfigSync).doSync.func2() called by github.com/fluxninja/aperture/v2/pkg/policies/controlplane/components/flowcontrol/load-scheduler.(*loadSchedulerConfigSync).doSync ran successfully in 9.236302ms
[Fx] HOOK OnStop        github.com/fluxninja/aperture/v2/pkg/policies/controlplane/resources/classifier.(*classifierConfigSync).doSync.func2() executing (caller: github.com/fluxninja/aperture/v2/pkg/policies/controlplane/resources/classifier.(*classifierConfigSync).doSync)
[Fx] HOOK OnStop        github.com/fluxninja/aperture/v2/pkg/policies/controlplane/resources/classifier.(*classifierConfigSync).doSync.func2() called by github.com/fluxninja/aperture/v2/pkg/policies/controlplane/resources/classifier.(*classifierConfigSync).doSync ran successfully in 34.70209ms
[Fx] HOOK OnStop        github.com/fluxninja/aperture/v2/pkg/policies/controlplane/resources/fluxmeter.(*fluxMeterConfigSync).doSync.func2() executing (caller: github.com/fluxninja/aperture/v2/pkg/policies/controlplane/resources/fluxmeter.(*fluxMeterConfigSync).doSync)
[Fx] HOOK OnStop        github.com/fluxninja/aperture/v2/pkg/policies/controlplane/resources/fluxmeter.(*fluxMeterConfigSync).doSync.func2() called by github.com/fluxninja/aperture/v2/pkg/policies/controlplane/resources/fluxmeter.(*fluxMeterConfigSync).doSync ran successfully in 29.936897ms
{"level":"error","service":"aperture-controller","error":"etcdserver: no leader","time":"2023-08-07T07:24:05Z","caller":"etcd/watcher/watcher.go79","message":"Etcd watch channel was canceled"}
krdln commented 1 year ago

cc @DariaKunoichi Preliminary analysis results:

I will fix this even before #2464, but since it was one-off and we know a workaround (restart), it's low priority.

krdln commented 1 year ago

@kklimonda-fn
I think the exact error might reoccur in the future (as I'm not 100% of the cause, it smells a bit like etcd brain-split), but with #2502 controller should at least be able to recover. Anyway, let's close and investigate if it reoccurs