CentaurusInfra / arktos

Arktos for large-scale cloud platform
Apache License 2.0

[Scale out POC] secret not found in kubelet [system tenant only] #1052

Open Sindica opened 3 years ago

Sindica commented 3 years ago

What happened: In a local scale-out test, perf test pods can sometimes only be created in one RP.

Failed RP kubelet log:

I0325 22:11:01.116552    3687 reflector.go:218] Starting reflector *v1.Secret (0s) from object-"system"/"ftblf3-testns"/"default-token-r8stb"
I0325 22:11:01.116586    3687 reflector.go:293] ListAndWatch *v1.Secret. filter bounds []. name object-"system"/"ftblf3-testns"/"default-token-r8stb". Watch page size 0. resync period 0s
E0325 22:11:01.384617    3687 secret.go:199] Couldn't get secret system/ftblf3-testns/default-token-r8stb: secret "default-token-r8stb" not found

Successful RP kubelet log:

I0325 22:11:01.112580    3164 reflector.go:218] Starting reflector *v1.Secret (0s) from object-"system"/"ftblf3-testns"/"default-token-r8stb"
I0325 22:11:01.112615    3164 reflector.go:293] ListAndWatch *v1.Secret. filter bounds []. name object-"system"/"ftblf3-testns"/"default-token-r8stb". Watch page size 0. resync period 0s
I0325 22:11:01.355145    3164 secret.go:212] Received secret system/ftblf3-testns/default-token-r8stb containing (3) pieces of data, 2222 total bytes
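
To compare the two kubelets over a longer run, the log lines above can be classified per secret key. The following is only a minimal sketch: the two regular expressions are derived from the `secret.go` messages quoted in this issue, and the function name `secretStatus` and the status strings are made up for illustration, not part of Arktos.

```go
package main

import (
	"bufio"
	"fmt"
	"regexp"
	"strings"
)

// Patterns derived from the two kubelet log excerpts in this issue.
// They are assumptions about the message format, not an Arktos API.
var (
	failedRe   = regexp.MustCompile(`Couldn't get secret (\S+): `)
	receivedRe = regexp.MustCompile(`Received secret (\S+) containing`)
)

// secretStatus (hypothetical helper) scans kubelet log text and returns,
// for each tenant/namespace/name key, the last outcome observed.
func secretStatus(log string) map[string]string {
	status := make(map[string]string)
	sc := bufio.NewScanner(strings.NewReader(log))
	for sc.Scan() {
		line := sc.Text()
		if m := receivedRe.FindStringSubmatch(line); m != nil {
			status[m[1]] = "received"
		} else if m := failedRe.FindStringSubmatch(line); m != nil {
			status[m[1]] = "not found"
		}
	}
	return status
}

func main() {
	// One line from each RP's kubelet log, copied from the report above.
	logs := map[string]string{
		"failed RP":     `E0325 22:11:01.384617    3687 secret.go:199] Couldn't get secret system/ftblf3-testns/default-token-r8stb: secret "default-token-r8stb" not found`,
		"successful RP": `I0325 22:11:01.355145    3164 secret.go:212] Received secret system/ftblf3-testns/default-token-r8stb containing (3) pieces of data, 2222 total bytes`,
	}
	for rp, text := range logs {
		for key, st := range secretStatus(text) {
			fmt.Printf("%s: %s -> %s\n", rp, key, st)
		}
	}
}
```

Running this over the full kubelet log of each RP would show whether the failures are limited to keys that start with `system/`.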

What you expected to happen: Both RPs should be able to get secret and create running pods.

How to reproduce it (as minimally and precisely as possible): ./hack/arktos-up-scale-out-poc.sh

Anything else we need to know?: The following files might need additional code changes:
- pkg/controller/volume/scheduling/scheduler_binder.go:477
- pkg/kubelet/kubelet.go
- pkg/scheduler/factory/factory.go (Refactor AggregateNodeLister)
- pkg/scheduler/nodeinfo/util.go (// TODO - set rpId in NodeInfo)
- pkg/scheduler/scheduler.go (sched.config.NodeListers[0])

Environment:

Sindica commented 3 years ago

This was observed in the system tenant. I will test with tenants arktos and zeta to see whether it happens there as well.

Sindica commented 3 years ago

I ran the load test locally with tenants arktos and zeta; both were able to schedule pods onto both RPs. I suspect this is a system-tenant-only issue. As we don't have a concrete plan for dealing with system tenant objects, I am lowering the priority.