Azure / fleet

Multi-cluster core

[BUG] Hard crash and reboot of the hubagent if ClusterResourceSnapshot failed to list due to timeout #717

Open d4rkhunt33r opened 8 months ago

d4rkhunt33r commented 8 months ago

Describe the bug

Hard crash and reboot of the hubagent if ClusterResourceSnapshot failed to list due to timeout

Environment

The hubagent was installed with the following values:

replicaCount: 1
logVerbosity: 1
enableWebhook: false
webhookServiceName: fleetwebhook
enableGuardRail: false
webhookClientConnectionType: service
enableV1Alpha1APIs: false
enableV1Beta1APIs: true
resources:
  requests:
    cpu: 2
    memory: 8Gi
  limits:
    cpu: 4
    memory: 16Gi

To Reproduce

Steps to reproduce the behavior:

You should see the hubagent container rebooting, with the following errors in the logs:


I0308 16:31:41.079738       1 controller/controller.go:190] "Starting controller" controller="cluster-resource-placement-controller-v1beta1"
I0308 16:31:41.147758       1 informer/informermanager.go:152] "Disabled an informer for a disappeared resource" res={"GroupVersionKind":{"Group":"","Version":"v1","Kind":"Event"},"GroupVersionResource":{"Group":"","Version":"v1","Resource":"events"},"IsClusterScoped":false}
W0308 16:32:21.369841       1 cache/reflector.go:535] pkg/mod/k8s.io/client-go@v0.28.3/tools/cache/reflector.go:229: failed to list *v1beta1.ClusterResourceSnapshot: the server was unable to return a response in the time allotted, but may still be processing the request (get clusterresourcesnapshots.placement.kubernetes-fleet.io)
I0308 16:32:21.369941       1 trace/trace.go:236] Trace[1547232315]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@v0.28.3/tools/cache/reflector.go:229 (08-Mar-2024 16:31:21.299) (total time: 60070ms):
Trace[1547232315]: ---"Objects listed" error:the server was unable to return a response in the time allotted, but may still be processing the request (get clusterresourcesnapshots.placement.kubernetes-fleet.io) 60069ms (16:32:21.369)
Trace[1547232315]: [1m0.070028406s] [1m0.070028406s] END
E0308 16:32:21.369964       1 cache/reflector.go:147] pkg/mod/k8s.io/client-go@v0.28.3/tools/cache/reflector.go:229: Failed to watch *v1beta1.ClusterResourceSnapshot: failed to list *v1beta1.ClusterResourceSnapshot: the server was unable to return a response in the time allotted, but may still be processing the request (get clusterresourcesnapshots.placement.kubernetes-fleet.io)

Expected behavior

The pod should not be rebooted

Additional context

I think this could be solved if there were an option to increase the timeout for the Kubernetes client.
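Just to illustrate the kind of option I mean, here is a rough sketch on my side (assuming the hubagent builds its clients from a client-go rest.Config; nothing like this exists in the chart today, and the 5-minute value is only an example):

    import (
        "time"

        "k8s.io/client-go/rest"
        ctrl "sigs.k8s.io/controller-runtime"
    )

    // newHubConfig is a hypothetical helper showing where a configurable
    // client-side request timeout could live. rest.Config.Timeout defaults
    // to zero, which means no client-side timeout at all.
    func newHubConfig() *rest.Config {
        cfg := ctrl.GetConfigOrDie()
        cfg.Timeout = 5 * time.Minute // e.g. exposed as a hubagent flag
        return cfg
    }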

Sorry for my English :S
Haladinoq commented 8 months ago

At the moment I have the same issue in my implementation.

ryanzhang-oss commented 8 months ago

@d4rkhunt33r Thank you for reporting this. Can you share a bit more about the environment setup? Did you use our Helm chart? How much CPU/memory did the hubagent use when you created "a lot of objects"? What is the CPU/memory limit on the EKS API server?

@Haladinoq I wonder what you mean by "in your implementation"?

ryanzhang-oss commented 8 months ago

@d4rkhunt33r You can increase the hubagent's throughput by adding the following parameters to your Helm install, with values like the ones below.

--set concurrentClusterPlacementSyncs=20 \
--set hubAPIQPS=10 \
--set hubAPIBurst=100
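Roughly speaking, hubAPIQPS and hubAPIBurst raise the client-side rate limits on the rest.Config that the hubagent uses to talk to the hub API server, along these lines (an illustrative sketch, not the exact wiring in the chart), while concurrentClusterPlacementSyncs raises the number of concurrent reconcile workers:

    import (
        "k8s.io/client-go/rest"
        ctrl "sigs.k8s.io/controller-runtime"
    )

    // tunedHubConfig is illustrative only: it shows the client-side rate
    // limits that the hubAPIQPS and hubAPIBurst values ultimately tune.
    func tunedHubConfig() *rest.Config {
        cfg := ctrl.GetConfigOrDie()
        cfg.QPS = 10    // steady-state requests per second to the hub API server
        cfg.Burst = 100 // short-term burst allowance above QPS
        return cfg
    }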

d4rkhunt33r commented 8 months ago

@ryanzhang-oss thank you so much for your assistance. I will try increasing those values, but I think the issue is more related to the lack of an option to specify a timeout for the queries to the Kubernetes API server.

You can see in the log messages that the error is related to the ClusterResourceSnapshot list being unable to return a response:

[controller-runtime] log.SetLogger(...) was never called; logs will not be displayed.
Detected at:
    >  goroutine 88 [running]:
    >  runtime/debug.Stack()
    >   /usr/local/go/src/runtime/debug/stack.go:24 +0x65
    >  sigs.k8s.io/controller-runtime/pkg/log.eventuallyFulfillRoot()
    >   /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/log/log.go:60 +0xcd
    >  sigs.k8s.io/controller-runtime/pkg/log.(*delegatingLogSink).Error(0xc0000b2140, {0x1c0bb20, 0xc04ae8dec0}, {0x19bcdd2, 0x20}, {0x0, 0x0, 0x0})
    >   /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/log/deleg.go:139 +0x68
    >  github.com/go-logr/logr.Logger.Error({{0x1c27280?, 0xc0000b2140?}, 0x444091?}, {0x1c0bb20, 0xc04ae8dec0}, {0x19bcdd2, 0x20}, {0x0, 0x0, 0x0})
    >   /go/pkg/mod/github.com/go-logr/logr@v1.3.0/logr.go:305 +0xda
    >  sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.1({0x1c24878?, 0xc00059eaa0?}, 0xc00037e960, {0x1c16440, 0xc0004f7230})
    >   /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:203 +0x186
    >  sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2(0xc00037e960, {0x1c24878?, 0xc00059eaa0}, 0xc00062e640)
    >   /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:208 +0x418
    >  sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start(0xc00037e960, {0x1c24878, 0xc00059eaa0})
    >   /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:234 +0x165
    >  sigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile.func1(0xc0000328e0)
    >   /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/manager/runnable_group.go:223 +0xdb
    >  created by sigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile
    >   /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/manager/runnable_group.go:207 +0x1ad
I0311 22:03:09.868763       1 controller/controller.go:218] "Shutdown signal received, waiting for all workers to finish" controller="cluster-resource-placement-controller-v1beta1"
d4rkhunt33r commented 8 months ago

This is another message that I think can be useful:


E0312 15:09:47.679656       1 workgenerator/controller.go:115] "Failed to sync all the works" err="unexpected behavior which cannot be handled by the controller: Timeout: failed waiting for *v1beta1.ClusterResourceSnapshot Informer to sync" resourceBinding="namespaces-resource-placem-istio-replica-2bfd58b0"
ryanzhang-oss commented 8 months ago

Thanks. I wonder if you can share the entire log and the cluster settings? 30 seconds is a very long time, so I am not sure we should relax it. It seems that the API server on the hub cluster is very overloaded? Can you share its usage?

d4rkhunt33r commented 7 months ago

Hi, I am sorry for the delay, but I was finishing some tasks for the quarter. Currently I am not able to share the entire log because my employer has concerns that some names could give away who the employer is. What I can say is that we have a cluster with hundreds of thousands of objects, mainly Istio CRDs. What we want to achieve is to have a backup of all those objects in a secondary cluster.

The reason the API server takes so long to return the ClusterResourceSnapshots is the amount of objects and the load on the server (a very busy one).

I think it would be very helpful to have a parameter to control the timeout before the controller fails due to not getting a response from the server.

Next week I will try to look at the code to see if I can find the place where this could be implemented, and study the project a little more to see whether this is already possible and I am just missing something.

Sorry, English is not my native language.

d4rkhunt33r commented 6 months ago

@ryanzhang-oss Good morning, I hope you are having a great day.

I am still looking at the issue, and I was not able to find a way to increase either the Kubernetes client timeout or the CacheSyncTimeout.

I was hoping this could be added to the project, in case someone has a large cluster where the cache sync can take more than two minutes:

    // CacheSyncTimeout refers to the time limit set to wait for syncing caches.
    // Defaults to 2 minutes if not set.
    // +optional
    CacheSyncTimeout *time.Duration `json:"cacheSyncTimeout,omitempty"`

d4rkhunt33r commented 6 months ago

I was able to change the CacheSyncTimeout by adding:

    mgr, err := ctrl.NewManager(config, ctrl.Options{
        Scheme: scheme,
        Controller: ctrloption.Controller{
            CacheSyncTimeout: 300 * time.Second,
        },
        // ...
    })

but I am still seeing the error related to the timeout when listing resources:

E0418 20:02:25.727800 1 controller/controller.go:75] "Error returned by the API server" err="the server was unable to return a response in the time allotted, but may still be processing the request (get clusterresourcesnapshots.placement.kubernetes-fleet.io)" fromCache=false reason="Timeout".

Is there a place in the code where I can change the 60-second timeout for requests to the Kubernetes API server?

d4rkhunt33r commented 6 months ago

Hi, I was able to find the issue.

If you create a ClusterResourcePlacement selecting a lot of namespaces to propagate, the resulting ClusterResourceSnapshots are too large for my Kubernetes API server to handle.

That means the timeout was server-side, as you can see in the following log from my Kubernetes cluster:

{
    "encodeGV": "placement.kubernetes-fleet.io/v1beta1",
    "encoder": "{\"name\":\"json\",\"pretty\":\"false\",\"strict\":\"false\",\"yaml\":\"false\"}",
    "name": "versioning"
}
 (23-Apr-2024 15:48:53.028) (total time: 60095ms):
Trace[372133574]: ---"About to start writing response" size:2273981422 58933ms (15:49:51.962)
Trace[372133574]: ---"Write call failed" writer:*gzip.Writer,size:2273981422,firstWrite:true,err:http: Handler timeout 1161ms (15:49:53.123)
Trace[372133574]: [1m0.095898857s] [1m0.095898857s] END

It is basically a timeout while the API server writes the JSON response for the ClusterResourceSnapshot request.

To work around this, I changed the pagination from 500 ClusterResourceSnapshots to 100, but then I hit another issue: the ClusterResourcePlacement was so large that updating the object failed due to the byte-size limit on the update request.
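The kind of change I made looks roughly like this (a hypothetical sketch: the exact place in the fleet code may differ, placementv1beta1 refers to the fleet placement v1beta1 API package, and continue-token handling for later pages is omitted):

    import (
        "context"

        placementv1beta1 "go.goms.io/fleet/apis/placement/v1beta1"
        "sigs.k8s.io/controller-runtime/pkg/client"
    )

    // listSnapshotsPage lowers the LIST page size so each response stays
    // small enough for the API server to serialize before its timeout.
    func listSnapshotsPage(ctx context.Context, c client.Client) (*placementv1beta1.ClusterResourceSnapshotList, error) {
        var snapshots placementv1beta1.ClusterResourceSnapshotList
        if err := c.List(ctx, &snapshots, client.Limit(100)); err != nil { // was 500
            return nil, err
        }
        return &snapshots, nil
    }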

I still think it would be a good idea to be able to change the CacheSyncTimeout for the manager, in order to give it more time to sync caches.

My code is this:

    mgr, err := ctrl.NewManager(config, ctrl.Options{
        Scheme: scheme,
        Controller: ctrloption.Controller{
            CacheSyncTimeout: 300 * time.Second,
        },
        Cache: cache.Options{
            SyncPeriod: &opts.ResyncPeriod.Duration,
            HTTPClient: cacheHttpClient,
        },
        LeaderElection:             opts.LeaderElection.LeaderElect,
        LeaderElectionID:           opts.LeaderElection.ResourceName,
        LeaderElectionNamespace:    opts.LeaderElection.ResourceNamespace,
        LeaderElectionResourceLock: opts.LeaderElection.ResourceLock,
        HealthProbeBindAddress:     opts.HealthProbeAddress,
        Metrics: metricsserver.Options{
            BindAddress: opts.MetricsBindAddress,
        },
        WebhookServer: ctrlwebhook.NewServer(ctrlwebhook.Options{
            Port:    FleetWebhookPort,
            CertDir: FleetWebhookCertDir,
        }),
    })

Adding CacheSyncTimeout: 300 * time.Second gave the controllers more time to finish syncing their caches, and no more restarts happened.

d4rkhunt33r commented 6 months ago

So my next step is to create a ClusterResourcePlacement for each namespace, in the hope of not crashing the server due to the size of the snapshots.

zhiying-lin commented 6 months ago

@d4rkhunt33r thank you for the updates. Just curious how you're using CRP: are you selecting multiple applications with one CRP?

It's better to put one application's configuration into one CRP, so that the size of the selected resources stays under control.

> I still think it would be a good idea to be able to change the CacheSyncTimeout for the manager, in order to give it more time to sync caches.

I think it's a fair request.

ryanzhang-oss commented 6 months ago

CRP is not really designed for this type of scenario unless you are placing the resources on a different cluster each time. I wonder if you might have looked at etcd backup?

ryanzhang-oss commented 6 months ago

> So my next step is to create a ClusterResourcePlacement for each namespace, in the hope of not crashing the server due to the size of the snapshots.

Hi @d4rkhunt33r, I wonder if you will place each namespace on a different cluster, or just copy everything from cluster A to cluster B?

d4rkhunt33r commented 6 months ago

@ryanzhang-oss yes, my objective is to keep two clusters with pretty much all namespaces and their objects synchronized, in order to have an active/active backup of my Istio control plane.

I was able to create the 9000 ClusterResourcePlacements, and the process of syncing the objects between the clusters took 5 hours for the first run.

Pretty much all objects were synchronized correctly, and it takes about a minute for a change in one cluster to be reflected in the replica.

Now I am curious about the observability of the project.

I am not sure if there is a way to know, using metrics, whether both clusters are in sync, or which objects were not synced.

d4rkhunt33r commented 6 months ago

> @d4rkhunt33r thank you for the updates. Just curious how you're using CRP: are you selecting multiple applications with one CRP?
>
> It's better to put one application's configuration into one CRP, so that the size of the selected resources stays under control.
>
> > I still think it would be a good idea to be able to change the CacheSyncTimeout for the manager, in order to give it more time to sync caches.
>
> I think it's a fair request.

Yes, initially I was selecting 9000 namespaces and all their objects with one ClusterResourcePlacement, and it broke the app :).

Now I am creating one ClusterResourcePlacement per namespace, and the application was able to handle the process.
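For anyone following along, each one looks roughly like this (a sketch; the CRP name, namespace name, and member cluster name are placeholders):

    apiVersion: placement.kubernetes-fleet.io/v1beta1
    kind: ClusterResourcePlacement
    metadata:
      name: crp-my-namespace           # placeholder: one CRP per namespace
    spec:
      resourceSelectors:
        - group: ""
          version: v1
          kind: Namespace
          name: my-namespace           # selecting a namespace also propagates what is inside it
      policy:
        placementType: PickFixed
        clusterNames:
          - replica-cluster            # placeholder member cluster name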

About etcd backups, I am using another open source project called Velero. The thing is that a backup gets stale very quickly, and Fleet is something I am hoping can give me something closer to real time, or semi-real time.