NetApp / trident

Storage orchestrator for containers
Apache License 2.0

Could not update Trident controller with node registration. Slow Trident CSI controller. #857

Closed · dmpo1 closed this issue 2 weeks ago

dmpo1 commented 1 year ago

Hi guys, I have a rather old version of Trident (21.10.0) on Kubernetes 1.23.10. It was working well up until yesterday, when it stopped :)

We have a k8s cluster with Trident, using NetApp ONTAP as NFS storage. The problem is that after a restart of the trident-csi daemonset pods, it takes minutes or hours for them to fully start. The pods can't register their nodes with the controller (which I've tried restarting as well).

time="2023-10-11T10:01:24Z" level=debug msg="\n>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>\nPUT https://172.17.51.66:34571/trident/v1/node/xxxx\nHeaders: map[Content-Type:[application/json] X-Request-Id:[24dcb56a-947f-4648-ae73-55ff76a23e75]]\nBody: {\n  \"name\": \"xxxx\",\n  \"ips\": [\n    \"x.x.x.x\",\n    \"x.x.x.x\",\n    \"172.16.1.0\",\n    \"172.17.0.1\"\n  ],\n  \"nodePrep\": {\n    \"enabled\": false\n  }\n}\n--------------------------------------------------------------------------------" requestID=24dcb56a-947f-4648-ae73-55ff76a23e75 requestSource=Internal
time="2023-10-11T10:01:54Z" level=warning msg="Could not update Trident controller with node registration, will retry." error="could not log into the Trident CSI Controller: error communicating with Trident CSI Controller; Put \"https://172.17.51.66:34571/trident/v1/node/xxxx\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)" increment=4.763368615s requestID=24dcb56a-947f-4648-ae73-55ff76a23e75 requestSource=Internal

In the debug logs of the trident-main container in the controller, I see that requests take an enormous amount of time to complete. Here one took almost two minutes and returned a 400 status:

time="2023-10-11T10:03:21Z" level=debug msg="REST API call complete." duration=1m57.453167471s method=PUT requestID=24dcb56a-947f-4648-ae73-55ff76a23e75 requestSource=REST route=AddOrUpdateNode status_code=400 uri=/trident/v1/node/xxx

But after a while some pods managed to register with the same controller (the example below is a different pod/node); here the controller processed the request in about 25 seconds:

time="2023-10-11T12:16:33Z" level=info msg="Added a new node." handler=AddOrUpdateNode node=yyy requestID=b8c4813b-069d-48f9-a005-03f5da61a061 requestSource=REST
time="2023-10-11T12:16:33Z" level=debug msg="REST API call complete." duration=25.81268769s method=PUT requestID=b8c4813b-069d-48f9-a005-03f5da61a061 requestSource=REST route=AddOrUpdateNode status_code=201 uri=/trident/v1/node/yyy

Also, if I run tridentctl get backends -n trident, the command hangs forever. At the same time, I can list backend details using kubectl get tridentbackend -n trident.
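As far as I understand, tridentctl talks to the controller's REST API while kubectl reads the TridentBackend CRs straight from the API server, which would explain why one hangs and the other works. A rough way to confirm this is to time a request against the controller's REST endpoint with a hard timeout, so it fails fast instead of hanging. A sketch, assuming the controller address from the logs above is reachable from where it runs; the /trident/v1/backend route is my assumption, by analogy with the /trident/v1/node route in the logs:

```go
package main

import (
	"crypto/tls"
	"fmt"
	"net/http"
	"time"
)

// Times a single GET against the controller's REST API with a bounded
// timeout, so it fails fast instead of hanging like tridentctl does.
// Address and route are assumptions taken from the logs above.
func main() {
	client := &http.Client{
		Timeout: 2 * time.Minute,
		Transport: &http.Transport{
			TLSClientConfig: &tls.Config{InsecureSkipVerify: true}, // demo only
		},
	}

	start := time.Now()
	resp, err := client.Get("https://172.17.51.66:34571/trident/v1/backend")
	if err != nil {
		fmt.Printf("request failed after %v: %v\n", time.Since(start), err)
		return
	}
	defer resp.Body.Close()
	fmt.Printf("status %d in %v\n", resp.StatusCode, time.Since(start))
}
```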

I don't see any performance issues on the node where the controller pod is running. The etcd database is relatively busy (300 MB in size), but the control-plane nodes have plenty of CPU and memory resources.
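To rule the API-server/etcd path in or out, the same kind of timing can be done against the TridentBackend CRs directly (the path kubectl takes). A sketch using client-go's dynamic client, assuming a local kubeconfig; the group/version/resource is the trident.netapp.io/v1 tridentbackends CRD that kubectl get tridentbackend resolves to:

```go
package main

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load the usual ~/.kube/config; inside a pod you would use
	// rest.InClusterConfig() instead.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := dynamic.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// The CRD behind "kubectl get tridentbackend".
	gvr := schema.GroupVersionResource{
		Group:    "trident.netapp.io",
		Version:  "v1",
		Resource: "tridentbackends",
	}

	start := time.Now()
	list, err := client.Resource(gvr).Namespace("trident").List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	// If this returns quickly while the controller REST call is slow, the
	// bottleneck is in the controller, not in the API server or etcd.
	fmt.Printf("listed %d backends in %v\n", len(list.Items), time.Since(start))
}
```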

Does anyone know what could cause such slowness in the controller (and the 400 status)?

Thanks a lot!

AndreasDeCrinis commented 1 month ago

Have you managed to solve the issue?

sjpeeris commented 3 weeks ago

Hi @dmpo1, can you please let us know if this has been resolved?

sjpeeris commented 2 weeks ago

Closing. Please re-open if you notice this issue with newer versions of Trident.