1. Quick Debug Information
2. Issue or feature description
Hello, NVIDIA gpu-operator team.
I'm not sure whether this is the right place to post this issue, since node-feature-discovery is maintained by kubernetes-sigs. If it does not belong here, please let me know.
Recently I received several alerts from my Kubernetes cluster reporting that the API server took a very long time to serve a LIST request from gpu-operator. Here are the alert and the rule I'm using:
Alert:
Long API server 99%-tile Latency
LIST: 29.90 seconds while nfd.k8s-sigs.io/v1alpha1/nodefeatures request.
Rule: histogram_quantile(0.99, sum(rate(apiserver_request_duration_seconds_bucket{subresource!~"(log|exec|portforward|proxy)",verb!~"^(?:CONNECT|WATCHLIST|WATCH)$"} [10m])) WITHOUT (instance)) > 10
I also found that all gpu-operator-node-feature-discovery-worker pods send GET requests to the API server to query the nodefeatures resource (I assume the pods need this to get information about node labels). Here is part of the audit log:
{"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Request","auditID":"df926f36-8c1f-488e-ac88-11690e24660a","stage":"ResponseComplete","requestURI":"/apis/nfd.k8s-sigs.io/v1alpha1/namespaces/gpu-operator/nodefeatures/sra100-033","verb":"get","user":{"username":"system:serviceaccount:gpu-operator:node-feature-discovery","uid":"da2306ea-536f-455d-bf18-817299dd5489","groups":["system:serviceaccounts","system:serviceaccounts:gpu-operator","system:authenticated"],"extra":{"authentication.kubernetes.io/pod-name":["gpu-operator-node-feature-discovery-worker-49qq6"],"authentication.kubernetes.io/pod-uid":["65dfb997-221e-4a5c-92df-7ff111ea6137"]}},"sourceIPs":["75.17.103.53"],"userAgent":"nfd-worker/v0.0.0 (linux/amd64) kubernetes/$Format","objectRef":{"resource":"nodefeatures","namespace":"gpu-operator","name":"sra100-033","apiGroup":"nfd.k8s-sigs.io","apiVersion":"v1alpha1"},"responseStatus":{"metadata":{},"code":200},"requestReceivedTimestamp":"2024-08-07T01:35:20.355504Z","stageTimestamp":"2024-08-07T01:35:20.676700Z","annotations":{"authorization.k8s.io/decision":"allow","authorization.k8s.io/reason":"RBAC: allowed by ClusterRoleBinding \"gpu-operator-node-feature-discovery\" of ClusterRole \"gpu-operator-node-feature-discovery\" to ServiceAccount \"node-feature-discovery/gpu-operator\""}}
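To show how often each client hits the nodefeatures resource and how long each request takes, audit events like the one above can be tallied with a short script. This is only a minimal sketch: it assumes the audit log is stored as one JSON event per line, and the function names `parse_ts` and `summarize` are hypothetical helpers, not part of any Kubernetes tooling.

```python
import json
from collections import defaultdict
from datetime import datetime


def parse_ts(ts: str) -> datetime:
    # Audit timestamps in the event above look like 2024-08-07T01:35:20.355504Z
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S.%fZ")


def summarize(lines, resource="nodefeatures"):
    """Group completed audit events for `resource` by (verb, client name).

    Returns a dict mapping (verb, client) to request count, total latency
    and maximum latency in seconds. Latency is stageTimestamp minus
    requestReceivedTimestamp.
    """
    stats = defaultdict(lambda: {"count": 0, "total_s": 0.0, "max_s": 0.0})
    for line in lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        ref = event.get("objectRef", {})
        # Only count finished requests against the resource of interest.
        if ref.get("resource") != resource:
            continue
        if event.get("stage") != "ResponseComplete":
            continue
        latency = (
            parse_ts(event["stageTimestamp"])
            - parse_ts(event["requestReceivedTimestamp"])
        ).total_seconds()
        # "nfd-worker/v0.0.0 (linux/amd64) ..." -> "nfd-worker"
        client = event.get("userAgent", "").split("/")[0]
        entry = stats[(event["verb"], client)]
        entry["count"] += 1
        entry["total_s"] += latency
        entry["max_s"] = max(entry["max_s"], latency)
    return dict(stats)
```

Passing an open audit log file to `summarize` gives one entry per verb and client, which makes it easy to see whether the slow requests are the workers' `get` calls or `list` calls from another component.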
It seems strange to me that LIST requests take this long when my Kubernetes cluster has only 300 GPU nodes, and I don't understand why the node-feature-discovery-worker pods send a GET request every minute.
Do you have any information about this problem?
If there are any parameters that can be changed or if you could provide any ideas, I would be very grateful.
Thanks!
3. Steps to reproduce the issue
4. Information to attach (optional if deemed irrelevant)
Logs from gpu-operator-node-feature-discovery-worker pods - query every minute