containers / nri-plugins

A collection of community maintained NRI plugins
https://containers.github.io/nri-plugins/
Apache License 2.0

Balloon policy: failed to look up current node resource topology CR #52

Closed changzhi1990 closed 1 year ago

changzhi1990 commented 1 year ago

Hi, all

I ran into an error while using the balloons policy in NRI. Here are the detailed steps:

ubuntu@xx-icx-2:~/zhi/nri-plugins$ kubectl create -f noderesourcetopology_crd.yaml
ubuntu@xx-icx-2:~/zhi/nri-plugins$ kubectl create -f nri-resource-policy-balloons-deployment.yaml
serviceaccount/nri-resource-policy created
clusterrole.rbac.authorization.k8s.io/nri-resource-policy created
clusterrolebinding.rbac.authorization.k8s.io/nri-resource-policy created
daemonset.apps/nri-resource-policy created
configmap/nri-resource-policy-config created

After that, I checked the logs of the NRI pod:

I0526 01:26:21.484969       1 balloons-policy.go:1285]   - pinning monitoring/alertmanager-main-0/config-reloader to cpuset: 0
I0526 01:26:21.484979       1 balloons-policy.go:1293]   - pinning monitoring/alertmanager-main-0/config-reloader to memory 0
I0526 01:26:21.485034       1 balloons-policy.go:250] Balloon default[0]{Cpus: 0; Mems: 0; mCPU used: 1208; capacity: 1000; max. capacity: 111000; pods: [grafana-6c99df9b7-rh4gs tigera-operator-75b96586c9-l267h calico-apiserver-7fc4c49f5d-p8kz6 alertmanager-main-1 prometheus-adapter-54bdfd5865-98jxz prometheus-operator-54bfcfc7b7-gqgs4 calico-apiserver-7fc4c49f5d-wkbqm prometheus-adapter-54bdfd5865-rfcdt calico-typha-d644bc45c-xxsz5 prometheus-k8s-0 calico-node-nkvxb calico-kube-controllers-557cb7fd8b-2xx8x alertmanager-main-0 node-exporter-vcwvq prometheus-k8s-1 kube-state-metrics-6c9bd6f86f-9c25j blackbox-exporter-7fbcd746bc-8frqq kepler-exporter-gjhsx alertmanager-main-2]; conts: [monitoring/grafana-6c99df9b7-rh4gs/grafana tigera-operator/tigera-operator-75b96586c9-l267h/tigera-operator calico-apiserver/calico-apiserver-7fc4c49f5d-p8kz6/calico-apiserver monitoring/alertmanager-main-1/config-reloader monitoring/alertmanager-main-1/alertmanager monitoring/prometheus-adapter-54bdfd5865-98jxz/prometheus-adapter monitoring/prometheus-operator-54bfcfc7b7-gqgs4/prometheus-operator monitoring/prometheus-operator-54bfcfc7b7-gqgs4/kube-rbac-proxy calico-apiserver/calico-apiserver-7fc4c49f5d-wkbqm/calico-apiserver monitoring/prometheus-adapter-54bdfd5865-rfcdt/prometheus-adapter calico-system/calico-typha-d644bc45c-xxsz5/calico-typha monitoring/prometheus-k8s-0/prometheus monitoring/prometheus-k8s-0/config-reloader calico-system/calico-node-nkvxb/calico-node calico-system/calico-kube-controllers-557cb7fd8b-2xx8x/calico-kube-controllers monitoring/alertmanager-main-0/alertmanager monitoring/alertmanager-main-0/config-reloader monitoring/node-exporter-vcwvq/kube-rbac-proxy monitoring/node-exporter-vcwvq/node-exporter monitoring/prometheus-k8s-1/config-reloader monitoring/prometheus-k8s-1/prometheus monitoring/kube-state-metrics-6c9bd6f86f-9c25j/kube-rbac-proxy-self monitoring/kube-state-metrics-6c9bd6f86f-9c25j/kube-rbac-proxy-main 
monitoring/kube-state-metrics-6c9bd6f86f-9c25j/kube-state-metrics monitoring/blackbox-exporter-7fbcd746bc-8frqq/kube-rbac-proxy monitoring/blackbox-exporter-7fbcd746bc-8frqq/blackbox-exporter monitoring/blackbox-exporter-7fbcd746bc-8frqq/module-configmap-reloader kepler/kepler-exporter-gjhsx/kepler-exporter monitoring/alertmanager-main-2/alertmanager monitoring/alertmanager-main-2/config-reloader]}
I0526 01:26:21.485058       1 resource-manager.go:288] updating topology zone CRDs...
I0526 01:26:21.485069       1 node-resource-topology.go:32] updating node resource topology CR
I0526 01:26:21.485152       1 nri.go:684] <= Synchronize
W0526 01:26:21.487720       1 node-resource-topology.go:64] failed to look up current node resource topology CR: noderesourcetopologies.topology.node.k8s.io "xx-icx-2" is forbidden: User "system:serviceaccount:kube-system:nri-resource-policy" cannot get resource "noderesourcetopologies" in API group "topology.node.k8s.io" at the cluster scope
I0526 01:26:53.186371       1 nri.go:586] => RemoveContainer kube-system/nri-resource-policy-rm7ck:nri-resource-policy-balloons
I0526 01:26:53.186398       1 nri.go:600] <= RemoveContainer
I0526 01:26:53.697159       1 nri.go:548] => RemovePodSandbox kube-system/nri-resource-policy-rm7ck
I0526 01:26:53.697198       1 cache.go:564] removing pod kube-system/nri-resource-policy-rm7ck (940a4c12c96f4852a0569d28daa86c8d7f48044bd84e0fa140e840f2b80d9a4e)
I0526 01:26:53.697215       1 cache.go:1116] saving cache to file '/var/lib/nri-resource-policy/cache'...
I0526 01:26:53.719456       1 nri.go:562] <= RemovePodSandbox

The error message is: failed to look up current node resource topology CR: noderesourcetopologies.topology.node.k8s.io "xx-icx-2" is forbidden: User "system:serviceaccount:kube-system:nri-resource-policy" cannot get resource "noderesourcetopologies" in API group "topology.node.k8s.io" at the cluster scope.

However, this error does not occur with the topology-aware policy.

Does the error affect the balloons policy, or can we just ignore it?

fmuyassarov commented 1 year ago

It looks like you are missing the Role and RoleBinding.

Usually, when I deploy the balloons plugin via kustomize, I get a Role and RoleBinding as well:

$ kustomize build deployment/overlays/balloons/ | kubectl apply -f -

customresourcedefinition.apiextensions.k8s.io/noderesourcetopologies.topology.node.k8s.io created
serviceaccount/nri-resource-policy created
role.rbac.authorization.k8s.io/nri-resource-policy created
clusterrole.rbac.authorization.k8s.io/nri-resource-policy created
rolebinding.rbac.authorization.k8s.io/nri-resource-policy created
clusterrolebinding.rbac.authorization.k8s.io/nri-resource-policy created
configmap/nri-resource-policy-config.default created
daemonset.apps/nri-resource-policy created

The generated manifest also contains the Role and RoleBinding. Could it be that your topology-aware (TA) plugin deployment has the Role and RoleBinding?

apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  annotations:
    api-approved.kubernetes.io: https://github.com/kubernetes/enhancements/pull/1870
    controller-gen.kubebuilder.io/version: v0.11.2
  creationTimestamp: null
  name: noderesourcetopologies.topology.node.k8s.io
spec:
  group: topology.node.k8s.io
  names:
    kind: NodeResourceTopology
    listKind: NodeResourceTopologyList
    plural: noderesourcetopologies
    shortNames:
    - node-res-topo
    singular: noderesourcetopology
  scope: Cluster
  versions:
  - name: v1alpha1
    schema:
      openAPIV3Schema:
        description: NodeResourceTopology describes node resources and their topology.
        properties:
          apiVersion:
            description: 'APIVersion defines the versioned schema of this representation
              of an object. Servers should convert recognized schemas to the latest
              internal value, and may reject unrecognized values. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#resources'
            type: string
          kind:
            description: 'Kind is a string value representing the REST resource this
              object represents. Servers may infer this from the endpoint the client
              submits requests to. Cannot be updated. In CamelCase. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#types-kinds'
            type: string
          metadata:
            type: object
          topologyPolicies:
            items:
              type: string
            type: array
          zones:
            description: ZoneList contains an array of Zone objects.
            items:
              description: Zone represents a resource topology zone, e.g. socket,
                node, die or core.
              properties:
                attributes:
                  description: AttributeList contains an array of AttributeInfo objects.
                  items:
                    description: AttributeInfo contains one attribute of a Zone.
                    properties:
                      name:
                        type: string
                      value:
                        type: string
                    required:
                    - name
                    - value
                    type: object
                  type: array
                costs:
                  description: CostList contains an array of CostInfo objects.
                  items:
                    description: CostInfo describes the cost (or distance) between
                      two Zones.
                    properties:
                      name:
                        type: string
                      value:
                        format: int64
                        type: integer
                    required:
                    - name
                    - value
                    type: object
                  type: array
                name:
                  type: string
                parent:
                  type: string
                resources:
                  description: ResourceInfoList contains an array of ResourceInfo
                    objects.
                  items:
                    description: ResourceInfo contains information about one resource
                      type.
                    properties:
                      allocatable:
                        anyOf:
                        - type: integer
                        - type: string
                        description: Allocatable quantity of the resource, corresponding
                          to allocatable in node status, i.e. total amount of this
                          resource available to be used by pods.
                        pattern: ^(\+|-)?(([0-9]+(\.[0-9]*)?)|(\.[0-9]+))(([KMGTPE]i)|[numkMGTPE]|([eE](\+|-)?(([0-9]+(\.[0-9]*)?)|(\.[0-9]+))))?$
                        x-kubernetes-int-or-string: true
                      available:
                        anyOf:
                        - type: integer
                        - type: string
                        description: Available is the amount of this resource currently
                          available for new (to be scheduled) pods, i.e. Allocatable
                          minus the resources reserved by currently running pods.
                        pattern: ^(\+|-)?(([0-9]+(\.[0-9]*)?)|(\.[0-9]+))(([KMGTPE]i)|[numkMGTPE]|([eE](\+|-)?(([0-9]+(\.[0-9]*)?)|(\.[0-9]+))))?$
                        x-kubernetes-int-or-string: true
                      capacity:
                        anyOf:
                        - type: integer
                        - type: string
                        description: Capacity of the resource, corresponding to capacity
                          in node status, i.e. total amount of this resource that
                          the node has.
                        pattern: ^(\+|-)?(([0-9]+(\.[0-9]*)?)|(\.[0-9]+))(([KMGTPE]i)|[numkMGTPE]|([eE](\+|-)?(([0-9]+(\.[0-9]*)?)|(\.[0-9]+))))?$
                        x-kubernetes-int-or-string: true
                      name:
                        description: Name of the resource.
                        type: string
                    required:
                    - allocatable
                    - available
                    - capacity
                    - name
                    type: object
                  type: array
                type:
                  type: string
              required:
              - name
              - type
              type: object
            type: array
        required:
        - topologyPolicies
        - zones
        type: object
    served: true
    storage: false
  - name: v1alpha2
    schema:
      openAPIV3Schema:
        description: NodeResourceTopology describes node resources and their topology.
        properties:
          apiVersion:
            description: 'APIVersion defines the versioned schema of this representation
              of an object. Servers should convert recognized schemas to the latest
              internal value, and may reject unrecognized values. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#resources'
            type: string
          attributes:
            description: AttributeList contains an array of AttributeInfo objects.
            items:
              description: AttributeInfo contains one attribute of a Zone.
              properties:
                name:
                  type: string
                value:
                  type: string
              required:
              - name
              - value
              type: object
            type: array
          kind:
            description: 'Kind is a string value representing the REST resource this
              object represents. Servers may infer this from the endpoint the client
              submits requests to. Cannot be updated. In CamelCase. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#types-kinds'
            type: string
          metadata:
            type: object
          topologyPolicies:
            description: 'DEPRECATED (to be removed in v1beta1): use top level attributes
              if needed'
            items:
              type: string
            type: array
          zones:
            description: ZoneList contains an array of Zone objects.
            items:
              description: Zone represents a resource topology zone, e.g. socket,
                node, die or core.
              properties:
                attributes:
                  description: AttributeList contains an array of AttributeInfo objects.
                  items:
                    description: AttributeInfo contains one attribute of a Zone.
                    properties:
                      name:
                        type: string
                      value:
                        type: string
                    required:
                    - name
                    - value
                    type: object
                  type: array
                costs:
                  description: CostList contains an array of CostInfo objects.
                  items:
                    description: CostInfo describes the cost (or distance) between
                      two Zones.
                    properties:
                      name:
                        type: string
                      value:
                        format: int64
                        type: integer
                    required:
                    - name
                    - value
                    type: object
                  type: array
                name:
                  type: string
                parent:
                  type: string
                resources:
                  description: ResourceInfoList contains an array of ResourceInfo
                    objects.
                  items:
                    description: ResourceInfo contains information about one resource
                      type.
                    properties:
                      allocatable:
                        anyOf:
                        - type: integer
                        - type: string
                        description: Allocatable quantity of the resource, corresponding
                          to allocatable in node status, i.e. total amount of this
                          resource available to be used by pods.
                        pattern: ^(\+|-)?(([0-9]+(\.[0-9]*)?)|(\.[0-9]+))(([KMGTPE]i)|[numkMGTPE]|([eE](\+|-)?(([0-9]+(\.[0-9]*)?)|(\.[0-9]+))))?$
                        x-kubernetes-int-or-string: true
                      available:
                        anyOf:
                        - type: integer
                        - type: string
                        description: Available is the amount of this resource currently
                          available for new (to be scheduled) pods, i.e. Allocatable
                          minus the resources reserved by currently running pods.
                        pattern: ^(\+|-)?(([0-9]+(\.[0-9]*)?)|(\.[0-9]+))(([KMGTPE]i)|[numkMGTPE]|([eE](\+|-)?(([0-9]+(\.[0-9]*)?)|(\.[0-9]+))))?$
                        x-kubernetes-int-or-string: true
                      capacity:
                        anyOf:
                        - type: integer
                        - type: string
                        description: Capacity of the resource, corresponding to capacity
                          in node status, i.e. total amount of this resource that
                          the node has.
                        pattern: ^(\+|-)?(([0-9]+(\.[0-9]*)?)|(\.[0-9]+))(([KMGTPE]i)|[numkMGTPE]|([eE](\+|-)?(([0-9]+(\.[0-9]*)?)|(\.[0-9]+))))?$
                        x-kubernetes-int-or-string: true
                      name:
                        description: Name of the resource.
                        type: string
                    required:
                    - allocatable
                    - available
                    - capacity
                    - name
                    type: object
                  type: array
                type:
                  type: string
              required:
              - name
              - type
              type: object
            type: array
        required:
        - zones
        type: object
    served: true
    storage: true
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: nri-resource-policy
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: nri-resource-policy
  namespace: kube-system
rules:
- apiGroups:
  - ""
  resources:
  - configmaps
  verbs:
  - get
  - watch
- apiGroups:
  - topology.node.k8s.io
  resources:
  - noderesourcetopologies
  verbs:
  - create
  - get
  - list
  - update
  - delete
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: nri-resource-policy
rules:
- apiGroups:
  - ""
  resources:
  - nodes
  verbs:
  - get
  - watch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: nri-resource-policy
  namespace: kube-system
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: nri-resource-policy
subjects:
- kind: ServiceAccount
  name: nri-resource-policy
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: nri-resource-policy
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: nri-resource-policy
subjects:
- kind: ServiceAccount
  name: nri-resource-policy
  namespace: kube-system
---
apiVersion: v1
data:
  policy: |
    ReservedResources:
      cpu: 750m
    #balloons:
kind: ConfigMap
metadata:
  name: nri-resource-policy-config.default
  namespace: kube-system
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  labels:
    app: nri-resource-policy
  name: nri-resource-policy
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: nri-resource-policy
  template:
    metadata:
      labels:
        app: nri-resource-policy
    spec:
      containers:
      - args:
        - --host-root
        - /host
        - --fallback-config
        - /etc/nri-resource-policy/nri-resource-policy.cfg
        - --pid-file
        - /tmp/nri-resource-policy.pid
        - -metrics-interval
        - 5s
        env:
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
        image: nri-resource-policy-balloons:devel
        imagePullPolicy: Always
        name: nri-resource-policy-balloons
        ports:
        - containerPort: 8891
          hostPort: 8891
          name: metrics
          protocol: TCP
        resources:
          requests:
            cpu: 500m
            memory: 512Mi
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop:
            - ALL
        volumeMounts:
        - mountPath: /var/lib/nri-resource-policy
          name: resource-policydata
        - mountPath: /host/sys
          name: hostsysfs
        - mountPath: /var/run/nri-resource-policy
          name: resource-policysockets
        - mountPath: /etc/nri-resource-policy
          name: resource-policyconfig
        - mountPath: /var/run/nri
          name: nrisockets
      nodeSelector:
        kubernetes.io/os: linux
      serviceAccount: nri-resource-policy
      volumes:
      - hostPath:
          path: /var/lib/nri-resource-policy
          type: DirectoryOrCreate
        name: resource-policydata
      - hostPath:
          path: /sys
          type: Directory
        name: hostsysfs
      - hostPath:
          path: /var/run/nri-resource-policy
        name: resource-policysockets
      - configMap:
          name: nri-resource-policy-config
        name: resource-policyconfig
      - hostPath:
          path: /var/run/nri
          type: Directory
        name: nrisockets
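
One detail in the manifest above may be relevant to the reported error: the CRD declares `scope: Cluster`, while access to `noderesourcetopologies` is granted through a namespaced Role. In Kubernetes RBAC, a Role bound with a RoleBinding only authorizes requests for namespaced resources, which would be consistent with the "forbidden ... at the cluster scope" message. A minimal sketch of one possible workaround follows, assuming the rule is moved into the existing ClusterRole and the names from the manifest above are reused. This is an assumption for illustration, not necessarily what the project's eventual fix did.

```yaml
# Sketch only: extends the existing ClusterRole so that the plugin's
# ServiceAccount can access the cluster-scoped NodeResourceTopology objects.
# Names are reused from the manifest above; placing the rule here is an
# assumption, not a confirmed description of the upstream fix.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: nri-resource-policy
rules:
# Existing rule from the manifest above: read access to nodes.
- apiGroups:
  - ""
  resources:
  - nodes
  verbs:
  - get
  - watch
# Rule moved from the namespaced Role: NodeResourceTopology is cluster-scoped.
- apiGroups:
  - topology.node.k8s.io
  resources:
  - noderesourcetopologies
  verbs:
  - create
  - get
  - list
  - update
  - delete
```

Because the ClusterRoleBinding in the manifest above already binds this ClusterRole to the `nri-resource-policy` ServiceAccount in `kube-system`, no additional binding should be needed under this assumption.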

changzhi1990 commented 1 year ago

Hi @fmuyassarov, thanks for your reply.

I checked the Roles and RoleBindings in my environment:

ubuntu@xx-icx-2:~/zhi/nri-plugins$ kubectl get role -n kube-system
NAME                                             CREATED AT
configmap-editor                                 2023-05-29T00:26:58Z
extension-apiserver-authentication-reader        2023-05-25T07:11:32Z
kube-proxy                                       2023-05-25T07:11:34Z
kubeadm:kubelet-config-1.23                      2023-05-25T07:11:33Z
kubeadm:nodes-kubeadm-config                     2023-05-25T07:11:33Z
prometheus-k8s                                   2023-05-25T08:37:24Z
system::leader-locking-kube-controller-manager   2023-05-25T07:11:32Z
system::leader-locking-kube-scheduler            2023-05-25T07:11:32Z
system:controller:bootstrap-signer               2023-05-25T07:11:32Z
system:controller:cloud-provider                 2023-05-25T07:11:32Z
system:controller:token-cleaner                  2023-05-25T07:11:32Z
ubuntu@xx-icx-2:~/zhi/nri-plugins$ kubectl get rolebinding -n kube-system
NAME                                                ROLE                                                  AGE
calico-apiserver-auth-reader                        Role/extension-apiserver-authentication-reader        3d17h
configmap-editor-binding                            Role/configmap-editor                                 20m
keda-operator-auth-reader                           Role/extension-apiserver-authentication-reader        2d18h
kube-proxy                                          Role/kube-proxy                                       3d17h
kubeadm:kubelet-config-1.23                         Role/kubeadm:kubelet-config-1.23                      3d17h
kubeadm:nodes-kubeadm-config                        Role/kubeadm:nodes-kubeadm-config                     3d17h
metrics-server-auth-reader                          Role/extension-apiserver-authentication-reader        3d17h
prometheus-k8s                                      Role/prometheus-k8s                                   3d16h
resource-metrics-auth-reader                        Role/extension-apiserver-authentication-reader        3d16h
system::extension-apiserver-authentication-reader   Role/extension-apiserver-authentication-reader        3d17h
system::leader-locking-kube-controller-manager      Role/system::leader-locking-kube-controller-manager   3d17h
system::leader-locking-kube-scheduler               Role/system::leader-locking-kube-scheduler            3d17h
system:controller:bootstrap-signer                  Role/system:controller:bootstrap-signer               3d17h
system:controller:cloud-provider                    Role/system:controller:cloud-provider                 3d17h
system:controller:token-cleaner                     Role/system:controller:token-cleaner                  3d17h

I can't find any Role or RoleBinding related to NRI, even though I have already created the resources from noderesourcetopology_crd.yaml and nri-resource-policy-balloons-deployment.yaml:

ubuntu@xx-2:~/zhi/nri-plugins$ kubectl create -f noderesourcetopology_crd.yaml
Error from server (AlreadyExists): error when creating "noderesourcetopology_crd.yaml": customresourcedefinitions.apiextensions.k8s.io "noderesourcetopologies.topology.node.k8s.io" already exists
ubuntu@xx-icx-2:~/zhi/nri-plugins$ kubectl create -f nri-resource-policy-balloons-deployment.yaml
Error from server (AlreadyExists): error when creating "nri-resource-policy-balloons-deployment.yaml": serviceaccounts "nri-resource-policy" already exists
Error from server (AlreadyExists): error when creating "nri-resource-policy-balloons-deployment.yaml": clusterroles.rbac.authorization.k8s.io "nri-resource-policy" already exists
Error from server (AlreadyExists): error when creating "nri-resource-policy-balloons-deployment.yaml": clusterrolebindings.rbac.authorization.k8s.io "nri-resource-policy" already exists
Error from server (AlreadyExists): error when creating "nri-resource-policy-balloons-deployment.yaml": daemonsets.apps "nri-resource-policy" already exists

Do I need to do something more?

fmuyassarov commented 1 year ago

I'm trying to reproduce the issue, will let you know soon.

changzhi1990 commented 1 year ago

I'm trying to reproduce the issue, will let you know soon.

Got it, thanks!

changzhi1990 commented 1 year ago

fmuyassarov has uploaded a PR, so I am closing this issue.