AliyunContainerService / kube-eventer

kube-eventer emit kubernetes events to sinks
Apache License 2.0
1k stars 275 forks

When a container is OOM-killed, how can node-problem-detector attach pod information to the event it sends to the API server, so that the specific pod that hit OOM can be identified? #219

Closed lee-lib closed 1 year ago

lee-lib commented 2 years ago

According to the Alibaba Cloud Container Service (ACK) documentation at https://help.aliyun.com/knowledge_detail/178479.html, the July 2020 image registry.aliyuncs.com/acs/node-problem-detector:v0.6.3-28-160499f can already attach pod information to OOMKilling events. I deployed my container from this version of the node-problem-detector image, but when I simulate an OOM kill I still cannot get any pod information, only Node-level information. The YAML file is as follows:

    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
      name: node-problem-detector
      namespace: kube-system
      labels:
        app: node-problem-detector
    spec:
      selector:
        matchLabels:
          app: node-problem-detector
      template:
        metadata:
          labels:
            app: node-problem-detector
        spec:
          containers:

The log produced by the simulated OOM is shown below. involvedObject.kind is still Node, so no Pod information is available:

    I0214 18:22:00.008987 1 mysql.go:73] {
      "metadata": {
        "name": "k8s-master.16d39fea6f67823e",
        "namespace": "default",
        "selfLink": "/api/v1/namespaces/default/events/k8s-master.16d39fea6f67823e",
        "uid": "9c9cdb3b-240a-40e8-9db2-b4490dfc4f42",
        "resourceVersion": "8981033",
        "creationTimestamp": "2022-02-14T10:21:58Z",
        "managedFields": [
          {
            "manager": "node-problem-detector",
            "operation": "Update",
            "apiVersion": "v1",
            "time": "2022-02-14T10:21:58Z",
            "fieldsType": "FieldsV1",
            "fieldsV1": {
              "f:count": {},
              "f:firstTimestamp": {},
              "f:involvedObject": { "f:kind": {}, "f:name": {}, "f:uid": {} },
              "f:lastTimestamp": {},
              "f:message": {},
              "f:reason": {},
              "f:source": { "f:component": {}, "f:host": {} },
              "f:type": {}
            }
          }
        ]
      },
      "involvedObject": { "kind": "Node", "name": "k8s-master", "uid": "k8s-master" },
      "reason": "OOMKilling",
      "message": "Memory cgroup out of memory: Kill process 2235 (stress) score 0 or sacrifice child\nKilled process 2235 (stress) total-vm:515612kB, anon-rss:168728kB, file-rss:32kB, shmem-rss:0kB",
      "source": { "component": "kernel-monitor", "host": "k8s-master" },
      "firstTimestamp": "2022-02-14T10:21:58Z",
      "lastTimestamp": "2022-02-14T10:21:58Z",
      "count": 1,
      "type": "Warning",
      "eventTime": null,
      "reportingComponent": "",
      "reportingInstance": ""
    }
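To check quickly whether a captured event refers to a Node or a Pod, the JSON can be decoded and its involvedObject.kind inspected. The following is only a minimal sketch using the Go standard library; the `event` and `involvedObject` types and the `involvedKind` helper are hypothetical names for illustration and are not part of kube-eventer or node-problem-detector.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// involvedObject holds only the fields of interest here (hypothetical type).
type involvedObject struct {
	Kind string `json:"kind"`
	Name string `json:"name"`
}

// event is a trimmed-down view of a Kubernetes event (hypothetical type).
type event struct {
	InvolvedObject involvedObject `json:"involvedObject"`
	Reason         string         `json:"reason"`
}

// involvedKind decodes the raw event JSON and returns involvedObject.kind.
func involvedKind(raw []byte) (string, error) {
	var e event
	if err := json.Unmarshal(raw, &e); err != nil {
		return "", err
	}
	return e.InvolvedObject.Kind, nil
}

func main() {
	// Shortened copy of the event shown in the log.
	raw := []byte(`{"involvedObject":{"kind":"Node","name":"k8s-master"},"reason":"OOMKilling"}`)
	kind, err := involvedKind(raw)
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	fmt.Println("involvedObject.kind =", kind)
}
```

For the event in this report, such a check yields Node, which matches the behaviour described.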

How should node-problem-detector.yaml and kube-eventer.yaml be configured so that Pod information is available when a container is OOM-killed?

ringtail commented 2 years ago

This Alibaba Cloud feature is implemented in NPD. In the community version, OOM is currently surfaced as a Warning event at the Node level.

crushCoin commented 2 years ago

https://github.com/AliyunContainerService/node-problem-detector/blob/master/pkg/systemlogmonitor/log_monitor.go#L67

The cause is the pod UID regular expression at this line: it works when the Cgroup Driver is systemd, but not when it is cgroupfs.

Modify the source and rebuild (git clone -b alibabacloud-v0.8.10 https://githubfast.com/AliyunContainerService/node-problem-detector.git):

1. Change the regular expression to:

    uuidRegx = regexp.MustCompile("[0-9a-f]{8}[0-9a-f]{4}[0-9a-f]{4}[0-9a-f]{4}[0-9a-f]{12}|[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}")

2. At https://github.com/AliyunContainerService/node-problem-detector/blob/master/pkg/systemlogmonitor/log_monitor.go#L216, change the line to the following (this keeps the PodOOMKilling and OOMKilling log text together):

    message = fmt.Sprintf("pod was OOM killed. node:%s pod:%s namespace:%s uuid:%s\n", pod.Spec.NodeName, pod.Name, pod.Namespace, uuid) + message

3. Edit kernel-monitor.json ("bufferSize": 30 is the length of the ring buffer used for log matching and can be increased as needed) and add the following entry to rules:

    {
      "type": "temporary",
      "reason": "PodOOMKilling",
      "pattern": "Task in /kubepods(.+) killed as a result of limit of .(\n.)+ Kill process \d+ (.+) score \d+ or sacrifice child\nKilled process \d+ (.+) total-vm:\d+kB, anon-rss:\d+kB, file-rss:\d+kB.*"
    }

The text above was altered by escaping when posted; see the screenshot for the exact content: 企业微信截图_16522332592620
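Steps 1 and 2 can be tried outside NPD with a small standalone program. This is only a sketch: the regular expression is copied exactly as it is printed in the comment (which, per the note about escaping, may differ from the screenshot), `extractPodUID` and `prefixMessage` are hypothetical helper names, and the sample log line, pod name, and UID are made up to resemble a cgroupfs-style path with a dashed pod UID.

```go
package main

import (
	"fmt"
	"regexp"
)

// uuidRegx is copied verbatim from step 1 as printed in the thread.
// The second alternative accepts the dashed UID form.
var uuidRegx = regexp.MustCompile("[0-9a-f]{8}[0-9a-f]{4}[0-9a-f]{4}[0-9a-f]{4}[0-9a-f]{12}|[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}")

// extractPodUID (hypothetical helper) returns the first UID-like
// substring found in a kernel log line, or "" if there is none.
func extractPodUID(logLine string) string {
	return uuidRegx.FindString(logLine)
}

// prefixMessage (hypothetical helper) builds the combined message from
// step 2; plain strings stand in for the pod object's fields.
func prefixMessage(node, pod, namespace, uuid, message string) string {
	return fmt.Sprintf("pod was OOM killed. node:%s pod:%s namespace:%s uuid:%s\n",
		node, pod, namespace, uuid) + message
}

func main() {
	// Made-up log line in the shape of a cgroupfs cgroup path.
	line := "Task in /kubepods/burstable/pod0a1b2c3d-1111-2222-3333-444455556666/0123 killed as a result of limit of /kubepods/burstable/pod0a1b2c3d-1111-2222-3333-444455556666"
	uid := extractPodUID(line)
	fmt.Println("uid:", uid)
	fmt.Print(prefixMessage("k8s-master", "stress-pod", "default", uid,
		"Killed process 2235 (stress)\n"))
}
```

In NPD itself the extracted UID is then used to look up the pod; that lookup is omitted here because it depends on the cluster.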