Tencent / caelus

Set of Kubernetes solutions for reusing idle resources of nodes by running extra batch jobs

ERROR: /sys/fs/cgroup/cpu/cpu.offline: no such file or directory #26

Closed MrFireChow closed 2 years ago

MrFireChow commented 2 years ago

Hi, I would like to ask for help. When I deploy caelus on k8s, the caelus pods show the following logs:

I1208 17:04:24.202219 7830 feature_gate.go:243] feature gates: &{map[]}
I1208 17:04:24.202253 7830 types.go:490] current namespace is NOT host
E1208 17:04:24.208590 7830 cpubt.go:56] checking BT file(cpu.offline) err: stat /sys/fs/cgroup/cpu/cpu.offline: no such file or directory
I1208 17:04:24.208616 7830 types.go:708] cpu isolate auto detect is enabled, chosen manage policy is: quota
W1208 17:04:24.208624 7830 types.go:745] adding non-host namespace prefix for kubelet root dir
F1208 17:04:24.208639 7830 types.go:724] cpu manager file(/rootfs/data/cpu_manager_state) err: open /rootfs/data/cpu_manager_state: no such file or directory

Did I miss some steps?

MrFireChow commented 2 years ago

I modified "kubelet_root_dir", which fixed the last error (cpu_manager_state not found), but now there is another problem:

F1209 16:21:09.017311 144120 health_check.go:70] failed init health check config: invalid character ']' looking for beginning of value

I would like to know which config file is checked at this step. Please give me a hand, thanks a lot!
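For context, the message "invalid character ']' looking for beginning of value" is the text Go's encoding/json produces when an array has a trailing comma before the closing bracket. The sketch below reproduces it with a made-up fragment; the key name "checks" and the helper parseErr are hypothetical and are not taken from the caelus config, so this only illustrates the kind of JSON mistake to look for, not which file contains it.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// parseErr is a hypothetical helper: it returns the error text produced by
// decoding raw as JSON, or "" when the input is valid.
func parseErr(raw string) string {
	var v interface{}
	if err := json.Unmarshal([]byte(raw), &v); err != nil {
		return err.Error()
	}
	return ""
}

func main() {
	// Illustrative fragments only; "checks" is not a real caelus key.
	bad := `{"checks": ["a", "b",]}` // trailing comma before ']'
	good := `{"checks": ["a", "b"]}`

	fmt.Printf("bad:  %q\n", parseErr(bad))
	fmt.Printf("good: %q\n", parseErr(good))
}
```

Running a suspect JSON config through a check like this (or any JSON linter) should point at a stray comma if that is the cause.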

ddongchen commented 2 years ago

Hi. The error "stat /sys/fs/cgroup/cpu/cpu.offline: no such file or directory" is not important: the log shows that the CPU policy was automatically switched to "quota". For "failed init health check config: invalid character ']' looking for beginning of value", please download the latest code and try again. Please let me know if it works.

MrFireChow commented 2 years ago

@ddongchen The latest code fixes the error "failed init health check config: invalid character ']' looking for beginning of value". However, another problem occurs:

I1211 10:58:22.200793 164324 manager.go:1158] Started watching for new ooms in manager
W1211 10:58:22.200812 164324 manager.go:256] Could not configure a source for OOM detection, disabling OOM events: open /dev/kmsg: no such file or directory
I1211 10:58:22.210798 164324 manager.go:272] Starting recovery of all containers
I1211 10:58:22.265747 164324 manager.go:277] Recovery completed
F1211 10:58:22.273235 164324 cgroup.go:259] cadvisor manager start err: inotify_add_watch /sys/fs/cgroup/cpuacct,cpu: no such file or directory

MrFireChow commented 2 years ago

@ddongchen My environment has "/sys/fs/cgroup/cpu,cpuacct" instead of "/sys/fs/cgroup/cpuacct,cpu", but inside the golang container I find "/sys/fs/cgroup/cpuacct,cpu". Does that matter?

MrFireChow commented 2 years ago

I found that this is a cadvisor bug, and I fixed it with:
1. mount -o remount,rw '/sys/fs/cgroup'
2. ln -s /sys/fs/cgroup/cpu,cpuacct /sys/fs/cgroup/cpuacct,cpu

In addition, I suppose the cadvisor version needs to be updated.