jchodera opened 8 years ago
Perhaps this is normal? A number of nodes show high kipmi0 CPU usage:
[chodera@mskcc-ln1 ~/scripts]$ ./check-nodes-for-load.tcsh
gpu-1-4
root 2393 54.3 0.0 0 0 ? RN Mar06 33621:07 [kipmi0]
gpu-1-5
root 2412 22.5 0.0 0 0 ? SN Mar06 13979:21 [kipmi0]
gpu-1-6
root 2410 48.8 0.0 0 0 ? RN Mar06 30323:36 [kipmi0]
gpu-1-7
root 2406 88.2 0.0 0 0 ? RN Mar06 54733:53 [kipmi0]
gpu-1-8
root 2391 10.1 0.0 0 0 ? RN Mar06 6277:19 [kipmi0]
gpu-1-9
gpu-1-10
root 2400 0.0 0.0 0 0 ? SN Mar06 18:41 [kipmi0]
gpu-1-11
root 2406 24.6 0.0 0 0 ? SN Mar06 15288:21 [kipmi0]
gpu-1-12
root 2379 0.0 0.0 0 0 ? SN Mar06 16:41 [kipmi0]
gpu-1-13
root 2409 22.9 0.0 0 0 ? SN Mar06 14194:35 [kipmi0]
gpu-1-14
root 2390 69.6 0.0 0 0 ? RN Mar06 43205:29 [kipmi0]
gpu-1-15
root 2395 17.9 0.0 0 0 ? RN Mar06 11128:12 [kipmi0]
gpu-1-16
root 2414 0.4 0.0 0 0 ? SN Mar06 270:48 [kipmi0]
gpu-1-17
root 2396 0.0 0.0 0 0 ? SN Mar06 18:23 [kipmi0]
gpu-2-4
root 2383 0.0 0.0 0 0 ? SN Mar06 15:59 [kipmi0]
gpu-2-5
root 2403 24.1 0.0 0 0 ? RN Mar06 14953:05 [kipmi0]
gpu-2-6
root 2407 6.6 0.0 0 0 ? SN Mar06 4096:22 [kipmi0]
gpu-2-7
root 2416 90.7 0.0 0 0 ? RN Mar06 56203:51 [kipmi0]
gpu-2-8
root 2446 72.2 0.0 0 0 ? RN Mar30 19923:10 [kipmi0]
gpu-2-9
gpu-2-10
root 2405 0.0 0.0 0 0 ? SN Mar06 21:07 [kipmi0]
gpu-2-11
gpu-2-12
root 2421 36.3 0.0 0 0 ? SN Mar06 22481:53 [kipmi0]
gpu-2-13
root 2415 78.3 0.0 0 0 ? RN Mar10 44180:32 [kipmi0]
gpu-2-14
gpu-2-15
root 2412 55.0 0.0 0 0 ? RN Mar06 34039:05 [kipmi0]
gpu-2-16
root 2388 0.0 0.0 0 0 ? SN Mar06 21:11 [kipmi0]
gpu-2-17
root 2411 1.4 0.0 0 0 ? SN Mar06 877:49 [kipmi0]
gpu-3-8
root 2391 89.3 0.0 0 0 ? RN Mar08 53348:20 [kipmi0]
gpu-3-9
root 2397 0.0 0.0 0 0 ? SN Mar06 18:26 [kipmi0]
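For reference, the filtering step of such a check could be sketched as below. This is a minimal illustration, not the actual check-nodes-for-load.tcsh: the `flag_kipmi` name, the node names, and the 10% threshold are all my own illustrative choices.

```shell
#!/bin/sh
# flag_kipmi: read `ps aux`-style output on stdin and print any
# [kipmi0] line whose %CPU (field 3) exceeds a threshold (default 10%).
flag_kipmi() {
    awk -v t="${1:-10}" '$11 == "[kipmi0]" && $3 + 0 > t'
}

# Hypothetical per-node loop (assumes passwordless ssh; node names
# are illustrative):
# for node in gpu-1-4 gpu-1-5; do
#     echo "$node"
#     ssh "$node" ps aux | flag_kipmi 10
# done
```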
I believe this is a normal result of IPMI use (which we do for several metrics in the thermal and fan categories), but I will confirm that and make sure something hasn't gone wrong. I have seen BMC wedges periodically cause higher-than-expected IPMI loads.
Seems like there are a few IPMI BMC wedges out there, though, so I'm looking into it again.
Thanks!
Yeah, not sure at the moment what's up. I will investigate, however.
So I spent a fair chunk of time last night looking at this. In some cases I feel the load was normal, but on a few of our nodes some of the ASMB6 chips (the IPMI BMC add-on board) are failing to respond quickly, and in a few cases were not responding at all.
This has happened in various forms a few times with this BMC chip. I'm not sure of the cause.
But I believe that when that lack of response happens, the kipmi thread (while a low-priority one, according to everything I've ever read) does indeed spend more time polling the IPMI device to ask what's going on, and it continues to do so.
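For what it's worth, the kernel's ipmi_si driver exposes a knob to cap how long kipmid busy-waits while polling the BMC, which is a commonly suggested mitigation for exactly this symptom. The value below is illustrative; verify the parameter name and units against your kernel's IPMI documentation before relying on it:

```shell
# Cap kipmid busy-waiting at 100 microseconds per poll (illustrative value).
# At runtime, if the parameter is exposed on this kernel:
echo 100 > /sys/module/ipmi_si/parameters/kipmid_max_busy_us

# Persistently, via modprobe configuration, e.g. in /etc/modprobe.d/ipmi.conf:
#   options ipmi_si kipmid_max_busy_us=100
```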
So I've turned off IPMI from the OS (the network side remains) for a bit, to analyze this further on a particular node that exhibits the non-response problem, and I will ask Exxact whether these ASMB6 chips have a component that degrades over time, such as a flash memory chip.
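Turning off OS-side IPMI while leaving the network side intact typically amounts to unloading the kernel IPMI modules; a hedged sketch (the module names below are the standard Linux ones, but check what is actually loaded on the node first):

```shell
# See which IPMI modules are loaded on the node.
lsmod | grep ipmi

# Unload the OS-side IPMI stack. The BMC's own network interface
# (e.g. ipmitool -H <bmc-addr> ... over lanplus) is unaffected,
# since it bypasses the host OS entirely.
modprobe -r ipmi_si ipmi_devintf ipmi_msghandler
```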
All the new nodes came with the ASMB8 version of the chip. I don't know if that's an option.
Thanks for pointing it out. This will be left open until I get further data.
Looks like kipmi0 is chewing up CPU resources on gpu-2-13: