cBio / cbio-cluster

MSKCC cBio cluster documentation

kipmi0 chewing up CPU resources on gpu-2-13? #402

Open jchodera opened 8 years ago

jchodera commented 8 years ago

Looks like kipmi0 is chewing up CPU resources on gpu-2-13:

Tasks: 703 total,   3 running, 700 sleeping,   0 stopped,   0 zombie
Cpu(s): 75.0%us,  2.3%sy,  0.0%ni, 22.6%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  264487924k total, 66141344k used, 198346580k free,   574140k buffers
Swap: 276824052k total,    13132k used, 276810920k free,  8364028k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                                                                                                                                                                                                                                                   
19267 chodera   20   0 49.1g  46g  27m R 2400.2 18.6   2040:23 python                                                                                                                                                                                                                                                                                                   
 2415 root      39  19     0    0    0 R 74.1  0.0  44178:27 kipmi0     
jchodera commented 8 years ago

Perhaps this is normal? A number of nodes show high kipmi0 CPU usage:

[chodera@mskcc-ln1 ~/scripts]$ ./check-nodes-for-load.tcsh
gpu-1-4
root      2393 54.3  0.0      0     0 ?        RN   Mar06 33621:07 [kipmi0]
gpu-1-5
root      2412 22.5  0.0      0     0 ?        SN   Mar06 13979:21 [kipmi0]
gpu-1-6
root      2410 48.8  0.0      0     0 ?        RN   Mar06 30323:36 [kipmi0]
gpu-1-7
root      2406 88.2  0.0      0     0 ?        RN   Mar06 54733:53 [kipmi0]
gpu-1-8
root      2391 10.1  0.0      0     0 ?        RN   Mar06 6277:19 [kipmi0]
gpu-1-9
gpu-1-10
root      2400  0.0  0.0      0     0 ?        SN   Mar06  18:41 [kipmi0]
gpu-1-11
root      2406 24.6  0.0      0     0 ?        SN   Mar06 15288:21 [kipmi0]
gpu-1-12
root      2379  0.0  0.0      0     0 ?        SN   Mar06  16:41 [kipmi0]
gpu-1-13
root      2409 22.9  0.0      0     0 ?        SN   Mar06 14194:35 [kipmi0]
gpu-1-14
root      2390 69.6  0.0      0     0 ?        RN   Mar06 43205:29 [kipmi0]
gpu-1-15
root      2395 17.9  0.0      0     0 ?        RN   Mar06 11128:12 [kipmi0]
gpu-1-16
root      2414  0.4  0.0      0     0 ?        SN   Mar06 270:48 [kipmi0]
gpu-1-17
root      2396  0.0  0.0      0     0 ?        SN   Mar06  18:23 [kipmi0]
gpu-2-4
root      2383  0.0  0.0      0     0 ?        SN   Mar06  15:59 [kipmi0]
gpu-2-5
root      2403 24.1  0.0      0     0 ?        RN   Mar06 14953:05 [kipmi0]
gpu-2-6
root      2407  6.6  0.0      0     0 ?        SN   Mar06 4096:22 [kipmi0]
gpu-2-7
root      2416 90.7  0.0      0     0 ?        RN   Mar06 56203:51 [kipmi0]
gpu-2-8
root      2446 72.2  0.0      0     0 ?        RN   Mar30 19923:10 [kipmi0]
gpu-2-9
gpu-2-10
root      2405  0.0  0.0      0     0 ?        SN   Mar06  21:07 [kipmi0]
gpu-2-11
gpu-2-12
root      2421 36.3  0.0      0     0 ?        SN   Mar06 22481:53 [kipmi0]
gpu-2-13
root      2415 78.3  0.0      0     0 ?        RN   Mar10 44180:32 [kipmi0]
gpu-2-14
gpu-2-15
root      2412 55.0  0.0      0     0 ?        RN   Mar06 34039:05 [kipmi0]
gpu-2-16
root      2388  0.0  0.0      0     0 ?        SN   Mar06  21:11 [kipmi0]
gpu-2-17
root      2411  1.4  0.0      0     0 ?        SN   Mar06 877:49 [kipmi0]
gpu-3-8
root      2391 89.3  0.0      0     0 ?        RN   Mar08 53348:20 [kipmi0]
gpu-3-9
root      2397  0.0  0.0      0     0 ?        SN   Mar06  18:26 [kipmi0]
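
For reference, a minimal bash sketch of the sort of check check-nodes-for-load.tcsh presumably performs (the node list and passwordless ssh here are assumptions; the real script may differ):

#!/bin/bash
# Sketch: report kipmi0 CPU usage on each compute node.
# NODES is a placeholder; substitute however the cluster enumerates its nodes.
NODES="gpu-1-4 gpu-1-5 gpu-2-13"
for node in $NODES; do
    echo "$node"
    # The [k] trick keeps grep from matching itself. Note that %CPU in
    # "ps aux" is cumulative (TIME divided by time since the process started),
    # so a long-running busy kipmi0 thread shows a large value.
    ssh "$node" "ps aux | grep '[k]ipmi0'"
done
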
tatarsky commented 8 years ago

I believe this is a normal result of IPMI use (which we do for several metrics in the thermal and fan categories), but I will confirm that belief and make sure something hasn't gone wrong. I have seen BMC wedges periodically cause higher-than-expected IPMI loads.
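
For context, the in-band queries behind those metrics might look something like the sketch below (whether the collection actually uses ipmitool is an assumption on my part):

# In-band IPMI sensor reads; these go through the OS driver (/dev/ipmi0) and
# are serviced by the kipmi0 kernel thread, so a wedged or slow BMC shows up
# as kipmi0 CPU time.
ipmitool sdr type Temperature   # thermal sensors
ipmitool sdr type Fan           # fan speeds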

tatarsky commented 8 years ago

Seems like there are a few IPMI BMC wedges out there, though, so I'm looking into it again.

jchodera commented 8 years ago

Thanks!

tatarsky commented 8 years ago

Yeah, not sure at the moment what's up. I will investigate, however.

tatarsky commented 8 years ago

So I spent a fair chunk of time last night looking at this. In some cases I feel the usage was normal, but on a few nodes I believe some of the ASMB6 chips (the IPMI BMC add-on board) are failing to respond quickly, and in a few cases were not responding at all.

This has happened in various forms a few times with this BMC chip. I'm not sure of the cause.

But I believe that when that lack of response happens, the kipmi thread (while a low-priority one, according to everything I've ever read) does indeed spend more time asking said IPMI device what's going on, and continues to do so.
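
As a purely illustrative side note (not necessarily what will be done here): the ipmi_si driver exposes a parameter that caps how long the kipmi0 thread busy-polls per cycle, which is one way to check or bound that polling behavior:

# Inspect the current busy-poll cap (0 usually means "no limit").
cat /sys/module/ipmi_si/parameters/kipmid_max_busy_us
# Limit each poll to 100 microseconds, trading IPMI latency for CPU time.
echo 100 > /sys/module/ipmi_si/parameters/kipmid_max_busy_us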

So I've turned off IPMI from the OS side (the network side remains) for a bit on a particular node that exhibits the non-response problem, so I can analyze this further, and I will ask Exxact whether these ASMB6 chips have a component that degrades over time, such as a flash memory chip.
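
One way "turning off IPMI from the OS" can be done (the exact method used here is an assumption) is to unload the in-band driver stack, which stops kipmi0 entirely while leaving the BMC's network interface reachable (e.g. via ipmitool -I lanplus -H <bmc-address>):

# Remove the OS-side IPMI drivers; the BMC itself keeps running and stays
# reachable over the network.
modprobe -r ipmi_devintf ipmi_si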

All the new nodes came with the ASMB8 version of the chip. I don't know if that's an option.

Thanks for pointing it out. This will be left open until I get further data.