cBio / cbio-cluster

MSKCC cBio cluster documentation

[NOT URGENT] Unusually high load on some nodes #335

Closed jchodera closed 9 years ago

jchodera commented 9 years ago

I've noticed there is some unusually high load on some nodes:

[chodera@mskcc-ln1 ~/scripts]$ ./check-nodes-for-load.tcsh 
gpu-1-4
 00:04:09 up 458 days, 13:36,  0 users,  load average: 27.23, 33.06, 42.73
gpu-1-5
 00:04:11 up 83 days,  8:56,  0 users,  load average: 52.47, 51.00, 49.14
gpu-1-6
 00:04:13 up 348 days, 15:49,  0 users,  load average: 70.79, 79.22, 70.22
gpu-1-7
 00:04:15 up 458 days, 13:36,  0 users,  load average: 77.32, 77.41, 65.34
gpu-1-8
 00:04:16 up 453 days,  8:52,  0 users,  load average: 29.68, 28.72, 27.27
gpu-1-9
 00:04:19 up 160 days,  3:21,  0 users,  load average: 58.36, 51.28, 50.73
gpu-1-10
 00:04:20 up 39 days,  9:37,  0 users,  load average: 44.73, 49.50, 46.10
gpu-1-11
 00:04:21 up 453 days,  8:30,  0 users,  load average: 66.08, 61.59, 54.12
gpu-1-12
 00:04:23 up 327 days,  7:18,  0 users,  load average: 31.91, 31.86, 31.07
gpu-1-13
 00:04:24 up 453 days,  8:30,  0 users,  load average: 34.74, 40.48, 40.92
gpu-1-14
 00:04:25 up 458 days,  9:15,  0 users,  load average: 77.50, 66.17, 53.44
gpu-1-15
 00:04:26 up 458 days,  9:15,  0 users,  load average: 45.67, 43.94, 43.25
gpu-1-16
 00:04:28 up 455 days,  3:57,  0 users,  load average: 70.24, 68.13, 65.62
gpu-1-17
 00:04:29 up 458 days,  9:15,  0 users,  load average: 41.51, 40.53, 39.44
gpu-2-4
 00:04:29 up 44 days, 12:59,  0 users,  load average: 33.05, 32.51, 29.90
gpu-2-5
 00:04:30 up 61 days, 11:35,  0 users,  load average: 42.82, 38.13, 36.07
gpu-2-6
 00:04:31 up 61 days, 10:58,  0 users,  load average: 74.13, 56.19, 45.97
gpu-2-7
 00:04:31 up 458 days,  8:25,  1 user,  load average: 47.40, 42.31, 41.73
gpu-2-8
 00:04:32 up 453 days,  8:41,  0 users,  load average: 41.56, 36.51, 33.77
gpu-2-9
 00:04:33 up 453 days,  8:55,  0 users,  load average: 34.43, 34.26, 35.06
gpu-2-10
 00:04:34 up 458 days,  8:16,  0 users,  load average: 90.98, 85.02, 71.35
gpu-2-11
 00:04:35 up 56 days, 15:13,  0 users,  load average: 51.14, 43.98, 39.35
gpu-2-12
 00:04:37 up 453 days,  9:23,  0 users,  load average: 36.71, 38.43, 36.04
gpu-2-13
 00:04:39 up 458 days,  8:13,  0 users,  load average: 47.71, 48.27, 46.37
gpu-2-14
 00:04:40 up 232 days, 12:06,  0 users,  load average: 87.41, 76.09, 70.35
gpu-2-15
 00:04:41 up 458 days,  8:22,  0 users,  load average: 75.62, 95.07, 86.55
gpu-2-16
 00:04:42 up 455 days,  3:45,  0 users,  load average: 50.56, 43.82, 40.07
gpu-2-17
 00:04:43 up 327 days,  8:16,  0 users,  load average: 31.42, 30.26, 30.19
gpu-3-8
 00:04:44 up 74 days, 14:28,  0 users,  load average: 37.88, 38.39, 37.03
gpu-3-9
 00:04:45 up 377 days,  6:15,  0 users,  load average: 53.13, 51.58, 52.15

Normally, load should be below 32 (the number of hyperthreads per node), or perhaps up to 36 if we allow 4 GPU jobs to overcommit per node.
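
For reference, a check like this can be approximated with a simple loop over the nodes (a minimal bash sketch assuming passwordless ssh and this node naming; the actual check-nodes-for-load.tcsh may differ):

```bash
#!/bin/bash
# Print each node's uptime, which includes the 1/5/15-minute load averages.
# Assumes passwordless ssh to the compute nodes and the node names seen above.
for node in gpu-1-{4..17} gpu-2-{4..17} gpu-3-{8,9}; do
    echo "$node"
    ssh -o ConnectTimeout=5 "$node" uptime
done
```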

tatarsky commented 9 years ago

Yes, I also noticed this but have not had time to investigate; I am on a project this week. I will take a look, though, as it seems something is lengthening the run queue. On the systems with the highest run queues, the job mix contains Java jobs by @akahles and some IPython jobs by @corcra.

akahles commented 9 years ago

I regularly check my jobs with htop and they should not take more CPU than requested. I will double check though. However, they are fairly high in I/O (almost exclusively to the local /scratch) - could that contribute to the high load value? I remember that this was suggested as an explanation by a previous admin when we observed high loads.

tatarsky commented 9 years ago

A high load (run queue length) doesn't necessarily mean "more CPU than requested". I/O can indeed contribute to it.
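
For example, tasks in uninterruptible sleep (state D, typically waiting on disk or GPFS I/O) count toward the Linux load average even though they use no CPU. A quick way to see both kinds of contributors on a node (a sketch):

```bash
# Count runnable (R) and uninterruptible-sleep (D) tasks; both contribute
# to the load average on Linux.
ps -eLo state= | awk '{ s = substr($1, 1, 1); if (s == "R" || s == "D") count[s]++ }
                      END { for (s in count) print s, count[s] }'
```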

akahles commented 9 years ago

Ok, thanks. Let me know in case you need more information about my jobs.

tatarsky commented 9 years ago

Some brief glances between meetings show a fair number of threads on some nodes also involved in GPFS I/O. So it basically appears to be a blend of reasons so far. I am monitoring off and on.

tatarsky commented 9 years ago

I believe this is related to the Java runs, which seem to have quite a collection of threads each if I'm reading ps -axH right. But I believe their run queue is confined via the cpuset to the requested CPU count. I've not had much time to dig past that.

ratsch commented 9 years ago

AFAIK, this is GATK. @akahles: did you copy the data to local disk before processing?


akahles commented 9 years ago

Hm, ok. The Java jobs would be mine. I have to check again. I am running third-party software and had assumed that passing an argument like num-threads=2 would mean the process takes only two threads.

tatarsky commented 9 years ago

To be clear, I'm not stating this is a problem...

akahles commented 9 years ago

Re @ratsch: Yes, GATK is getting its data from /scratch.
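
For reference, the staging pattern is roughly the following (a sketch with placeholder paths and a Torque-style $PBS_JOBID, not the exact job script):

```bash
# Hypothetical staging pattern: copy inputs to node-local /scratch, run
# against the local copy, then copy results back to shared storage.
WORKDIR=/scratch/$USER/$PBS_JOBID     # placeholder job-scoped directory
mkdir -p "$WORKDIR"
cp /path/to/shared/sample.bam /path/to/shared/sample.bam.bai "$WORKDIR"/
cd "$WORKDIR"
# ... run GATK here against $WORKDIR/sample.bam, writing output locally ...
cp output.vcf /path/to/shared/results/
rm -rf "$WORKDIR"
```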

tatarsky commented 9 years ago

Just that it's probably what is making the run queue longer. But I could also be wrong; very distracted today.

tatarsky commented 9 years ago

Just as an example, and perhaps for a second pair of eyes: on gpu-2-6 you have 651 java threads by my count, across 13 instances. Some are periodically waiting on I/O, and I suspect this lengthens the run queue.

Again, this is just an observation at this point.
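
For reference, the count amounts to roughly this (a sketch; ps -eLo lists one line per thread, much like ps -axH):

```bash
# Count java threads and java processes on the current node.
echo "java threads:   $(ps -eLo comm= | grep -cx java)"
echo "java processes: $(ps -eo comm= | grep -cx java)"
```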

jchodera commented 9 years ago

This seems relevant.

jchodera commented 9 years ago

There is a great primer on parallelism in GATK here, though @akahles may have already read this. Sounds like it is actually rather complex to tune the appropriate level of parallelism.

akahles commented 9 years ago

Thanks @jchodera for the pointers. I was aware of the parallelism doc but not of the issue above. However, GATK support commented on the issue with: "The behavior you're observing is related to multithreading in the JVM itself, which is outside of GATK control and may require tuning the JVM parameters for java to behave as desired on your particular system." I will keep digging a bit and see what else I can find.
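
If the extra threads are indeed JVM housekeeping (GC and JIT compiler threads) rather than GATK worker threads, they can presumably be capped with standard HotSpot flags. An illustrative invocation only (GATK 3-style arguments; the thread counts are examples, not a tested recommendation):

```bash
# Cap the JVM's garbage-collector threads alongside GATK's own thread
# setting so the total thread count stays close to what the job requested.
java -Xmx8g \
     -XX:ParallelGCThreads=2 \
     -XX:ConcGCThreads=1 \
     -jar GenomeAnalysisTK.jar \
     -T HaplotypeCaller \
     -nct 2 \
     -R reference.fasta \
     -I sample.bam \
     -o output.vcf
```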

ratsch commented 9 years ago

Please note that high "load" does not necessarily mean that one is using more CPU time than allowed; cpusets make sure that we don't. In Andre's case, there is just a large number of threads waiting for local I/O. That shouldn't affect anybody (it is local I/O and within the defined cpuset). Hence, I'd say it is a peculiarity worth noting, but not something that needs extensive investigation.
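
For anyone who wants to verify the confinement on a node, the effective CPU affinity of the processes can be checked directly (a sketch; taskset -cp reports the allowed CPU list per process):

```bash
# With cpusets in place, each java process should only be allowed to run
# on the CPUs its job requested; taskset shows the effective affinity list.
for pid in $(pgrep -x java); do
    taskset -cp "$pid"
done
```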


jchodera commented 9 years ago

Thanks! Just thought it was unusual, and I had noticed that things on a node with surprisingly high load were running more slowly than usual.

ratsch commented 9 years ago

> Thanks! Just thought it was unusual, and I had noticed that things on a node with surprisingly high load were running more slowly than usual.

It may be that interactive jobs are less responsive because of the heavy local I/O, or that your job was also performing local I/O.
