Yes, I also noticed this but have not had time to investigate. I am on a project this week, but I will take a look, as it seems something is pushing the run queue longer. On the systems with the highest run queues, the job mix contains some java jobs by @akahles and some ipython jobs by @corcra.
I regularly check my jobs with htop and they should not take more CPU than requested. I will double check though. However, they are fairly high in I/O (almost exclusively to the local /scratch) - could that contribute to the high load value? I remember that this was suggested as an explanation by a previous admin when we observed high loads.
Load (run queue) doesn't necessarily imply "more CPU than requested". I/O can indeed contribute to it.
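For reference, a minimal way to see this on a node (standard Linux/procps commands, nothing cluster-specific): the load average counts both runnable threads and threads in uninterruptible sleep, which is usually I/O wait, so heavy local I/O can raise load without using extra CPU.

    ps -eLo state= | sort | uniq -c    # tally thread states; R = runnable, D = uninterruptible (I/O) wait
    cat /proc/loadavg                  # 1/5/15-minute load plus running/total task counts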
Ok, thanks. Let me know in case you need more information about my jobs.
Some brief glances between meetings show a fair number of threads on some nodes also involved in GPFS I/O. So it basically appears to be a blend of reasons so far. I am monitoring off and on.
I believe this is related to the java runs. Which seem to have quite the collection of threads each if I'm reading ps -axH right. But I believe their runqueue is confined via the cpuset to the requested cpu count. I've not had much time to dig past that.
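For anyone who wants to double-check the confinement, a rough sketch (the <java_pid> placeholder and the exact cpuset layout are assumptions; paths depend on how the scheduler sets up cpusets on this cluster):

    cat /proc/<java_pid>/cpuset            # which cpuset the process belongs to
    taskset -cp <java_pid>                 # CPU affinity list as the kernel reports it
    ps -Lo lwp=,psr=,comm= -p <java_pid>   # per thread: thread id, last CPU used, command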
AFAIK, this is GATK. @akahles: did you copy the data to local disk before processing?
Hm, ok. The java jobs would be me. I have to check again. I am running 3rd-party software and had assumed that running with an argument a la num-threads = 2 would mean that the process will take only two threads.
To be clear, I'm not stating this is a problem...
Re @ratsch: Yes, GATK is getting its data from /scratch.
Just that it's probably what is making the run queue longer. But I could also be wrong. Very distracted today.
Just as an example, and perhaps for a second pair of eyes: on gpu-2-6 you have 651 threads of java by my count, in 13 instances. And some are waiting on I/O periodically, which I suspect ups the run queue.
Again, this is just an observation at this point.
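For reference, a sketch of how one could reproduce that count (roughly equivalent to the ps -axH view; exact numbers will of course differ by node and moment):

    ps -eLo comm= | grep -c '^java'    # total java threads (LWPs) on the node
    pgrep -x java | wc -l              # number of java instances
    ps -C java -o pid=,nlwp=,stat=     # per-process thread count and state (D = waiting on I/O)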
There is a great primer on parallelism in GATK here, though @akahles may have already read this. Sounds like it is actually rather complex to tune the appropriate level of parallelism.
Thanks @jchodera for the pointers. I was aware of the parallelism doc but not of the issue above. However, GATK support comments on the issue with: "The behavior you're observing is related to multithreading in the JVM itself, which is outside of GATK control and may require tuning the JVM parameters for java to behave as desired on your particular system." I will keep digging a bit and see what else I can find.
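In case it helps, a hedged sketch of the kind of JVM-level tuning GATK support seems to be hinting at: besides the application threads, HotSpot starts its own GC and JIT compiler threads, and standard flags like these can cap them (the values and the jar invocation are only examples, not a recommendation):

    # example only; actual GATK arguments omitted
    java -XX:ParallelGCThreads=2 -XX:CICompilerCount=2 -jar GenomeAnalysisTK.jar ...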
Please note that high "load" does not necessarily mean that one is using more CPU time than allowed. CPU sets make sure that we don't. In Andre's case, there is just a large number of threads waiting for local I/O. That shouldn't affect anybody (local I/O and within a defined CPU set). Hence, I'd say it is a peculiarity worth noting, but not something that needs extensive investigation.
Thanks! Just thought it was unusual, and I had noted that things on a node with surprisingly high load were running more slowly than usual.
It may be that interactive jobs are less responsive because of the heavy local I/O, if your job was also performing local I/O.
I've noticed there is some unusually high load on some nodes:
Normally, this should be < 32 hyperthreads (or perhaps up to 36 if we allow 4 GPU jobs to overcommit per node).
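A quick way to compare the two numbers on a given node (a sketch using standard tools only):

    cat /proc/loadavg    # 1/5/15-minute load averages
    nproc --all          # hyperthreads the OS sees (32 per node, per the comment above)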