Open jcphill opened 9 years ago
Original date: 2014-12-31 17:05:49
The OS was reading the CPUs as not loaded due to frequent calls to CmiMachineProgressImpl() in NAMD entry methods. These calls were spending a lot of time waiting on something in the machine layer. When the CmiMachineProgressImpl() calls are removed, the OS keeps the cores at full or near-full speed. It is possible that it is a bug for user code to call CmiMachineProgressImpl() in non-smp builds, or at least it shouldn't serve any purpose when there is a communication thread. These calls were originally added to ensure that high-priority incoming messages were received promptly.
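For illustration, here is a minimal sketch of how such a progress call could be restricted to builds without a communication thread, assuming the standard Charm++ CMK_SMP build macro; the function name exampleEntryMethod and the include are placeholders, and the actual NAMD change may have simply removed the calls outright.

```cpp
// Sketch only: a guarded manual-progress call as it might appear inside an
// entry method.  Assumes converse.h declares CmiMachineProgressImpl() and
// that CMK_SMP distinguishes SMP builds (which have a communication thread).
#include "converse.h"

void exampleEntryMethod() {   // hypothetical entry method body
#if !CMK_SMP
  // Only poll the machine layer manually when there is no communication
  // thread to receive high-priority incoming messages for us.
  CmiMachineProgressImpl();
#endif
  // ... rest of the entry method's work ...
}
```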
Original date: 2015-01-16 02:50:37
That should be "It is possible that it is a bug for user code to call CmiMachineProgressImpl() in smp builds". In any case, I assume this is related to the issue of the comm thread holding the comm lock while sending and receiving messages rather than only when manipulating queues.
Original date: 2015-08-17 19:14:40
Is this bug related to https://charm.cs.illinois.edu/redmine/issues/642? If so, does that fix eliminate this problem?
Original date: 2015-08-19 01:42:56
I assume that fixes the problem. I also eliminated the CmiMachineProgressImpl() calls in NAMD for smp builds.
Original date: 2017-02-01 17:42:45
It looks like this was fixed a while ago in the following two commits:
https://charm.cs.illinois.edu/gerrit/#/c/577/ https://github.com/UIUC-PPL/charm/commit/64fb65c3ea86e714f1a3549a13dad7662ecee274
https://charm.cs.illinois.edu/gerrit/#/c/638/ https://github.com/UIUC-PPL/charm/commit/81f00d788030abd2a5df0e6d95057df41f658ed3
Anything left to do on this issue?
Original date: 2018-03-28 01:21:40
Please close this issue if it is done.
Original issue: https://charm.cs.illinois.edu/redmine/issues/641
When running netlrts-smp on Linux with two processes on a single node for GPU-accelerated NAMD, the OS sees some cores (generally those associated with one process or the other) as less busy and slows the CPU clock. This exaggerates the load imbalance and results in even worse overload of the other process after load balancing. Setting the CPU frequency scaling governors to "performance" makes the issue go away, but we need to understand why the OS isn't reading the CPUs as fully loaded, or find a way for the load balancer to cope with time-varying CPU speeds.
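As a quick way to spot the workaround state on an affected node, here is a hedged diagnostic sketch that prints each core's current cpufreq governor by reading the usual Linux sysfs paths; the paths and availability of cpufreq depend on the kernel and distro, and this is not part of NAMD or Charm++.

```cpp
// Diagnostic sketch: list each core's scaling governor so cores not set to
// "performance" are easy to spot.  Assumes the standard sysfs layout
// /sys/devices/system/cpu/cpuN/cpufreq/scaling_governor.
#include <fstream>
#include <iostream>
#include <string>

int main() {
  for (int cpu = 0; ; ++cpu) {
    std::ifstream f("/sys/devices/system/cpu/cpu" + std::to_string(cpu) +
                    "/cpufreq/scaling_governor");
    if (!f) break;                    // no more cores, or no cpufreq support
    std::string governor;
    std::getline(f, governor);
    std::cout << "cpu" << cpu << ": " << governor << "\n";
  }
  return 0;
}
```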