protect load balancer from variable cpu clock

charmplusplus / charm

The Charm++ parallel programming system. Visit https://charmplusplus.org/ for more information.

Apache License 2.0

protect load balancer from variable cpu clock #641

Open jcphill opened 9 years ago

jcphill commented 9 years ago

Original issue: https://charm.cs.illinois.edu/redmine/issues/641

When running netlrts-smp on Linux with two processes on a single node for GPU-accelerated NAMD, the OS sees some cores (generally those associated with one process or the other) as less busy and slows the CPU clock. This exaggerates the load imbalance, and load balancing then overloads the other process even further. Setting the CPU frequency scaling governors to "performance" makes the issue go away, but we need to understand why the OS isn't reading the CPUs as fully loaded, or find a way for the load balancer to cope with time-varying CPU speeds.
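One way a load balancer could cope with time-varying CPU speeds is to rescale measured times by the core's clock frequency at measurement time. The following is only a minimal sketch of that idea, not something Charm++ is known to do. The helper names `readKHz` and `normalizedLoad` are hypothetical. The sketch assumes the Linux cpufreq sysfs files `scaling_cur_freq` and `cpuinfo_max_freq` are present under `/sys/devices/system/cpu/cpuN/cpufreq/`. A single frequency sample is also a rough stand-in for the average clock over the whole measurement interval.

```cpp
#include <fstream>
#include <string>

// Hypothetical helper: read one integer (kHz) from a cpufreq sysfs file.
// Returns -1 if the file is missing or unreadable (e.g. no cpufreq driver).
long readKHz(const std::string& path) {
    std::ifstream in(path);
    long khz = -1;
    if (!(in >> khz)) return -1;
    return khz;
}

// Hypothetical helper: convert wall time measured on a possibly slowed core
// into the time the same work would take at the core's maximum frequency.
// If either frequency is unknown, the measurement is returned unchanged.
double normalizedLoad(double wallSeconds, long curKHz, long maxKHz) {
    if (curKHz <= 0 || maxKHz <= 0) return wallSeconds;
    return wallSeconds * static_cast<double>(curKHz) / static_cast<double>(maxKHz);
}
```

For core 0, the inputs would come from `readKHz("/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq")` and `readKHz("/sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_max_freq")`. With normalized loads, an object that looked expensive only because its core was clocked down would not be treated as heavier than it is.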

jcphill commented 5 years ago

Original date: 2014-12-31 17:05:49

The OS was reading the CPU as not loaded due to frequent calls to CmiMachineProgressImpl() in NAMD entry methods. These calls were taking a lot of time waiting on something in the machine layer. When the CmiMachineProgressImpl() calls are removed, the OS keeps the cores at full or near-full speed. It is possible that it is a bug for user code to call CmiMachineProgressImpl() in non-smp builds, or at least it shouldn't serve any purpose when there is a communication thread. These calls were originally added to ensure that high-priority incoming messages were received promptly.

jcphill commented 5 years ago

Original date: 2015-01-16 02:50:37

That should be "It is possible that it is a bug for user code to call CmiMachineProgressImpl() in smp builds". In any case, I assume this is related to the issue of the comm thread holding the comm lock while sending and receiving messages rather than only when manipulating queues.
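To illustrate the locking distinction mentioned above, here is a minimal generic sketch in which the lock is held only while the queue is manipulated and is released before the (potentially slow) send. This is not the Charm++ machine layer code. `OutQueue`, `tryPop`, and `drain` are hypothetical names, and plain strings stand in for messages.

```cpp
#include <mutex>
#include <queue>
#include <string>
#include <utility>

// Hypothetical outgoing-message queue shared by worker threads and a comm thread.
struct OutQueue {
    std::mutex lock;
    std::queue<std::string> pending;

    // Lock is held only for the enqueue itself.
    void push(std::string msg) {
        std::lock_guard<std::mutex> guard(lock);
        pending.push(std::move(msg));
    }

    // Lock is held only while the queue is inspected and popped.
    // Returns false if the queue was empty.
    bool tryPop(std::string& out) {
        std::lock_guard<std::mutex> guard(lock);
        if (pending.empty()) return false;
        out = std::move(pending.front());
        pending.pop();
        return true;
    }
};

// Drain the queue, calling send(msg) for each message with the lock released,
// so other threads can enqueue (or poll) while a send is in flight.
// Returns the number of messages sent.
template <typename SendFn>
int drain(OutQueue& q, SendFn send) {
    int sent = 0;
    std::string msg;
    while (q.tryPop(msg)) {
        send(msg);  // lock is not held here
        ++sent;
    }
    return sent;
}
```

If the lock were instead held across `send`, any other thread that needed the queue would wait for the whole duration of the network operation. That kind of waiting is what the comment above suspects is behind the slow CmiMachineProgressImpl() calls.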

harshithamenon commented 5 years ago

Original date: 2015-08-17 19:14:40

Is this bug related to https://charm.cs.illinois.edu/redmine/issues/642? If so, does that fix eliminate this problem?

jcphill commented 5 years ago

Original date: 2015-08-19 01:42:56

I assume that fixes the problem. I also eliminated the CmiMachineProgressImpl() calls in NAMD for smp builds.

stwhite91 commented 5 years ago

Original date: 2017-02-01 17:42:45

It looks like this was fixed a while ago in the following two commits: ~~https://charm.cs.illinois.edu/gerrit/#/c/577/~~ https://github.com/UIUC-PPL/charm/commit/64fb65c3ea86e714f1a3549a13dad7662ecee274 ~~https://charm.cs.illinois.edu/gerrit/#/c/638/~~ https://github.com/UIUC-PPL/charm/commit/81f00d788030abd2a5df0e6d95057df41f658ed3

Anything left to do on this issue?

stwhite91 commented 5 years ago

Original date: 2018-03-28 01:21:40

Please close this issue if it is done.