POETSII / Orchestrator

The Orchestrator is the configuration and run-time management system for POETS platforms.

RTCL utilises 1 CPU permanently at 100% #238

Closed m8pple closed 3 years ago

m8pple commented 3 years ago

This is related to #236, but the root cause is a bit different.

Once CommonBase is able to avoid calling OnIdle all the time, we get left with a thread still spinning at 100%:

image

It is just locked in rtcl_func, and looks like it is spinning on a single tiny loop:

image

My reading of x86 is not what it used to be, and perf profiling is not that accurate, but that test-then-jal sequence looks like an infinite loop to me. Possibly related to #237.

Even if it is not an infinite loop, it is still wasting a huge amount of time and battery. I don't see why it can't sleep for 1ms or something (ideally longer, as that still probably keeps CPUs in a high power state).

m8pple commented 3 years ago

Changing it from:

void * rtcl_func(void * args)
// Thread function spin on the MPI RTC, bleating every so often.
// DO NOT post messages from this thread, because you cannot rely on anything
// being alive at the end to get them, and MPI will block.
{
RTCL::comms_t * pC = static_cast<RTCL::comms_t *>(args);
double t_ = MPI_Wtime();
double t;
for(;;) {
  if (pC->l_kill) break;
  if (pC->l_stop) continue;
  }

to:

void * rtcl_func(void * args)
// Thread function spin on the MPI RTC, bleating every so often.
// DO NOT post messages from this thread, because you cannot rely on anything
// being alive at the end to get them, and MPI will block.
{
RTCL::comms_t * pC = static_cast<RTCL::comms_t *>(args);
double t_ = MPI_Wtime();
double t;
for(;;) {
  if (pC->l_kill) break;
  if (pC->l_stop) {
    OSFixes::sleep(10);
    continue;
  }

reduced CPU usage massively, and had no other discernible effect. Though I'm not really sure where RTCL is used from, or what exactly it delays. I'm assuming that the accuracy with which it prints "TICK" to stdout is not critical, given there will be lots of jitter on the file flushing anyway.
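For anyone who wants to try the idea outside the Orchestrator, here is a minimal standalone sketch of the same loop shape. It is not the real RTCL code: `Comms`, `clock_loop`, `run_paused` and the `wakeups` counter are hypothetical names made up for illustration, the flags are `std::atomic<bool>` (an assumption, the real `comms_t` may declare them differently), and `std::this_thread::sleep_for` stands in for `OSFixes::sleep(10)`, assuming its argument is in milliseconds.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical stand-in for RTCL::comms_t: only the two flags from the snippet.
struct Comms {
    std::atomic<bool> l_kill{false};  // set to ask the thread to exit
    std::atomic<bool> l_stop{true};   // set while the clock is paused
    std::atomic<long> wakeups{0};     // illustration only: passes made while paused
};

// Same control flow as the modified loop: sleep while stopped instead of spinning.
void clock_loop(Comms& c, std::chrono::milliseconds nap) {
    assert(nap.count() > 0);
    for (;;) {
        if (c.l_kill) break;
        if (c.l_stop) {
            ++c.wakeups;
            std::this_thread::sleep_for(nap);  // stands in for OSFixes::sleep(10)
            continue;
        }
        // Running state: the real tick logic would go here.
        std::this_thread::yield();
    }
}

// Runs the loop in its paused state for `paused_for`, then kills it.
// Returns how many times the paused loop woke up.
long run_paused(std::chrono::milliseconds paused_for,
                std::chrono::milliseconds nap) {
    Comms c;
    std::thread t(clock_loop, std::ref(c), nap);
    std::this_thread::sleep_for(paused_for);
    c.l_kill = true;
    t.join();
    return c.wakeups.load();
}
```

Calling `run_paused` with a 200 ms pause and a 10 ms nap should report a few tens of wakeups at most, where the original `continue`-only version would go round the loop millions of times in the same period. The cost is that a kill or restart request can take up to one nap to be noticed.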

heliosfa commented 3 years ago

RTCL is designed to generate high-precision and high-accuracy events. The use-case of RTCL is to provide these events to another MPI process, e.g. the Mothership(s), on varying schedules. A uniform unloaded spinner was used to avoid needing to faff with variable length sleeps.
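One possible middle ground between a pure spinner and fully variable-length sleeps is to sleep through most of the wait and spin only for the last short window before the deadline. The sketch below is just an illustration of that idea, not how RTCL works or is planned to work: `wait_until_precise` and `spin_window` are hypothetical names, and it uses `std::chrono::steady_clock` rather than `MPI_Wtime`.

```cpp
#include <chrono>
#include <thread>

using Clock = std::chrono::steady_clock;

// Wait until `deadline`: sleep for the bulk of the interval, then busy-wait
// only for the final `spin_window` so the wake-up lands close to the deadline.
// Returns how far past the deadline the function returned.
Clock::duration wait_until_precise(Clock::time_point deadline,
                                   Clock::duration spin_window) {
    const Clock::time_point coarse = deadline - spin_window;
    if (Clock::now() < coarse) {
        std::this_thread::sleep_until(coarse);  // cheap, but may oversleep
    }
    while (Clock::now() < deadline) {
        // Short busy-wait: this is the only part that burns CPU.
    }
    return Clock::now() - deadline;
}
```

The spin window would need to be larger than the worst oversleep seen on the target OS, otherwise the sleep itself can overshoot the deadline. Whether that is precise enough for the event schedules described above is something only measurement on the real platform could answer.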

Nothing currently uses RTCL in actual deployment (implementation is pending, but will be part of the Supervisor API).

The way forward is that we will not run RTCL for the time being, and will bring it back when functionality using it has been implemented.