RRZE-HPC / likwid

Performance monitoring and benchmarking suite
https://hpc.fau.de/research/tools/likwid/
GNU General Public License v3.0

[BUG] First event group reports wrong values if system was idle at start of measurement (C-API) #563

Closed ntippman closed 10 months ago

ntippman commented 10 months ago

Describe the bug

I use the C-API to continuously monitor multiple event groups and thus switch between them. I noticed that some of the values were completely off, and it seems to affect only the first event group, regardless of which group that is. But only if there was no initial load on the system at the start of the measurement...

Tests were conducted on 2x Intel 6252 and 2x Intel 8360Y.

For my reproducer I am measuring two groups, CLOCK and FLOPS_SP. I will focus on the metric Clock [MHz], which is present in both groups; however, the bug affects all metrics of the affected group. The bug only occurs when at least 2 groups are measured.

When there is no load on the server, LIKWID mostly reports nan values, which is understandable and not an issue (one value per thread, output shortened):

# CLOCK group -  Clock [MHz]
 nan, nan, 2357.49, nan, nan, nan, nan, nan, nan, nan, nan, 994.994, nan, nan, nan, nan,...
# FLOPS_SP group -  Clock [MHz]
 nan, nan, 2482.05, nan, nan, nan, nan, nan, nan, nan, nan, 974.47, nan, nan, nan, nan,...
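(The nan values are presumably just 0/0: assuming Clock [MHz] is derived from the fixed counters in the usual way, something like

Clock [MHz] = 1.E-06 * (CPU_CLK_UNHALTED_CORE / CPU_CLK_UNHALTED_REF) / inverseClock

both counters stay at 0 on cores that were halted for the whole interval, so the ratio is undefined.)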

When applying a load (FIRESTARTER) I would expect both groups to report proper values. Unfortunately, the first group never reports reasonable values, while the second group now does:

# CLOCK group -  Clock [MHz]
995.859, nan, nan, nan, 999.119, 999.353, nan, nan, nan, 999.369, nan, nan, nan, nan, nan,...
# FLOPS_SP group -  Clock [MHz]
1555.62, 1555.62, 1555.62, 1555.62, 1555.62, 1555.62, 1555.63, 1555.63, 1555.62, 1555.63,...

This happens regardless of how long the load was applied. The measurement interval is 5 seconds, so roughly 2.5 seconds per group.

If I start a load before starting the measurements, everything is fine and all metrics of all groups report correct values. The bug only occurs when there was no load at the start of the measurement, and only with at least one switch between groups!

To Reproduce

The C-API of current master is used on AlmaLinux 9.2.

General code (simplified):

HPMmode(ACCESSMODE_DIRECT);
HPMinit();
perfmon_init(cpus.size(), &cpus[0]);

for (size_t i = 0; i < groupList.size(); i++){
    int gid = perfmon_addEventSet(groupList[i].c_str());
    gids.push_back(gid);
}

perfmon_setupCounters(gids[0]);

while(1){
  for (size_t i = 0; i < gids.size(); i++) {
    if (i == 0)
      perfmon_startCounters();
    else
      perfmon_switchActiveGroup(gids[i]);
    usleep(setTime);
  }

  perfmon_stopCounters();

  // print values

  for (size_t i = 0; i < gids.size(); i++) {
    int metricCount = perfmon_getNumberOfMetrics(gids[i]);
    for (int k = 0; k < metricCount; k++) {
        std::string metricName = std::string(perfmon_getMetricName(gids[i], k));
        for (size_t l = 0; l < cpuCount; l++) {
            double value = perfmon_getLastMetric(gids[i], k, l);
            std::cout << value << ", ";
            // if (std::isnan(value)) value = 0.0;
        }
        std::cout << std::endl;
    }
  }
}

Workaround

There seems to be something wrong with perfmon_startCounters or perfmon_switchActiveGroup on idle machines. One workaround I found was to call perfmon_switchActiveGroup right after perfmon_startCounters but on the same event group.

perfmon_setupCounters(gids[0]);

while(1){
  for (size_t i = 0; i < gids.size(); i++) {
      if (i == 0){
        perfmon_startCounters();
        // switch to the same group, which seems to fix the setup of the initial group
        perfmon_switchActiveGroup(gids[0]);
      } else
        perfmon_switchActiveGroup(gids[i]);
      usleep(setTime);
  }

  perfmon_stopCounters();
}

With this workaround all reported values of all groups are correct, regardless of whether the node was idle when the measurement started.

I am not sure if my assumptions are correct, but at least the bug is reproducible on my setup. Please find attached the output of my reproducer (with verbosity 1 and 3).

likwid_output_verbosity1.txt likwid_output_verbosity3.txt

TomTheBear commented 10 months ago

Thanks for reporting.

I'm surprised by your workaround because i == 0, so you switch to the same group (gids[0]) after setting up & starting it. The switch group function has an early return if new_group == groupSet->activeGroup, so perfmon_switchActiveGroup(gids[0]); should do nothing.

Have you tried using the logic from examples/monitoring.c?

perfmon_setupCounters(gid);
perfmon_startCounters();
sleep(sleeptime);
perfmon_stopCounters();
// save results, here just printing
for (c = 0; c < numCpus; c++)
{
  for (i = 0; i< perfmon_getNumberOfMetrics(gid); i++)
  {
      printf("%s,cpu=%d %f\n", perfmon_getMetricName(gid, i), cpus[c], perfmon_getLastMetric(gid, i, c));
  }
}

Since you are using the direct access mode, many counter accesses will be performed through the rdpmc instruction. Can you test how it behaves when not using rdpmc (comment out line https://github.com/RRZE-HPC/likwid/blob/master/src/access_x86_msr.c#L147)?

ntippman commented 10 months ago

Your hint that perfmon_switchActiveGroup(gids[0]) should not do anything in my case helped me find the issue. I logged the activeGroup and whether a switch was performed or skipped.

It turns out that perfmon_switchActiveGroup(gids[0]) actually does perform the switch after the first interval. But my own code is at fault for that, not LIKWID.

Let's assume that gids = [0,1]. After the first iteration of the while-loop, the activeGroup is 1. When perfmon_startCounters is called, the activeGroup is still 1, and therefore gid 0 is never measured again after the first iteration. This also explains why perfmon_switchActiveGroup(gids[0]) actually does work. And it also becomes clear why gid 0 reports values if there was a load on the system when starting the measurement: it keeps reporting the same values from the one initial measurement...
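To spell out the faulty sequence (an illustrative trace, assuming gids = [0, 1] and the original loop from my first post):

Iteration 1 (perfmon_setupCounters(gids[0]) was called before the loop):
  i == 0: perfmon_startCounters()        -> activeGroup is 0, group 0 is measured
  i == 1: perfmon_switchActiveGroup(1)   -> activeGroup becomes 1, group 1 is measured
perfmon_stopCounters()

Iteration 2 and later:
  i == 0: perfmon_startCounters()        -> activeGroup is still 1, so group 1 is measured again
  i == 1: perfmon_switchActiveGroup(1)   -> early return, group already active
perfmon_stopCounters()

=> group 0 is only measured during the very first interval; perfmon_getLastMetric(gids[0], ...)
   keeps returning the stale results from that first interval.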

I guess there are three solutions for this.

Either never call perfmon_startCounters and let perfmon_switchActiveGroup handle everything, which works but may not be the cleanest solution:

while(1){
  for (size_t i = 0; i < gids.size(); i++) {
    perfmon_switchActiveGroup(gids[i]);
    usleep(setTime);
  }
  perfmon_stopCounters();
}

Or always perform a proper perfmon_setupCounters(gids[0]):

while(1){
  for (size_t i = 0; i < gids.size(); i++) {
    if (i == 0){
      perfmon_setupCounters(gids[0]);
      perfmon_startCounters();
    } else
      perfmon_switchActiveGroup(gids[i]);
    usleep(setTime);
  }
  perfmon_stopCounters();
}

Or, as recommended by examples/monitoring.c:

while(1){
  for (size_t i = 0; i < gids.size(); i++) {
    perfmon_setupCounters(gids[i]);
    perfmon_startCounters();
    usleep(setTime);
    perfmon_stopCounters();
  }
}

Are there any benefits to calling setup/start/stop for each gid instead of using perfmon_switchActiveGroup?

Thank you for your help!

TomTheBear commented 10 months ago

The perfmon_switchActiveGroup code itself is basically a (if running -> stop(current)), setup(new), start(new), current = new, so in the end it does not matter whether you use perfmon_switchActiveGroup or any of the other approaches. IMHO the if-else in the second approach looks complicated; the third is the cleanest.
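Roughly, in pseudocode (a sketch of these semantics plus the early return mentioned above, not the actual implementation; countersRunning, activeGroup and switch_group_equivalent are just illustration names, not LIKWID API):

#include <likwid.h>

static int countersRunning = 0;
static int activeGroup     = -1;

/* Illustrative equivalent of perfmon_switchActiveGroup(new_gid), as described
 * above. Sketch only, not the actual LIKWID source. */
static void switch_group_equivalent(int new_gid)
{
    if (new_gid == activeGroup)
        return;                       /* group already active: nothing to do */
    if (countersRunning)
        perfmon_stopCounters();       /* stop (and read) the currently active group */
    perfmon_setupCounters(new_gid);   /* program the counters for the new group */
    perfmon_startCounters();          /* start counting the new group */
    activeGroup     = new_gid;
    countersRunning = 1;
}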