[Metricbeat] Normalize cgroup CPU data

sorantis commented 3 years ago

The customer would like to gather normalized CPU accounting per cgroup (e.g. for LXC containers). Metricbeat's CPU accounting can be collected per cgroup, but is then reported in snapshots of nanoseconds of CPU time since the cgroup was started, for example:

"cgroup": {
    "cpuacct": {
      "percpu": {
        "1": 3831401199507,
        "2": 4619294890800,
        "3": 4604619599210,
        "4": 4379214837270,
        "5": 4126400747412,
        "6": 3939661307456
      },
      "stats": {
        "user": {
          "ns": 15226370000000
        },
        "system": {
          "ns": 8906990000000
        }
      },
      "id": "lxv1393",
      "total": {
        "ns": 25500592581655
      },
      "path": "/lxc/lxv1393"
    }
  }

Since we do have the total nanoseconds, we can provide the percpu values as normalized percentages, similar to how it's done for the system module's cpu metricset:

cpu.metrics: [percentages, normalized_percentages, ticks]

We would probably need a similar config option for cgroups, like:

cgroups.cpu.metrics: [percentages, normalized_percentages, ticks]

elasticmachine commented 3 years ago

Pinging @elastic/integrations-services (Team:Services)

fearful-symmetry commented 3 years ago

Looking at the docs for the CPU accounting controller here: https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v1/cpuacct.html

user and system are in USER_HZ unit.

These are the tick units we reference elsewhere.

So, looking at the original issue, I'm a tad confused by the wording. If we want to map any given tick usage in cgroups to overall usage, that should be possible, as we're dealing with the same units, and we measure everything based off time deltas. If we're just trying to normalize the per-cpu cgroup metrics, that's possible as well.

sorantis commented 3 years ago

@fearful-symmetry Here's the original request:

The customer uses LXC as their container environment, and we want to monitor the containers with Metricbeat, but we need to figure out how to get pct metrics, especially for stats user and stats system.

masci commented 3 years ago

I've updated the issue description with more info coming from a support case. This is doable at the module level, we should be good adding this to the next release cycle. Thanks @fearful-symmetry, going to un-assign you and move the issue to our backlog.

fearful-symmetry commented 3 years ago

@masci / @sorantis / @kesslerm a brief update, since we want this done for 7.13

I spent a good bit of the day digging through how we "normalize" cpu metrics in other parts of metricbeat, since our CPU monitoring code is spread across what feels like 4 different libraries.

So, I'm assuming that we want a "nice" percentage number that's comparable to how we report cpu usage percentages elsewhere.

There's an interesting caveat here I didn't notice earlier: the cgroup itself reports totals in nanoseconds, but user and kernel time in USER_HZ. We do some math to convert everything to nanoseconds, which is why we get round numbers like 15226370000000 for user time compared to 3831401199507 for a per-CPU value. This means that we're not going to get entirely accurate numbers if we start trying to calculate percentages from user and system. Elsewhere, everything is in USER_HZ, so the math is cleaner.
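To illustrate the unit mismatch, here is a minimal Python sketch, not Metricbeat's actual code. It assumes USER_HZ is 100 (a common Linux value that should be confirmed with `sysconf(_SC_CLK_TCK)` on the target host); the helper names are hypothetical. It shows that the user value from the example event is an exact multiple of one tick, while the per-CPU value is not.

```python
NS_PER_SECOND = 1_000_000_000
# Assumption: USER_HZ is 100; confirm via sysconf(_SC_CLK_TCK) on a real host.
ASSUMED_USER_HZ = 100


def ticks_to_ns(ticks, user_hz=ASSUMED_USER_HZ):
    """Convert a USER_HZ tick count to nanoseconds (hypothetical helper)."""
    return ticks * NS_PER_SECOND // user_hz


def is_tick_aligned(ns, user_hz=ASSUMED_USER_HZ):
    """True if a nanosecond value is a whole number of ticks."""
    return ns % (NS_PER_SECOND // user_hz) == 0


# Values taken from the example event in the issue description.
user_ns = 15226370000000    # stats.user.ns, converted from ticks
percpu_ns = 3831401199507   # percpu["1"], reported natively in nanoseconds

# Recover the tick count that the user value was derived from.
user_ticks = user_ns * ASSUMED_USER_HZ // NS_PER_SECOND

print(user_ticks)                  # 1522637
print(is_tick_aligned(user_ns))    # True: resolution is one tick (10 ms here)
print(is_tick_aligned(percpu_ns))  # False: true nanosecond resolution
```

Under that assumption, user and system can only move in 10 ms steps, while the total and per-CPU counters move in nanosecond steps.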

There are some other caveats here, as the cgroup docs mention:

cpuacct controller uses percpu_counter interface to collect user and system times. This has two side effects:

It is theoretically possible to see wrong values for user and system times. This is because percpu_counter_read() on 32bit systems isn’t safe against concurrent writes.

It is possible to see slightly outdated values for user and system times due to the batch processing nature of percpu_counter.

So, we might run into issues if we try and make normalized percentages of user and system.

sorantis commented 3 years ago

Thanks for the analysis @fearful-symmetry. I'm trying to understand whether our approach is to convert user and kernel from the converted nanoseconds to pct, or from the USER_HZ to pct. Will both cases lead to inaccurate numbers for pct?

Also, for the two mentioned caveats:

Does it mean that the wrong values can be observed on 32bit systems only?
At what sampling intervals can these values become outdated? Are we talking about real time monitoring or even at say 10-15 second interval?

fearful-symmetry commented 3 years ago

I'm trying to understand whether our approach is to convert user and kernel from the converted nanoseconds to pct, or from the USER_HZ to pct. Will both cases lead to inaccurate numbers for pct?

The issue is more that the cpuacct API at /sys/fs/cgroup reports metrics in two formats. For other CPU usage data, like /proc/stat and /proc/[PID]/stat, which is where we get our CPU usage metrics for system/cpu and system/process, the entire set of metrics is reported in USER_HZ. Converting these to nanoseconds or percentages isn't an issue, as the kernel provides APIs to make the math more reliable (_SC_CLK_TCK).

The issue is that we already get some metrics in nanoseconds and some not, so I think we're going to get some interesting math. For example, to emulate how system/cpu does percents using the CPU total, if we calculated user and system based on your example above, we'd get 59% and 34%, which doesn't quite add up to 100%. We can emulate system/process instead, and use the nanoseconds elapsed between collection intervals as a stand-in for totals, but I wonder if this will result in some slight discrepancies between the numbers for system and user and the numbers for everything else. The latter method is probably better, and we just might want to put some disclaimers in the docs.
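The arithmetic behind both options can be sketched in Python as follows. This is an illustration only, not the implementation that was merged: the function names are hypothetical, and the two samples used for the delta-based variant are made-up numbers, not data from the customer.

```python
def share_of_total(part_ns, total_ns):
    """Option 1: ratio of a cumulative counter to the cumulative cgroup total."""
    return part_ns / total_ns


def delta_pct(prev_ns, curr_ns, elapsed_ns):
    """Option 2: CPU time consumed between two samples over wall-clock time."""
    return (curr_ns - prev_ns) / elapsed_ns


# Cumulative values from the example event in the issue description.
total_ns = 25500592581655
user_ns = 15226370000000
system_ns = 8906990000000

user_share = share_of_total(user_ns, total_ns)      # roughly 0.597
system_share = share_of_total(system_ns, total_ns)  # roughly 0.349
# The two shares sum to roughly 0.946, not 1.0, because user and system are
# tick-based while the total is counted in nanoseconds.
unaccounted = 1.0 - (user_share + system_share)

# Hypothetical samples taken 10 seconds apart (illustrative numbers only).
prev_sample_ns = 1_000_000_000
curr_sample_ns = 1_500_000_000
period_ns = 10 * 1_000_000_000
usage = delta_pct(prev_sample_ns, curr_sample_ns, period_ns)  # 0.05, i.e. 5%
```

The delta-based variant needs two samples, so the first collected event cannot carry a meaningful percentage.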

sorantis commented 3 years ago

For context, I assume the customer used this snapshot of Metricbeat. Correct me if I'm wrong @liladler

liladler commented 3 years ago

this is the link the customer has tried.

fearful-symmetry commented 3 years ago

@liladler how many process events did the system collect? The very first event that metricbeat collects will have the percentages set to zero, as it needs processes across time to create a percentage. Also, can I get the entire event?

liladler commented 3 years ago

@fearful-symmetry By now the system collected millions of documents, this is a full event -

          "event" : {
            "dataset" : "system.process",
            "duration" : 460504378,
            "module" : "system"
          },
          "env" : "tier1",
          "@timestamp" : "2021-04-28T12:32:20.401Z",
          "logstash" : {
            "tier1" : "flt025547",
            "tier2" : "flt031502"
          },
          "system" : {
            "process" : {
              "state" : "sleeping",
              "cmdline" : "/usr/sbin/nscd",
              "cgroup" : {
                "blkio" : {
                  "id" : "lxv1394",
                  "total" : {
                    "ios" : 196713,
                    "bytes" : 1138053120
                  },
                  "path" : "/lxc/lxv1394"
                },
                "id" : "lxv1394",
                "cpuacct" : {
                  "id" : "lxv1394",
                  "percpu" : {
                    "5" : 4468439526456,
                    "1" : 4246731060274,
                    "4" : 4944152156744,
                    "6" : 4296676679369,
                    "3" : 5297998803640,
                    "2" : 5326331389417
                  },
                  "stats" : {
                    "system" : {
                      "ns" : 14492870000000,
                      "pct" : 0.001,
                      "norm" : {
                        "pct" : 2.0E-4
                      }
                    },
                    "user" : {
                      "ns" : 12603240000000,
                      "pct" : 0.002,
                      "norm" : {
                        "pct" : 3.0E-4
                      }
                    }
                  },
                  "total" : {
                    "ns" : 28580329615900,
                    "pct" : 0.004,
                    "norm" : {
                      "pct" : 7.0E-4
                    }
                  },
                  "path" : "/lxc/lxv1394"
                },
                "cpu" : {
                  "id" : "lxv1394",
                  "rt" : {
                    "period" : {
                      "us" : 1000000
                    },
                    "runtime" : {
                      "us" : 0
                    }
                  },
                  "stats" : {
                    "periods" : 0,
                    "throttled" : {
                      "ns" : 0,
                      "periods" : 0
                    }
                  },
                  "cfs" : {
                    "shares" : 1024,
                    "quota" : {
                      "us" : 0
                    },
                    "period" : {
                      "us" : 100000
                    }
                  },
                  "path" : "/lxc/lxv1394"
                },
                "memory" : {
                  "mem" : {
                    "failures" : 0,
                    "limit" : {
                      "bytes" : 9223372036854771712
                    },
                    "usage" : {
                      "max" : {
                        "bytes" : 1065435136
                      },
                      "bytes" : 992104448
                    }
                  },
                  "stats" : {
                    "page_faults" : 1200998712,
                    "unevictable" : {
                      "bytes" : 0
                    },
                    "pages_in" : 225533559,
                    "inactive_anon" : {
                      "bytes" : 199987200
                    },
                    "hierarchical_memory_limit" : {
                      "bytes" : 9223372036854771712
                    },
                    "active_anon" : {
                      "bytes" : 478814208
                    },
                    "inactive_file" : {
                      "bytes" : 113795072
                    },
                    "rss" : {
                      "bytes" : 57786368
                    },
                    "swap" : {
                      "bytes" : 0
                    },
                    "hierarchical_memsw_limit" : {
                      "bytes" : 9223372036854771712
                    },
                    "rss_huge" : {
                      "bytes" : 8388608
                    },
                    "cache" : {
                      "bytes" : 934318080
                    },
                    "pages_out" : 256972324,
                    "major_page_faults" : 1496,
                    "mapped_file" : {
                      "bytes" : 11005952
                    },
                    "active_file" : {
                      "bytes" : 199507968
                    }
                  },
                  "kmem_tcp" : {
                    "failures" : 0,
                    "limit" : {
                      "bytes" : 9223372036854771712
                    },
                    "usage" : {
                      "max" : {
                        "bytes" : 0
                      },
                      "bytes" : 0
                    }
                  },
                  "kmem" : {
                    "failures" : 0,
                    "limit" : {
                      "bytes" : 9223372036854771712
                    },
                    "usage" : {
                      "max" : {
                        "bytes" : 0
                      },
                      "bytes" : 0
                    }
                  },
                  "id" : "lxv1394",
                  "memsw" : {
                    "failures" : 0,
                    "limit" : {
                      "bytes" : 9223372036854771712
                    },
                    "usage" : {
                      "max" : {
                        "bytes" : 1065435136
                      },
                      "bytes" : 992104448
                    }
                  },
                  "path" : "/lxc/lxv1394"
                },
                "path" : "/lxc/lxv1394"
              },
              "fd" : {
                "limit" : {
                  "soft" : 1024,
                  "hard" : 4096
                },
                "open" : 12
              },
              "cpu" : {
                "start_time" : "2021-03-23T12:50:41.000Z",
                "total" : {
                  "value" : 128970,
                  "pct" : 0,
                  "norm" : {
                    "pct" : 0
                  }
                }
              },
              "memory" : {
                "size" : 605614080,
                "share" : 1269760,
                "rss" : {
                  "pct" : 1.0E-4,
                  "bytes" : 1998848
                }
              }
            }
          },
          "host" : {
            "name" : "lx00590"
          },
          "@version" : "1",
          "fields" : {
            "elastic_index" : "demo"
          },
          "tags" : [
            "beats_input_raw_event"
          ],
          "metricset" : {
            "name" : "process",
            "period" : 10000
          },
          "agent" : {
            "hostname" : "lx00590",
            "name" : "lx00590",
            "id" : "fc1b3dfe-79b8-4608-bdb6-52aadee95b32",
            "version" : "7.13.0",
            "ephemeral_id" : "7bd1a27f-4962-4ba8-a334-a32bd34ab60e",
            "type" : "metricbeat"
          },
          "process" : {
            "state" : "sleeping",
            "command_line" : "/usr/sbin/nscd",
            "pgid" : 40965,
            "args" : [
              "/usr/sbin/nscd"
            ],
            "cpu" : {
              "start_time" : "2021-03-23T12:50:41.000Z",
              "pct" : 0
            },
            "name" : "nscd",
            "ppid" : 40886,
            "working_directory" : "/",
            "pid" : 40965,
            "executable" : "/usr/sbin/nscd",
            "memory" : {
              "pct" : 1.0E-4
            }
          },
          "user" : {
            "name" : "nscd"
          },
          "ecs" : {
            "version" : "1.9.0"
          },
          "service" : {
            "type" : "system"
          },
          "type" : "metricbeat",
          "protocol" : "tcp"
        }

fearful-symmetry commented 3 years ago

@liladler based on what you sent me, I'm having a hard time telling if something is wrong:

                  "total" : {
                    "ns" : 28580329615900,
                    "pct" : 0.004,
                    "norm" : {
                      "pct" : 7.0E-4
                    }
                  },

A cpuacct usage of 0.4% for a random background process seems pretty normal. The normalized percentage is derived from the CPU count: the usage is "normalized" by dividing it by the number of CPUs, so we get 0.004/6 = ~0.0007, or 0.07%. Can we try filtering/sorting the processes by CPU usage and seeing if the numbers seem a bit more normal? Alternatively, are any events reporting a usage that's actually 0?
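As a quick check of that arithmetic, here is a small Python sketch that takes the CPU count from the percpu keys in the event and divides each reported pct by it. Rounding to four decimal places is an assumption inferred from the values shown in the event, and `normalize_pct` is a hypothetical helper name.

```python
def normalize_pct(pct, cpu_count, digits=4):
    """Divide a usage ratio by the CPU count; the rounding is an assumption."""
    return round(pct / cpu_count, digits)


# percpu keys and pct values copied from the event posted above.
percpu = {
    "1": 4246731060274,
    "2": 5326331389417,
    "3": 5297998803640,
    "4": 4944152156744,
    "5": 4468439526456,
    "6": 4296676679369,
}
cpu_count = len(percpu)  # 6

total_norm = normalize_pct(0.004, cpu_count)   # 0.0007, matches total.norm.pct
user_norm = normalize_pct(0.002, cpu_count)    # 0.0003, matches user.norm.pct
system_norm = normalize_pct(0.001, cpu_count)  # 0.0002, matches system.norm.pct
```

All three results agree with the norm.pct fields in the event, which suggests the reported values are consistent with each other.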

fearful-symmetry commented 3 years ago

Considering that all the relevant PRs have been merged, do we want to close this issue?

sorantis commented 3 years ago

@liladler has the customer tried the recommendation? Everything seems to be working in order. If there are no further questions from the customer then we'll close the issue.

elasticmachine commented 2 years ago

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

jlind23 commented 2 years ago

@fearful-symmetry is this issue still relevant following all the refactors you did?

fearful-symmetry commented 2 years ago

@jlind23 looks like all the changes have been merged, we should be able to close this.

elastic / beats

[Metricbeat] Normalize cgroup CPU data #23391