przemek-grzedzielski closed this issue 2 years ago
Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)
@fearful-symmetry does anything jump out at you as the cause of this one?
So, we chatted about this over Slack, and I suspect that now that the cgroups code has to deal with V1 and V2 processes, it's overly conservative about reporting the correct data. On hybrid systems that gives us interesting results, like IDs of `/` where it shouldn't report cgroups at all, or IDs being skipped entirely in cases where it doesn't see 100% consistent file paths.
I'm gonna go back and look at the pre-V2 code and see if I can figure out how it handled some of these edge cases here.
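To make the suspected failure mode concrete, here is a minimal sketch of deriving a cgroup ID from the contents of `/proc/<pid>/cgroup`. It is not the Beats implementation; the function names `parse_cgroup_file` and `cgroup_id` are hypothetical, and the rule "treat the root path `/` as no cgroup" is an assumption based on the behaviour described above.

```python
from typing import Dict, Optional


def parse_cgroup_file(text: str) -> Dict[str, str]:
    """Map each controller name to its cgroup path.

    Each line of /proc/<pid>/cgroup is "hierarchy-ID:controller-list:path".
    V1 example: "4:memory:/system.slice/docker.service"
    V2 example: "0::/" (empty controller list, stored under the key "").
    """
    paths: Dict[str, str] = {}
    for line in text.splitlines():
        parts = line.strip().split(":", 2)
        if len(parts) != 3:
            continue  # skip malformed or empty lines
        _, controllers, path = parts
        for controller in controllers.split(","):
            paths[controller] = path
    return paths


def cgroup_id(path: str) -> Optional[str]:
    """Return the last path component, or None for the root cgroup.

    Assumption: a process in the root cgroup ("/") should report no ID
    rather than the literal "/".
    """
    if path in ("", "/"):
        return None
    return path.rstrip("/").rsplit("/", 1)[-1]


sample = (
    "4:memory:/system.slice/docker.service\n"
    "3:cpu,cpuacct:/system.slice/docker.service\n"
    "0::/\n"
)
paths = parse_cgroup_file(sample)
print(cgroup_id(paths["memory"]))  # ID for a V1 controller path
print(cgroup_id(paths[""]))        # root path on the V2 line yields None
```

On a hybrid host the same process can have both V1 lines and a `0::/` V2 line, which is where a parser has to decide which path wins.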
@fearful-symmetry I think we can close this issue, right? I checked that the metrics are properly ingested by metricbeat 7.17.5 and the dashboard looks ok. Thanks a lot for fixing this.
Bug description
I observed in Elastic Cloud Enterprise that the `[Metricbeat System] Containers overview ECS` Kibana dashboard is displaying data that seems to be incorrect: the Container ID column contains lots of `/` values.
Looking at the documents in `metricbeat-*` indices, I noticed that the `system.process.cgroup.id` field contains either the mysterious `/` value (this happens to be the case for many system processes) or has an empty value (this happens to be the case for many user processes).
I compared documents ingested by beats 7.12 (ECE 3.0) with ones ingested by beats 7.17.1.
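One way to run this kind of comparison is to flatten each document's `_source` into dotted field names and check whether `system.process.cgroup.id` is present. The sketch below is illustrative only: `flatten` is a hypothetical helper, and the two sample dictionaries are heavily trimmed versions of the 7.17.1 example documents in this report.

```python
from typing import Any, Dict


def flatten(doc: Dict[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Flatten nested dicts into dotted keys, e.g. {"a": {"b": 1}} -> {"a.b": 1}."""
    flat: Dict[str, Any] = {}
    for key, value in doc.items():
        dotted = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, dotted))
        else:
            flat[dotted] = value
    return flat


# Trimmed from the dockerd document: IDs exist only per controller.
dockerd_source = {"system": {"process": {"cgroup": {
    "cpu": {"id": "docker.service", "path": "/system.slice/docker.service"},
    "memory": {"id": "docker.service", "path": "/system.slice/docker.service"},
    "cgroups_version": 1,
}}}}

# Trimmed from the kthreadd document: a top-level ID of "/".
kthreadd_source = {"system": {"process": {"cgroup": {
    "id": "/", "path": "/", "cgroups_version": 1,
}}}}

dockerd_flat = flatten(dockerd_source)
kthreadd_flat = flatten(kthreadd_source)

print("system.process.cgroup.id" in dockerd_flat)      # field is absent here
print(kthreadd_flat.get("system.process.cgroup.id"))   # field is the root path
```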
Documents displaying the "-" cgroup ID
Documents ingested by metricbeat 7.17.1
I noticed that these documents do not contain the flattened `system.process.cgroup.id` field even though they do contain a nested value in `{"system": {"process": {"cgroup": {"id": "xyz"}}}}`. My guess is that it's the flattened field that populates the dashboards.
Example document ingested by metricbeat 7.17.1
```json
{ "_index": "metricbeat-7.17.1-2022.03.23", "_type": "_doc", "_id": "lCmlt38B7-W6we8xymG0", "_version": 1, "_score": 1, "_source": { "@timestamp": "2022-03-23T16:39:28.628Z", "service": { "type": "system" }, "system": { "process": { "state": "sleeping", "cpu": { "total": { "pct": 0.0147, "norm": { "pct": 0.0037 }, "value": 1977490 }, "start_time": "2022-03-22T09:16:40.000Z" }, "cgroup": { "cpu": { "id": "docker.service", "path": "/system.slice/docker.service", "cfs": { "period": { "us": 100000 }, "quota": { "us": 0 }, "shares": 1024 }, "rt": { "period": { "us": 0 }, "runtime": { "us": 0 } }, "stats": { "periods": 0, "throttled": { "us": 0, "periods": 0 } } }, "cpuacct": { "path": "/system.slice/docker.service", "total": { "pct": 0.0258, "norm": { "pct": 0.0064 }, "ns": 2604894051076 }, "percpu": { "1": 652220788827, "2": 655150057935, "3": 647406003603, "4": 650117200711 }, "stats": { "user": { "ns": 1502960000000, "pct": 0.0107, "norm": { "pct": 0.0027 } }, "system": { "norm": { "pct": 0.0024 }, "ns": 713680000000, "pct": 0.0097 } }, "id": "docker.service" }, "memory": { "id": "docker.service", "path": "/system.slice/docker.service", "mem": { "limit": { "bytes": 9223372036854772000 }, "failures": 0, "usage": { "bytes": 479354880, "max": { "bytes": 9771409408 } } }, "memsw": { "usage": { "bytes": 479932416, "max": { "bytes": 9771569152 } }, "limit": { "bytes": 9223372036854772000 }, "failures": 0 }, "kmem": { "usage": { "bytes": 0, "max": { "bytes": 0 } }, "limit": { "bytes": 9223372036854772000 }, "failures": 0 }, "kmem_tcp": { "failures": 0, "usage": { "bytes": 0, "max": { "bytes": 0 } }, "limit": { "bytes": 9223372036854772000 } }, "stats": { "page_faults": 8550102, "hierarchical_memory_limit": { "bytes": 9223372036854772000 }, "cache": { "bytes": 330055680 }, "pages_in": 19444755, "inactive_file": { "bytes": 123813888 }, "major_page_faults": 495, "mapped_file": { "bytes": 59068416 }, "unevictable": { "bytes": 0 }, "inactive_anon": { "bytes": 48914432 },
"swap": { "bytes": 270336 }, "active_file": { "bytes": 230416384 }, "pages_out": 19335892, "rss": { "bytes": 149323776 }, "rss_huge": { "bytes": 2097152 }, "hierarchical_memsw_limit": { "bytes": 9223372036854772000 }, "active_anon": { "bytes": 75890688 } } }, "blkio": { "id": "docker.service", "path": "/system.slice/docker.service", "total": { "bytes": 8739508224, "ios": 935715 } }, "cgroups_version": 1 }, "memory": { "share": 48500736, "size": 2437496832, "rss": { "bytes": 127614976, "pct": 0.0077 } }, "cmdline": "/usr/bin/dockerd --data-root /mnt/data/docker -H unix:///var/run/docker.sock --ip-forward=true --iptables=true --ip-masq=true --icc=true --log-driver json-file --log-opt max-size=500m --log-opt max-file=10 -G docker --bip=172.17.42.1/16 --raw-logs --live-restore --storage-driver aufs" } }, "ece": { "runner": "192.168.44.10", "zone": "ece-zone-0", "roles": [ "coordinator", "director", "allocator", "proxy", "services-forwarder", "beats-runner" ] }, "agent": { "type": "metricbeat", "version": "7.17.1", "hostname": "ip-192-168-44-10", "ephemeral_id": "3f50049a-23bd-40bc-9d6d-ae97a043ffb0", "id": "f08a462a-26d3-4110-8c01-8ff272ec015e", "name": "192.168.44.10" }, "process": { "memory": { "pct": 0.0077 }, "state": "sleeping", "name": "dockerd", "args": [ "/usr/bin/dockerd", "--data-root", "/mnt/data/docker", "-H", "unix:///var/run/docker.sock", "--ip-forward=true", "--iptables=true", "--ip-masq=true", "--icc=true", "--log-driver", "json-file", "--log-opt", "max-size=500m", "--log-opt", "max-file=10", "-G", "docker", "--bip=172.17.42.1/16", "--raw-logs", "--live-restore", "--storage-driver", "aufs" ], "command_line": "/usr/bin/dockerd --data-root /mnt/data/docker -H unix:///var/run/docker.sock --ip-forward=true --iptables=true --ip-masq=true --icc=true --log-driver json-file --log-opt max-size=500m --log-opt max-file=10 -G docker --bip=172.17.42.1/16 --raw-logs --live-restore --storage-driver aufs", "ppid": 1, "pgid": 1613, "cpu": { "start_time":
"2022-03-22T09:16:40.000Z", "pct": 0.0037 }, "pid": 1613 }, "metricset": { "name": "process", "period": 30000 }, "ecs": { "version": "1.12.0" }, "host": { "name": "192.168.44.10" }, "cloud": { "image": { "id": "ami-06603c22a3ef0c326" }, "provider": "aws", "instance": { "id": "i-02bd40008e7e371cf" }, "machine": { "type": "m6i.xlarge" }, "region": "us-east-1", "availability_zone": "us-east-1a", "service": { "name": "EC2" }, "account": { "id": "444732909647" } }, "container": { "image": { "name": "docker.elastic.co/cloud-ci/elastic-cloud-enterprise:3.2.0-git-9bf8f7d8bcd1012e9c56351e78897586d634f3a9" }, "name": "frc-beats-runners-beats-runner", "labels": { "co_elastic_cloud_runner_id": "192.168.44.10", "org_label-schema_build-date": "20201113", "org_opencontainers_image_vendor": "Elastic", "co_elastic_cloud_runner_role": "beats-runner", "co_elastic_cloud_runner_container_name": "beats-runner", "org_label-schema_vcs-ref": "9bf8f7d8bcd1012e9c56351e78897586d634f3a9", "org_label-schema_name": "elastic-cloud-enterprise", "org_label-schema_license": "", "org_label-schema_schema-version": "1.0", "co_elastic_cloud_runner_zone": "ece-zone-0", "org_opencontainers_image_title": "elastic-cloud-enterprise", "co_elastic_cloud_runner_container_set": "beats-runners", "org_opencontainers_image_licenses": "", "co_elastic_vcs-branch": "PR-99348", "org_opencontainers_image_created": "2020-11-13 00:00:00+00:00", "org_label-schema_vendor": "Elastic", "maintainer": "Cloud Enterprise Developers
```
Documents ingested by metricbeat 7.12
Documents ingested by metricbeat 7.12, on the other hand, contain the flattened `system.process.cgroup.id` field.
Example document ingested by metricbeat 7.12 (corresponding to the same process as above)
```json
{ "_index": "metricbeat-7.12.0-2022.03.23", "_type": "_doc", "_id": "B-WQt38BD8PJXavuELOU", "_version": 1, "_score": 1, "_source": { "@timestamp": "2022-03-23T16:15:45.751Z", "event": { "dataset": "system.process", "module": "system", "duration": 183817657 }, "ece": { "roles": [ "coordinator", "director", "allocator", "proxy", "services-forwarder", "beats-runner" ], "runner": "192.168.44.10", "zone": "ece-zone-0" }, "host": { "name": "192.168.44.10" }, "agent": { "type": "metricbeat", "version": "7.12.0", "hostname": "ip-192-168-44-10", "ephemeral_id": "de5023cd-4254-4ae1-b504-eff772b10254", "id": "7260ff8b-8b9d-426e-9c04-6339d74ddce1", "name": "192.168.44.10" }, "cloud": { "machine": { "type": "m6i.xlarge" }, "provider": "aws", "region": "us-east-1", "availability_zone": "us-east-1a", "account": { "id": "444732909647" }, "image": { "id": "ami-06603c22a3ef0c326" }, "instance": { "id": "i-02bd40008e7e371cf" } }, "container": { "id": "3cd22fbb3ea16878ff5ede31d28a8e4339e8f7c29a422b1c64a1e6159641f351", "image": { "name": "docker.elastic.co/cloud-enterprise/elastic-cloud-enterprise:3.0.0" }, "name": "frc-beats-runners-beats-runner", "labels": { "co_elastic_cloud_runner_container_set": "beats-runners", "co_elastic_cloud_runner_id": "192.168.44.10", "org_label-schema_name": "elastic-cloud-enterprise", "org_opencontainers_image_title": "elastic-cloud-enterprise", "org_label-schema_license": "", "co_elastic_cloud_runner_zone": "ece-zone-0", "org_label-schema_schema-version": "1.0", "org_label-schema_vcs-ref": "84103666926870184cec05f19fd3fabe0f1e3673", "co_elastic_ci_build-tag": "3.0.0-BC_13", "co_elastic_vcs-tag": "3.0.0-BC_13", "org_label-schema_vendor": "Elastic", "org_label-schema_version": "3.0.0-BC_13", "org_label-schema_build-date": "20201113", "maintainer": "Cloud Enterprise Developers
```
The cgroup file in both cases contains the following:
Documents displaying the "/" cgroup ID
Document ingested by metricbeat 7.17.1
Example document
```json
{ "_index": "metricbeat-7.17.1-2022.03.23", "_type": "_doc", "_id": "OSmlt38B7-W6we8xymG0", "_version": 1, "_score": 1, "_source": { "@timestamp": "2022-03-23T16:39:28.628Z", "event": { "module": "system", "duration": 577325260, "dataset": "system.process" }, "metricset": { "name": "process", "period": 30000 }, "service": { "type": "system" }, "system": { "process": { "state": "sleeping", "cgroup": { "id": "/", "path": "/", "cgroups_version": 1 }, "memory": { "size": 0, "rss": { "bytes": 0, "pct": 0 }, "share": 0 }, "cpu": { "total": { "value": 10, "pct": 0, "norm": { "pct": 0 } }, "start_time": "2022-03-22T09:15:41.000Z" } } }, "cloud": { "account": { "id": "444732909647" }, "image": { "id": "ami-06603c22a3ef0c326" }, "provider": "aws", "instance": { "id": "i-02bd40008e7e371cf" }, "machine": { "type": "m6i.xlarge" }, "region": "us-east-1", "availability_zone": "us-east-1a", "service": { "name": "EC2" } }, "process": { "cpu": { "start_time": "2022-03-22T09:15:41.000Z", "pct": 0 }, "memory": { "pct": 0 }, "name": "kthreadd", "pid": 2, "ppid": 0, "pgid": 0, "state": "sleeping" }, "user": { "name": "root" }, "ece": { "runner": "192.168.44.10", "zone": "ece-zone-0", "roles": [ "coordinator", "director", "allocator", "proxy", "services-forwarder", "beats-runner" ] }, "ecs": { "version": "1.12.0" }, "host": { "name": "192.168.44.10" }, "agent": { "type": "metricbeat", "version": "7.17.1", "hostname": "ip-192-168-44-10", "ephemeral_id": "3f50049a-23bd-40bc-9d6d-ae97a043ffb0", "id": "f08a462a-26d3-4110-8c01-8ff272ec015e", "name": "192.168.44.10" } }, "fields": { "system.process.cpu.total.norm.pct": [ 0 ], "system.process.memory.rss.pct": [ 0 ], "process.name.text": [ "kthreadd" ], "system.process.cpu.total.value": [ 10 ], "process.pid": [ 2 ], "cloud.availability_zone": [ "us-east-1a" ], "service.type": [ "system" ], "mongodb.status.process": [ "kthreadd" ], "system.process.memory.size": [ 0 ], "agent.name": [ "192.168.44.10" ], "host.name": [ "192.168.44.10" ],
"cloud.region": [ "us-east-1" ], "ece.runner": [ "192.168.44.10" ], "process.ppid": [ 0 ], "agent.hostname": [ "ip-192-168-44-10" ], "process.name": [ "kthreadd" ], "cloud.machine.type": [ "m6i.xlarge" ], "cloud.provider": [ "aws" ], "agent.id": [ "f08a462a-26d3-4110-8c01-8ff272ec015e" ], "cloud.service.name": [ "EC2" ], "ecs.version": [ "1.12.0" ], "agent.version": [ "7.17.1" ], "process.memory.pct": [ 0 ], "ece.zone": [ "ece-zone-0" ], "user.name": [ "root" ], "system.process.cgroup.path": [ "/" ], "process.state": [ "sleeping" ], "cloud.instance.id": [ "i-02bd40008e7e371cf" ], "agent.type": [ "metricbeat" ], "event.module": [ "system" ], "system.process.cpu.start_time": [ "2022-03-22T09:15:41.000Z" ], "metricset.period": [ 30000 ], "system.process.memory.share": [ 0 ], "system.process.cgroup.id": [ "/" ], "process.pgid": [ 0 ], "system.process.cpu.total.pct": [ 0 ], "system.process.cgroup.cgroups_version": [ 1 ], "metricset.name": [ "process" ], "event.duration": [ 577325260 ], "cloud.image.id": [ "ami-06603c22a3ef0c326" ], "system.process.memory.rss.bytes": [ 0 ], "@timestamp": [ "2022-03-23T16:39:28.628Z" ], "process.cpu.start_time": [ "2022-03-22T09:15:41.000Z" ], "cloud.account.id": [ "444732909647" ], "ece.roles": [ "coordinator", "director", "allocator", "proxy", "services-forwarder", "beats-runner" ], "system.process.state": [ "sleeping" ], "agent.ephemeral_id": [ "3f50049a-23bd-40bc-9d6d-ae97a043ffb0" ], "process.cpu.pct": [ 0 ], "event.dataset": [ "system.process" ], "user.name.text": [ "root" ] } }
```
Document ingested by metricbeat 7.12
With metricbeat 7.12, no documents contained the `/` cgroup ID. Most system processes just had no value there (`-`).
Example document for the same process as above
```json
{ "_index": "metricbeat-7.12.0-2022.03.23", "_type": "_doc", "_id": "c7jgt38BExF76hNc3WfU", "_version": 1, "_score": 1, "_source": { "@timestamp": "2022-03-23T17:44:01.284Z", "event": { "duration": 186129521, "dataset": "system.process", "module": "system" }, "service": { "type": "system" }, "host": { "name": "192.168.44.10" }, "ecs": { "version": "1.8.0" }, "cloud": { "account": { "id": "444732909647" }, "provider": "aws", "image": { "id": "ami-06603c22a3ef0c326" }, "instance": { "id": "i-02bd40008e7e371cf" }, "machine": { "type": "m6i.xlarge" }, "region": "us-east-1", "availability_zone": "us-east-1a" }, "process": { "ppid": 0, "pgid": 0, "state": "sleeping", "cpu": { "start_time": "2022-03-22T09:15:41.000Z", "pct": 0 }, "memory": { "pct": 0 }, "name": "kthreadd", "pid": 2 }, "user": { "name": "root" }, "metricset": { "period": 30000, "name": "process" }, "system": { "process": { "state": "sleeping", "memory": { "share": 0, "size": 0, "rss": { "pct": 0, "bytes": 0 } }, "cpu": { "total": { "value": 20, "pct": 0, "norm": { "pct": 0 } }, "start_time": "2022-03-22T09:15:41.000Z" } } }, "ece": { "zone": "ece-zone-0", "roles": [ "coordinator", "director", "allocator", "proxy", "services-forwarder", "beats-runner" ], "runner": "192.168.44.10" }, "agent": { "ephemeral_id": "4edd277e-aa86-4234-892d-bb706595876a", "id": "3e91381e-0e4c-45ea-89e6-b5a910ef395e", "name": "192.168.44.10", "type": "metricbeat", "version": "7.12.0", "hostname": "ip-192-168-44-10" } }, "fields": { "process.memory.pct": [ 0 ], "system.process.cpu.total.norm.pct": [ 0 ], "system.process.memory.rss.pct": [ 0 ], "process.name.text": [ "kthreadd" ], "system.process.cpu.total.value": [ 20 ], "ece.zone": [ "ece-zone-0" ], "user.name": [ "root" ], "process.pid": [ 2 ], "process.state": [ "sleeping" ], "cloud.availability_zone": [ "us-east-1a" ], "service.type": [ "system" ], "cloud.instance.id": [ "i-02bd40008e7e371cf" ], "mongodb.status.process": [ "kthreadd" ], "agent.type": [ "metricbeat" ],
"system.process.memory.size": [ 0 ], "event.module": [ "system" ], "agent.name": [ "192.168.44.10" ], "host.name": [ "192.168.44.10" ], "system.process.cpu.start_time": [ "2022-03-22T09:15:41.000Z" ], "cloud.region": [ "us-east-1" ], "ece.runner": [ "192.168.44.10" ], "process.ppid": [ 0 ], "metricset.period": [ 30000 ], "system.process.memory.share": [ 0 ], "agent.hostname": [ "ip-192-168-44-10" ], "process.pgid": [ 0 ], "system.process.cpu.total.pct": [ 0 ], "metricset.name": [ "process" ], "event.duration": [ 186129521 ], "cloud.image.id": [ "ami-06603c22a3ef0c326" ], "system.process.memory.rss.bytes": [ 0 ], "process.name": [ "kthreadd" ], "cloud.machine.type": [ "m6i.xlarge" ], "cloud.provider": [ "aws" ], "@timestamp": [ "2022-03-23T17:44:01.284Z" ], "process.cpu.start_time": [ "2022-03-22T09:15:41.000Z" ], "agent.id": [ "3e91381e-0e4c-45ea-89e6-b5a910ef395e" ], "cloud.account.id": [ "444732909647" ], "ece.roles": [ "coordinator", "director", "allocator", "proxy", "services-forwarder", "beats-runner" ], "ecs.version": [ "1.8.0" ], "system.process.state": [ "sleeping" ], "agent.ephemeral_id": [ "4edd277e-aa86-4234-892d-bb706595876a" ], "agent.version": [ "7.12.0" ], "process.cpu.pct": [ 0 ], "event.dataset": [ "system.process" ], "user.name.text": [ "root" ] } }
```
Note that in 7.12 the document ^ contains no cgroup information whatsoever. The cgroup file contains just:
Config:
metricbeat.yml
```yaml
name: "192.168.44.10"

metricbeat.modules:
  - module: system # most metricsets, 30s
    metricsets:
      - cpu
      - diskio
      - load
      - memory
      - network
      - process
      - process_summary
      - socket_summary
    enabled: true
    period: 30s
  - module: system # fsstat metricset, 5m
    metricsets:
      - fsstat
      - filesystem
    enabled: true
    period: 5m
  - module: docker
    metricsets:
      - network
      - memory
      - cpu
      - container
      - info
    hosts: ["unix:///run/docker.sock"]
    enabled: true
    period: 1m
    labels.dedot: false

fields_under_root: true # affects these global fields only
fields:
  ece.runner: "192.168.44.10"
  ece.zone: "ece-zone-0"
  ece.roles: ${BEATS_RUNNER_ROLES}

metricbeat.autodiscover:
  providers:
    - type: docker
      labels.dedot: false
      templates:
        - condition:
            or:
              - equals:
                  docker.container.name: "frc-proxies-proxyv2"
          config:
            - module: prometheus
              metricsets:
                - collector
              period: 1m
              hosts: ["localhost:9000"]
              metrics_path: /metrics
              use_types: true

processors:
  - add_cloud_metadata: ~
  - add_docker_metadata:
      host: "unix:///run/docker.sock"
      match_fields: ["system.process.cgroup.id"]
      cleanup_timeout: 604800
  - rename:
      fields:
        - from: "docker.container.labels.co.elastic.cloud.allocator.cluster_id"
          to: "ece.cluster"
        - from: "docker.container.labels.co.elastic.cloud.allocator.zone"
          to: "ece.zone"
        - from: "docker.container.labels.co.elastic.cloud.allocator.instance_id"
          to: "ece.instance"
        - from: "docker.container.labels.co.elastic.cloud.allocator.kind"
          to: "ece.kind"
      ignore_missing: true
      fail_on_error: false
  - drop_fields:
      fields:
        - docker.container.command
        - docker.container.labels.author
        - docker.container.labels.co.elastic.ci
        - docker.container.labels.co.elastic.ci.build-tag
        - docker.container.labels.co.elastic.ci.worker
        - docker.container.labels.co.elastic.vcs-branch
        - docker.container.labels.org.label-schema.docker
        - docker.container.labels.org.label-schema.vendor
        - docker.container.labels.org.label-schema.build-date
        - docker.container.labels.org.label-schema.license
        - docker.container.labels.org.label-schema.name
        - docker.container.labels.org.label-schema.schema-version

logging.level: info
logging.to_files: true
logging.to_syslog: false
logging.files:
  path: /app/logs
  name: metricbeat.log

setup.ilm.enabled: false
setup.template:
  enabled: false
  overwrite: false
  name: "metricbeat-%{[agent.version]}"
  pattern: "metricbeat-%{[agent.version]}-*"

output.elasticsearch:
  hosts: ["localhost:9244"]
  username: "ece-logging-metrics-ingest"
  password: "xyz"
  headers:
    X-Found-Cluster: xyz
  bulk_max_size: 500 # upstream default 50
  index: "metricbeat-%{[agent.version]}-%{+yyyy.MM.dd}"
```
Version
Metricbeat 7.17.1
Operating System:
Observed in Elastic Cloud Enterprise, where metricbeat runs in a Docker container on an Ubuntu 18.04 host.