elastic / elastic-agent

Elastic Agent - a single, unified way to add monitoring for logs, metrics, and other types of data to a host.

Agent sporadically reports high CPU for constituent processes #3825

Open. ycombinator opened this issue 9 months ago

ycombinator commented 9 months ago


We're not sure of the steps to reproduce, but a user reported that an Agent on the Fleet Agents listing page showed a CPU utilization percentage in the thousands.

[Screenshot: high_cpu_agent]

Upon further investigation, it was revealed that one of the Filebeat components run by the Agent would occasionally report an unusually high value in the CPU ticks field. This field is supposed to be a counter metric, so the ES query in Fleet performs a derivative aggregation on it. When an unusually high value appears in the field, the derivative (change) from the previous value in the time series is correspondingly high.
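For illustration only, a minimal sketch of that kind of derivative aggregation (Dev Tools syntax; the index pattern and exact field path here are my assumptions, not Fleet's actual query):

    POST metrics-*/_search
    {
      "size": 0,
      "aggs": {
        "per_interval": {
          "date_histogram": { "field": "@timestamp", "fixed_interval": "30s" },
          "aggs": {
            "cpu_ticks": { "max": { "field": "beat.cpu.total.ticks" } },
            "cpu_delta": { "derivative": { "buckets_path": "cpu_ticks" } }
          }
        }
      }
    }

A single bad sample in one bucket produces a huge cpu_delta in the following bucket, which is consistent with the sporadic spikes reported above.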

elasticmachine commented 9 months ago

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

pierrehilbert commented 9 months ago

@AndersonQ you are currently working on an issue related to this topic. Don't hesitate to add more information here to help solve this problem.

AndersonQ commented 9 months ago

EDIT: see https://github.com/elastic/elastic-agent/issues/3825#issuecomment-1830485466

I've seen something similar. Using the following command to parse the logs from a diagnostics bundle, I got results showing the same behaviour: the Beat occasionally reports an unusually high value in the cpu.total.ticks field of the "Non-zero metrics in the last 30s" log entry.

cat logs/elastic-agent-a92ca3/elastic-agent-* | grep -e '"component":{"binary":"filebeat"' | grep -e 'Non-zero metrics in the last 30s' | grep -e '},"total":{"ticks"' | jq -r '[.["@timestamp"], .component.binary, .monitoring.metrics.beat.cpu.total.ticks] | @tsv'

Here are a few entries showing the oddly high values:

2023-11-23T15:16:52.265Z    filebeat    14546
2023-11-23T15:17:22.239Z    filebeat    128218
2023-11-23T15:17:22.239Z    filebeat    94671
2023-11-23T15:17:22.267Z    filebeat    14624
2023-11-23T15:17:52.235Z    filebeat    94889
2023-11-23T15:17:52.235Z    filebeat    128796
2023-11-23T15:17:52.274Z    filebeat    14734
2023-11-23T15:18:22.235Z    filebeat    129171
2023-11-23T15:18:22.235Z    filebeat    95171
2023-11-23T15:18:22.265Z    filebeat    14765

cmacknz commented 9 months ago

The increased value seems too low for it to be the usual suspect of an unhandled counter wraparound (we'd expect a value closer to the uint32 or uint64 max).

The units of ticks are platform-dependent CPU time; I think this is the same field reported by the system process metricset:

https://github.com/elastic/beats/blob/6080d9b7b3c5cc81782e06336f66a4ffc507eb03/metricbeat/module/system/process/_meta/fields.yml#L81-L84

        - name: total.ticks
          type: long
          description: >
            The total CPU time spent by the process.

So my first question would be whether the increase in ticks is actually wrong when this happens, or whether it is properly accounting for a brief increase in CPU time and the way we calculate the CPU percentage is wrong.

I think the beat.cpu.total.ticks value comes from the elastic-agent-system-metrics package, set up here:

https://github.com/elastic/beats/blob/6080d9b7b3c5cc81782e06336f66a4ffc507eb03/libbeat/cmd/instance/beat.go#L353-L356

This then ends up in https://github.com/elastic/elastic-agent-system-metrics/blob/dd6d8b775490db422595659bad9bd5bbb4360247/metric/system/process/process.go#L154-L165

On Linux, I think the total CPU ticks value is eventually calculated in https://github.com/elastic/elastic-agent-system-metrics/blob/dd6d8b775490db422595659bad9bd5bbb4360247/metric/system/process/process_linux_common.go#L372-L376

    // convert to milliseconds from USER_HZ
    // This effectively means our definition of "ticks" throughout the process code is a millisecond
    state.User.Ticks = opt.UintWith(user * (1000 / ticks))
    state.System.Ticks = opt.UintWith(sys * (1000 / ticks))
    state.Total.Ticks = opt.UintWith(opt.SumOptUint(state.User.Ticks, state.System.Ticks))
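
For reference, with the typical Linux USER_HZ of 100 that conversion works out to 10 ms of CPU time per tick. A minimal standalone sketch of the same arithmetic (illustrative values, not the agent's actual code):

    package main

    import "fmt"

    func main() {
        const userHz = 100              // typical Linux USER_HZ (assumed; the real code obtains it from the system)
        const msPerTick = 1000 / userHz // 10 ms of CPU time per clock tick

        // Raw utime/stime tick counts as read from /proc/<pid>/stat (illustrative values).
        userTicks, sysTicks := uint64(1200), uint64(300)

        userMs := userTicks * msPerTick // 12000 ms
        sysMs := sysTicks * msPerTick   // 3000 ms
        fmt.Printf("user=%dms sys=%dms total=%dms\n", userMs, sysMs, userMs+sysMs)
    }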

CC @fearful-symmetry in case they have seen this before.

AndersonQ commented 9 months ago

My bad: in my earlier comment I mixed together several Filebeat processes. The correct command should either filter by the component ID:

cat logs/elastic-agent-a92ca3/elastic-agent-* | grep -e '"id":"log-default"' | grep -e 'Non-zero metrics in the last 30s' | jq -r '[.["@timestamp"], .component.id, .monitoring.metrics.beat.cpu.total.ticks] | @tsv'

2023-11-23T15:15:22.241Z    log-default 93421
2023-11-23T15:15:52.248Z    log-default 93702
2023-11-23T15:16:22.254Z    log-default 94140
2023-11-23T15:16:52.235Z    log-default 94406
2023-11-23T15:17:22.239Z    log-default 94671
2023-11-23T15:17:52.235Z    log-default 94889
2023-11-23T15:18:22.235Z    log-default 95171

Or display the component.id instead of the component.binary:

cat logs/elastic-agent-a92ca3/elastic-agent-* | grep -e '"component":{"binary":"filebeat"' | grep -e 'Non-zero metrics in the last 30s' | jq -r '[.["@timestamp"], .component.id, .monitoring.metrics.beat.cpu.total.ticks] | @tsv'

2023-11-23T15:17:22.239Z    log-default 94671
2023-11-23T15:17:22.267Z    filestream-monitoring   14624
2023-11-23T15:17:52.235Z    log-default 94889
2023-11-23T15:17:52.235Z    winlog-default  128796
2023-11-23T15:17:52.274Z    filestream-monitoring   14734
2023-11-23T15:18:22.235Z    winlog-default  129171
2023-11-23T15:18:22.235Z    log-default 95171
2023-11-23T15:18:22.265Z    filestream-monitoring   14765

This indicates that my data seems to be related to https://github.com/elastic/kibana/issues/172004 rather than to this issue.

fearful-symmetry commented 9 months ago

Is this just a normalization thing? A lot of the CPU metrics are non-normalized, which means that it's ticks across all CPU cores; for example, on an 8-core system, we can get a maximum CPU utilization of 800%. I wonder if the fact that we're using non-normalized ticks plus summing metrics from multiple processes over a given period of time just produces some funky numbers.
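
A quick worked example under the millisecond definition of ticks above (numbers mine): a process that fully uses 4 of 8 cores for a 30 s window accrues 4 × 30,000 = 120,000 ticks, so the non-normalized reading for that window is 120,000 / 30,000 = 400%.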

AndersonQ commented 9 months ago


Oh, I didn't know that; I was assuming it could only reach 100%. That could very well explain what we're seeing here.

RaphSimard commented 9 months ago

I have the same issue as the user you mentioned. I am running Elastic Agent 8.10.4. I see Linux and Windows agents with 5000% CPU.

fearful-symmetry commented 9 months ago

The number is definitely reported in a way that's a tad confusing. At minimum, the fact that it aggregates values from the Agent and the Beats under it is going to produce some weird values that are greater than 100. I feel like we should rethink the usefulness of this number; perhaps the UI should break down CPU usage by Beat?

cmacknz commented 8 months ago

We should be presenting a normalized value between 0-100 regardless of the number of cores or sub-processes that are running for the agent overall. Doing anything else makes comparing the CPU usage between agents impossible.

Breaking down the CPU allocation per input would also be valuable, which should also be a normalized number between 0-100 regardless of the number of cores on the machine.
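
As a rough illustration of the difference (a minimal sketch under the millisecond-tick definition above, not the agent's actual reporting code):

    package main

    import (
        "fmt"
        "runtime"
    )

    // cpuPercent computes CPU usage from two samples of total ticks
    // (milliseconds of CPU time) taken windowMs apart.
    func cpuPercent(prevTicks, currTicks, windowMs uint64, normalized bool) float64 {
        delta := float64(currTicks - prevTicks) // ms of CPU time consumed during the window
        pct := delta / float64(windowMs) * 100  // non-normalized: can exceed 100 on multi-core hosts
        if normalized {
            pct /= float64(runtime.NumCPU()) // scale to 0-100 regardless of core count
        }
        return pct
    }

    func main() {
        // A process that fully used 4 cores for a 30 s window on an 8-core host.
        prev, curr := uint64(0), uint64(4*30000)
        fmt.Printf("non-normalized: %.0f%%\n", cpuPercent(prev, curr, 30000, false)) // 400%
        fmt.Printf("normalized:     %.0f%%\n", cpuPercent(prev, curr, 30000, true))  // 50% on an 8-core host
    }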

fearful-symmetry commented 8 months ago

Seconding the idea of breaking it down per-input.

I guess normalizing per-input would be possible too. However, at a certain point I wonder if we're going to get so creative with massaging this number that we'll just end up confusing people in a totally different way. I feel like there has to be a happy medium here between over-simplifying and cluttering up the UI.

I've run into this problem a lot with memory usage metrics over the years. Simple percentages abstract a lot of the complexity in how an OS manages and "views" memory, which will always create edge cases where a certain memory usage profile produces a number that is deceptive or mismatches the user's mental model.