[Metricbeat][System]`system.process.state` reports `sleeping`

crespocarlos commented 7 months ago

Version: 8.12

Description of the problem including expected versus actual behavior: Various users have surfaced that processes show as 'sleeping' within the Host 'processes' view which is leading to customers not trusting the 'processes' functionality offered by Elastic.

Example customer issue Example customer issue

It seems that the majority of processes are reporting as 'sleeping':

Discover - breaking down processes by state (demo cluster&_a=(breakdownField:process.state,columns:!(),filters:!(),index:dccd1810-2016-11eb-8016-cf9f9e5961e9,interval:auto,query:(language:kuery,query:'process.state%20:%20*'),sort:!(!('@timestamp',desc)))))

Host processes ([demo cluster](https://demo.elastic.co/app/metrics/hosts?waffleTime=(currentTime:1708694008109,isAutoReloading:!f)&_a=(dateRange:(from:now-15m,to:now),filters:!(),limit:100,panelFilters:!(),query:(language:kuery,query:%27%27))&controlPanels=(cloud.provider:(explicitInput:(fieldName:cloud.provider,id:cloud.provider,title:%27Cloud%20Provider%27),grow:!f,order:1,type:optionsListControl,width:medium),host.os.name:(explicitInput:(fieldName:host.os.name,id:host.os.name,title:%27Operating%20System%27),grow:!f,order:0,type:optionsListControl,width:medium),service.name:(explicitInput:(fieldName:service.name,id:service.name,title:%27Service%20Name%27),grow:!f,order:2,type:optionsListControl,width:medium))&tableProperties=(detailsItemId:gke-demo-elastic-co-pool-5-29a9d3db-m49n-Ubuntu,pagination:(pageIndex:0,pageSize:5),sorting:(direction:desc,field:cpu))&assetDetails=(dateRange:(from:%272024-02-23T12:58:30.004Z%27,to:%272024-02-23T13:13:30.005Z%27),name:gke-demo-elastic-co-pool-5-29a9d3db-m49n,tabId:processes)))

Could we confirm:

What 'sleeping' means? The UI just looks at the system.process.state field
How is it calculated?
Is this working correctly? e.g. are most processes expected to show as 'sleeping'?

Steps to reproduce:

Spin up a host
Deploy Elastic Agent (or metricbeat)
Check the process.state fields being emitted in Discover (or the host experience&waffleTime=(currentTime:1708694008109,isAutoReloading:!f)))

Note : We'll try and get some direct customer feedback to help debug this

elasticmachine commented 7 months ago

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

elasticmachine commented 7 months ago

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

roshan-elastic commented 7 months ago

Sorry @tomgregoryelastic, we had to move this to another repo.

Are you able to provide any detail of the customer who is currently asking about this? Even better, they could directly communicate on this issue?

Anything to help us understand if this is a bug (and if so, what is the root cause)...

cmacknz commented 7 months ago

Sleeping is a valid Linux process state, and is therefore in the list of process states our system metrics code can report: https://github.com/elastic/elastic-agent-system-metrics/blob/01e9cf49607b993bb398e7925274fffcf5128304/metric/system/process/process_common.go#L139-L153

// PidStates is a Map of all pid states, mostly applicable to linux
var PidStates = map[byte]PidState{
    'S': Sleeping,
    'R': Running,
    'D': DiskSleep, // Waiting in uninterruptible disk sleep, on some kernels this is marked as I below
    'I': Idle,      // in the scheduler, TASK_IDLE is defined as (TASK_UNINTERRUPTIBLE | TASK_NOLOAD)
    'T': Stopped,
    'Z': Zombie,
    'X': Dead,
    'x': Dead,
    'K': WakeKill,
    'W': Waking,
    'P': Parked,
}

https://access.redhat.com/sites/default/files/attachments/processstates_20120831.pdf explains what sleeping means, here is part of the explanation:

A process enters a Sleeping state when it needs resources that are not currently available. At that point, it either goes voluntarily into Sleep state or the kernel puts it into Sleep state. Going into Sleep state means the process immediately gives up its access to the CPU. When the resource the process is waiting on becomes available, a signal is sent to the CPU. The next time the scheduler gets a chance to schedule this sleeping process, the scheduler will put the process either in Running or Runnable state. Here is an example of how a login shell goes in and out of sleep state: • You type a command and the shell goes into Sleep state and waits for an event to occur. • The shell process sleeps on a particular wait channel (WCHAN). • When an event occurs, such as an interrupt from the keyboard, every process waiting on that wait channel wakes up. To find out what wait channels processes are waiting on for your system, type ps -l (to see processes associated with the current shell) or ps -el (to see all processes on the system). If a process is in Sleep state, the WCHAN field shows the system call that the process is waiting on.

To find out if this is a bug, the first thing to check would be if other Linux tools like top also report the process state as S or sleeping.

If they do, then this is correct. If they don't, then this is something we can investigate. I should note that the process state will change over time, what is important is that the states we are reporting to Elasticsearch and Kibana match what was happening on the system at that time.

crespocarlos commented 5 months ago

@cmacknz there was recently an SDH that touched upon this very issue: https://github.com/elastic/sdh-kibana/issues/4598. There are more details in there, which might be useful to determine what might be happening.

jvalente-salemstate commented 5 months ago

I'm also encountering this. In almost all my tests I'm seeing a sleeping status on RHEL and Ubuntu with the only exception being a few zombie processes and one case of seeing a running process today. They're running when I check top and event Elastic Agent and its child processes show as sleeping. One of the agent processes was the one I did see running today.

@crespocarlos I can't seem to view that repo, but if there's anything helpful in there could it be shared? I'd be glad to provide any additional datapoints if they can assist in resolving this.

The two directions I was leaning towards (at least before seeing the one running process) were

A) the calls being blocked by AppArmor as described in the Linux Metrics

Certain data streams may access /proc to gather process information, and the resulting ptrace_may_access() call by the kernel to check for permissions can be blocked by AppArmor and other LSM software, even though the System module doesn't use ptrace directly.

If these are the only calls being made I am willing to write up a profile for it. When I tried last time I got hung up on the path (hash changes between versions). I'm not seeing any AppArmor events for elastic in auditd though. I see other processes so it may not be this.

B) I increased the interval to cut down in log volume and it's very possible that a 30 second interval just is almost never catching these.

cmacknz commented 5 months ago

They're running when I check top and event Elastic Agent and its child processes show as sleeping. One of the agent processes was the one I did see running today

This definitely seems wrong. @fearful-symmetry any ideas on what might be causing this?

fearful-symmetry commented 5 months ago

So, at least for certain workloads, this isn't too weird, at any given time the majority of processes on a system are likely to be in an S state (again, depending on the workload and config of a server):

cat /proc/*/stat | cut -d ' ' -f 3 | sort | uniq -c
    122 I
      8 R
    205 S
      1 Z

I suspect that for a great many user workloads, "how many processes are in a running state?" is not useful information; for example, while I ran that above command, the server was performing a build operation, and using between 50-99% CPU. However, that's a separate issue.

(note that the I is for idle kernel threads, and kernel processes probably aren't getting reported on most systems due to a bug.

They're running when I check top and event Elastic Agent and its child processes show as sleeping. One of the agent processes was the one I did see running today.

I'm a tad confused by the wording here. Are you saying that the child processes of agent itself were incorrectly reporting as the wrong state? If this is happening while monitoring the host system under docker, it may be related to a bug I found yesterday.

I don't see anything in the code that would cause us to fallback to an S state where there isn't one. If something fails while either monitoring the process we'll either skip the PID entirely and report an error in the debug logs, or report the state as unknown.

Note that it's also possible for beats to "miss" a PID's state if it doesn't line up with the period that metricbeat is running with. For example, if you have a process that's mostly is in a S state, but then wakes every few seconds to perform a short operation that takes less than 10 seconds, and metricbeat is running with a reporting period of 10s, it's possible that metricbeat will only ever report the PID as S.

If a user can verify that there's a consistent error in the reported state for a PID, the best route would be to increase the period to see if that fixes it, and if not, enable debug logging, and see if metricbeat reports any particular errors with the PID.

jvalente-salemstate commented 5 months ago

I'm a tad confused by the wording here. Are you saying that the child processes of agent itself were incorrectly reporting as the wrong state? If this is happening while monitoring the host system under docker, it may be related to a bug I found yesterday.

@fearful-symmetry oh sorry. I meant that /opt/Elastic/Agent/elastic-agent consistently shows as sleeping, as do processes like /opt/Elastic/Agent/data/elastic-agent-8.13.2-6d53d8/components/metricbeat. I'd expect Metricbeat to at least show sometimes as running if it's running to collect metrics. The one time I was able to see something as 'running' within elastic was filebeat. Oddly, even top itself show as as sleeping in elastic, even while I watch it in my terminal maintaining a status of R. It does occasionally move lower than where I can see it while watching, so it may also drop into sleeping then.

However, I do agree that the likely cause is really just timing of when it runs. This plus limited number of top n processes + a third to half of them--at least on our hosts--being used up by Elastic Agent and any of the beats processes it launches are obscuring anything that could be in a running state when metricbeat collects its metrics. This ends up giving the appearance that nothing is actually running and that something is broken. Grabbing some examples, I did run into other cases where something was definitely wrong though.

Here on one host 9/10 of the top processes for memory are elastic

and 9/10 of the top processes by cpu are also elastic. There is one running process but it's not in the top n.

In some cases, there do seem to be actual errors. Here, I was able to capture something with a running status but the count is N/A .

Kibana, regardless of the starting timestamp, say it's the 1 minute preceding the end timestamp, and that it's aggregated.

Showing process data collected for the 1 minute preceding Apr 25, 2024 @ 20:07:56

The processes listed are based on an aggregation of the top CPU and the top memory consuming processes. It does not show all processes.

Is it possible that the issue is the aggregation itself and how it handles status changes?

jvalente-salemstate commented 5 months ago

Oh also, not on docker. Though, noting they're Ubuntu and RHEL (7/9) VMs on ESXi hosts may help if it's a specific virtualization thing here as well.

With a bit of testing, and looking at logs.

Sleeping vs running occurs in about 98-99% sleeping & 1-2% running for of metrics-* in any given period--with a disk_sleep as well in case, with any selection of (rhel, ubuntu) hosts. The majority, ~90% are metricbeat itself. Only 21% of the total metrics of are from metricbeat. For the reasons I'd expect that to be the case. The ratio does match up roughly to what I'd get with cat /proc/*/stat though. So they're being indexed correctly.

Running for ((i=0;i<30;i++)); do cat /proc/*/stat | cut -d ' ' -f 3 | sort | uniq -c;sleep 5;done;

Got periodic output that I could compare to the host in kibana. Not the most reliable method because of timing but this output would consistently remain at 238±2. Within Kibana is idle, zombie, and sleeping are matches to the output. Except it showsN/A running processes, and the reported number of process is exactly the number of running processes less than the output.

....
cat: /proc/net/stat: Is a directory
     84 I
      3 R
    150 S
      1 Z
cat: /proc/net/stat: Is a directory
     84 I
      3 R
    149 S
      1 Z
...
cat: /proc/net/stat: Is a directory
     84 I
      3 R
    150 S
      1 Z
....

fearful-symmetry commented 5 months ago

Oddly, even top itself show as as sleeping in elastic, even while I watch it in my terminal maintaining a status of R. It does occasionally move lower than where I can see it while watching, so it may also drop into sleeping then.

Is it possible that the issue is the aggregation itself and how it handles status changes?

Yeah, I get the feeling that we're running into the limits of how aggregation works. Or perhaps it's just a bug?

I suspect that a certain amount of this is also metricbeat missing processes that happen between query periods.

cat /proc/*/stat | grep ' R ' | cut -d ' ' -f 2
(cat)
(cat)
(cat)

In this case, the cat call isn't going to last long enough to get reported by metricbeat.

It's also important to note that sleep is a normal state for a process that isn't CPU-bound. If a process is waiting for I/O or some kind if interrupt, it'll report as sleeping. This is where most applications will spend their time.

fearful-symmetry commented 5 months ago

Also, can confirm that the aggregation that counts the process state definitely seems to struggle.

jvalente-salemstate commented 5 months ago

I did a bit of digging last night and now I don't think it's the aggregation.

The aggregation actually seems correct and is just grabbing the most recent timestamp from system.process.summary. That dataset just consists of sparse info on counts of statuses in /proc/*/status.
Metricbeat itself is generating that data. It's pulling in dependencies from an elastic agent repo, which pulls in a Go library, process.go for this as well.
The process counts at the top and the process details at the bottom are using different data (which explains why those show different status totals) which isn't apparent from the dashboard.

I can't tell if there's an issue in Metricbeat, the elastic library, or the Go library. Maybe it's timing again, how it's interpreting what is or isn't flapping (though that was in heartbeat I think, not Metricbeat), or somewhere along the line it's also parsing statuses wrong and missing running statuses along the way.

It's entirely possible nothing is wrong and, as your issue notes, the dashboard is just confusing in what its actually showing.

fearful-symmetry commented 5 months ago

@jvalente-salemstate yeah, as you noted, the visualization depends on both the process metricset and process_summary metricset, which is a bit non-obvious. I've filed an issue in the Kibana repo to hopefully improve this a bit, since even I got confused by it.

fearful-symmetry commented 5 months ago

As far as the actual behavior itself, unless we have evidence of a bug, I think our best conclusion is that it's a combination of how linux scheduling works combined with the limited granularity of the time-based reporting of metricbeat.

roshan-elastic commented 5 months ago

Thanks everyone for the detailed diving on this.

I have a question - do we know if, given the correct configuration, this would reasonably replicate the what a top command would show?

As PM on the UI feature here, I'm trying to figure out:

Is something broken?
Is better guidance needed?
Are better expectations/clarifications needed?

We do fairly consistently get negative feedback about the 'accuracy' of this and I'm trying to figure out what options we might have at our disposal to address why users are complaining.

I'm wondering if we're in a position to suggest some paths forward?

jvalente-salemstate commented 5 months ago

I think the best solution may just be removing or heavily reworking the process summary. On the rework end, I'm not particularly sure what a good alternative is. Maybe modifying the aggregation to look back a bit more and break down process counts for what's still "active" and what have been killed in that period, and maybe the same metric on the count of distinct parent pid?

The way it's reported does require a more in depth understanding of both Windows and Linux as the results swing to opposing extremes.

On Linux most will just show as sleeping rather than running. Other states seem to be showing as one would expect.
On Windows, i looked at our metrics over the last week: out of over 2.1 million events in system.process.summary everything is showing as Running except 216 documents where one process is sleeping.

So sleeping vs running on Linux has little information of use while Running vs anything on Windows has the same issue.

Personally, I would see more value in removing those metrics and just adding more context and data to the topN metrics.

I'll keep it light unless you'd prefer it here and not a separate FR but at a high level:

parent process ID
Kibana seems to be locking the top N to 10 even when set higher. Sorting gives correct results but anything over than 10 is truncated. Using the extra space for more processes.
cgroup info, if enabled.
I have to doublecheck if the dataset even captures it but maybe handles and listening ports.

elastic / beats

[Metricbeat][System]`system.process.state` reports `sleeping` #38120