atc0005 / todo

A collection of TODO items not specific to any one project
MIT License

Research options for building Nagios plugin to detect processes in an "Uninterruptible Sleep (D)" state #47

Status: Closed (atc0005 closed this 1 year ago)

atc0005 commented 2 years ago

Overview

This ps recipe detects and lists processes in the D state, including the kernel function each one is waiting in (the wchan column), which shows what got it stuck in that state:

ps -eo ppid,pid,user,stat,pcpu,comm,wchan:32 | grep " D"

Detecting any process in a D state is sufficient cause to report a problem. Which process ends up in that state could perhaps be used to elevate the service state severity.
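
One caveat with the recipe: grep " D" matches that sequence anywhere on the line, so a command name beginning with "D" would match as well. A minimal sketch, assuming the column order of the ps recipe above, that tests only the STAT column and maps the result to the standard Nagios plugin exit codes (0 = OK, 2 = CRITICAL). The function names and the choice of CRITICAL are assumptions for illustration, not a settled design.

```shell
# Sketch: keep only rows whose STAT column (4th field in the ps recipe)
# starts with "D". Reads ps-style output on stdin so it can be tried
# against saved output as well as a live system.
# filter_d_state and check_d_state are hypothetical helper names.
filter_d_state() {
    awk '$4 ~ /^D/'
}

# Report in Nagios style: exit code 2 (CRITICAL) if any row matched,
# 0 (OK) otherwise. Using CRITICAL here is an assumption.
check_d_state() {
    d_matches=$(filter_d_state)
    if [ -n "$d_matches" ]; then
        printf 'CRITICAL: processes in uninterruptible sleep (D) found\n%s\n' "$d_matches"
        return 2
    fi
    printf 'OK: no processes in uninterruptible sleep (D)\n'
    return 0
}
```

Possible usage: ps -eo ppid,pid,user,stat,pcpu,comm,wchan:32 | check_d_state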

References

Background

Go-specific

atc0005 commented 2 years ago

pseudocode:

  1. cd /proc/
  2. ls (filter to directories named with all digits ^[0-9]+)
  3. look at the */status file
  4. look at the line with State containing D

E.g.,

$ cd /proc
$ grep State */status | tail
934/status:State:       I (idle)
935/status:State:       I (idle)
938/status:State:       I (idle)
939/status:State:       I (idle)
940/status:State:       S (sleeping)
941/status:State:       S (sleeping)
976/status:State:       S (sleeping)
978/status:State:       S (sleeping)
self/status:State:      R (running)
thread-self/status:State:       R (running)
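
The pseudocode above could be sketched in shell roughly as follows. This is only an illustration of the four steps, not the eventual plugin: list_d_state is a hypothetical name, and the proc root is a parameter (defaulting to /proc) purely so the function can be tried against a test directory.

```shell
# Sketch: print the PID of every process whose status file reports
# State "D". list_d_state is a hypothetical helper name.
list_d_state() {
    proc_root=${1:-/proc}
    for dir in "$proc_root"/[0-9]*; do
        pid=${dir##*/}
        # Step 2: keep only directory names made up entirely of digits.
        case $pid in *[!0-9]*) continue ;; esac
        # Step 3: a process may exit between the glob and the read.
        [ -r "$dir/status" ] || continue
        # Step 4: the State line looks like "State:  D (disk sleep)".
        state=$(awk '$1 == "State:" { print $2; exit }' "$dir/status" 2>/dev/null)
        if [ "$state" = "D" ]; then
            printf '%s\n' "$pid"
        fi
    done
    return 0
}
```

Calling list_d_state with no argument scans /proc and prints one PID per line; no output means nothing was found in the D state at that moment.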

atc0005 commented 2 years ago

Per https://www.baeldung.com/linux/uninterruptible-process:

3.3. Methods to Stop a Process in Uninterruptible Sleep

If we ever encounter a process into uninterruptible sleep, we need to check our hardware. If we encounter the issue when using network storage, it might be down, and the process is waiting for the server to recover. Once we know the driver that is causing the trouble, we can stop it. We might need rmmod to remove the module supporting the hardware device.

Another alternative is to use the parent process identifier of the process in uninterruptible sleep. We can get the identifier of the parent process (known as PPID) and stop this process. This is sufficient for cases where the parent process is an errant shell. Killing the parent process kills the child processes, which may trigger the explicit call required by the process in uninterruptible sleep.

Finally, the last solution when nothing else works is to suspend-to-disk or restart the system. We can try first to suspend-to-disk (also known as hibernate) and resume to see if this unfreezes the process in uninterruptible sleep. If this does not work, we have to restart the system. We might not be able to restart some systems, for example, a connected network device. In this case, we should attempt to unfreeze the process with the previous methods.

Of particular note:

Another alternative is to use the parent process identifier of the process in uninterruptible sleep. We can get the identifier of the parent process (known as PPID) and stop this process. This is sufficient for cases where the parent process is an errant shell. Killing the parent process kills the child processes, which may trigger the explicit call required by the process in uninterruptible sleep.

The PPID mentioned there corresponds to the ppid column in the ps "recipe" from the original post:

ps -eo ppid,pid,user,stat,pcpu,comm,wchan:32 | grep " D"

Having the PPID in the plugin output would prove useful when hotfixing a particular issue.
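
When working from /proc rather than ps, the same value is available as the PPid field of a process's status file. A minimal sketch, assuming a hypothetical helper named ppid_of; the optional second argument (default /proc) exists only so it can be tried against a test directory. Whether stopping the parent actually helps depends on the situation, so this only looks the value up.

```shell
# Sketch: print the parent PID recorded in <proc_root>/<pid>/status.
# ppid_of is a hypothetical helper name.
# Usage: ppid_of PID [PROC_ROOT]
ppid_of() {
    # Exact match on "PPid:" so the neighbouring "Pid:" line is ignored.
    awk '$1 == "PPid:" { print $2; exit }' "${2:-/proc}/$1/status"
}
```

For example, ppid_of "$$" prints the parent PID of the current shell.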

atc0005 commented 2 years ago

pseudocode:

  1. cd /proc/
  2. ls (filter to directories named with all digits ^[0-9]+)
  3. look at the */status file
  4. look at the line with State containing D

As noted on https://www.baeldung.com/linux/process-states:

3.3. The /proc Pseudo File

The /proc pseudo filesystem contains all the information about the processes in our system. Hence, we could directly read the state of a process through this pseudo filesystem. The downside of this approach is we’ll first need to know the PID of the process before we can read its state.

To obtain the state of a process, we can extract the value from its pseudo status file under /proc/{pid}/status. For example, we can get the state of the process with PID 2519 by reading the file /proc/2519/status:

$ cat /proc/2519/status | grep State
State:    S (sleeping)

atc0005 commented 2 years ago

When parsing */status files, it's probably worth grabbing these details:

Having this available in the list of processes in D state would help with troubleshooting efforts.

atc0005 commented 1 year ago

Created new project: https://github.com/atc0005/check-process