Open portante opened 2 years ago
This is a long standing problem with the sar
tool.
Basically, sar
is a a simple wrapper that sysstat
provides around the sadc
command. The sar
command creates a child process which runs the sadc
command itself. The pbench sar
"datalog" exec
s the sar
command.
However, the pbench sar
tool support records the PID of the parent sar
command, and issues a kill -TERM
of that pid. But the sar
command just immediately dies from the KILL leaving the sadc
process running in the background. The sadc
process will only be cleaned up with it tries to communicate with its parent process and then dies via a SIGPIPE
.
When long intervals are used for the pbench sar
tool settings, the sadc
process will wait around up to that amount of time before noticing the SIGPIPE
. E.g. a 120 second interval would create a situation where the sadc
process would wait, at most, another 120 seconds before actually exiting after its sar
parent process was killed.
There are a couple of ways to solve this problem:
sar
tool to run sadc
directly, and then use the sar
command during the "stop post-processing" phase to create the expected outputsar
tool to wait for the sadc
process to exitA third way is to replace the use of sysstat
with PCP entirely.
the pbench
sar
tool support records the PID of the parentsar
command, and issues akill -TERM
of that pid. But thesar
command just immediately dies from the KILL leaving thesadc
process running in the background.
Do we understand why the kill -TERM
affects only the target process and not its subprocess? Should we be trying to kill the who job tree? (Obviously, using PCP would be preferable, but perhaps this wouldn't be hard?)
Not entirely sure, but I used strace
on the parent sar
process and on the child sadc
process. When a SIGTERM
is sent to the parent sar
process, it is an unhandled signal, so the process makes no other systems calls and just dies. The child sadc
process eventually wakes up at the interval and dies with a SIGPIPE
.
When the child sadc
process receives a SIGTERM
, it also does not have a signal handler installed and just dies, but the parent process detects the child dying (via SIGCHILD
) and reports, "End of data collecting unexpected."
So I think we want to SIGTERM
the parent sar
process, and then SIGTERM
the child sadc
process.
I think we want to SIGTERM the parent sar process, and then SIGTERM the child sadc process.
👍
This represents a class of work that we need, but not for the v0.71 release.
Let me recommend you one function from Avocado ;-) https://github.com/avocado-framework/avocado/blob/master/avocado/utils/process.py#L235 (search for kill_process_tree
). Maybe you can re-use it here...
Is there an existing issue for this?
Description
The
sar
tool failed to stop after 30 seconds.To Reproduce
No known steps to reproduce.
Actual Results
The
tm.logs
file contained:No other errors in any other log file could be found to indicate what happened.
Expected Results
Additional information
A separate issue related to this is that the
sar
process was still running and caused thetar
command to fail.