distributed-system-analysis / pbench

A benchmarking and performance analysis framework
http://distributed-system-analysis.github.io/pbench/
GNU General Public License v3.0
188 stars 108 forks source link

Failure to stop `sar` tool from Tool Meister #2691

Open portante opened 2 years ago

portante commented 2 years ago

Is there an existing issue for this?

Description

The sar tool failed to stop after 30 seconds.

To Reproduce

No known steps to reproduce.

Actual Results

The tm.logs file contained:

192.168.122.192 0272 2022-03-14 21:37:29,862 INFO 26848 140432394401600 pbench-tool-meister wait 250 -- Waiting for transient tool sar stop process
192.168.122.192 0273 2022-03-14 21:37:34,863 INFO 26848 140432394401600 pbench-tool-meister _wait_for_process 228 -- Transient tool sar has not stopped after 5 seconds
192.168.122.192 0274 2022-03-14 21:37:39,864 INFO 26848 140432394401600 pbench-tool-meister _wait_for_process 228 -- Transient tool sar has not stopped after 10 seconds
192.168.122.192 0275 2022-03-14 21:37:44,866 INFO 26848 140432394401600 pbench-tool-meister _wait_for_process 228 -- Transient tool sar has not stopped after 15 seconds
192.168.122.192 0276 2022-03-14 21:37:49,868 INFO 26848 140432394401600 pbench-tool-meister _wait_for_process 228 -- Transient tool sar has not stopped after 20 seconds
192.168.122.192 0277 2022-03-14 21:37:54,870 INFO 26848 140432394401600 pbench-tool-meister _wait_for_process 228 -- Transient tool sar has not stopped after 25 seconds
192.168.122.192 0278 2022-03-14 21:37:59,872 ERROR 26848 140432394401600 pbench-tool-meister wait 257 -- Killed un-responsive transient tool sar stop process after 30 seconds

No other errors in any other log file could be found to indicate what happened.

Expected Results

192.168.122.192 0272 2022-03-14 21:37:29,862 INFO 26848 140432394401600 pbench-tool-meister wait 250 -- Waiting for transient tool sar stop process

Additional information

A separate issue related to this is that the sar process was still running and caused the tar command to fail.

portante commented 2 years ago

This is a long standing problem with the sar tool.

Basically, sar is a a simple wrapper that sysstat provides around the sadc command. The sar command creates a child process which runs the sadc command itself. The pbench sar "datalog" execs the sar command.

However, the pbench sar tool support records the PID of the parent sar command, and issues a kill -TERM of that pid. But the sar command just immediately dies from the KILL leaving the sadc process running in the background. The sadc process will only be cleaned up with it tries to communicate with its parent process and then dies via a SIGPIPE.

When long intervals are used for the pbench sar tool settings, the sadc process will wait around up to that amount of time before noticing the SIGPIPE. E.g. a 120 second interval would create a situation where the sadc process would wait, at most, another 120 seconds before actually exiting after its sar parent process was killed.

There are a couple of ways to solve this problem:

  1. change the pbench sar tool to run sadc directly, and then use the sar command during the "stop post-processing" phase to create the expected output
  2. special case the stop method of the sar tool to wait for the sadc process to exit

A third way is to replace the use of sysstat with PCP entirely.

webbnh commented 2 years ago

the pbench sar tool support records the PID of the parent sar command, and issues a kill -TERM of that pid. But the sar command just immediately dies from the KILL leaving the sadc process running in the background.

Do we understand why the kill -TERM affects only the target process and not its subprocess? Should we be trying to kill the who job tree? (Obviously, using PCP would be preferable, but perhaps this wouldn't be hard?)

portante commented 2 years ago

Not entirely sure, but I used strace on the parent sar process and on the child sadc process. When a SIGTERM is sent to the parent sar process, it is an unhandled signal, so the process makes no other systems calls and just dies. The child sadc process eventually wakes up at the interval and dies with a SIGPIPE.

When the child sadc process receives a SIGTERM, it also does not have a signal handler installed and just dies, but the parent process detects the child dying (via SIGCHILD) and reports, "End of data collecting unexpected."

So I think we want to SIGTERM the parent sar process, and then SIGTERM the child sadc process.

webbnh commented 2 years ago

I think we want to SIGTERM the parent sar process, and then SIGTERM the child sadc process.

👍

portante commented 2 years ago

This represents a class of work that we need, but not for the v0.71 release.

ldoktor commented 2 years ago

Let me recommend you one function from Avocado ;-) https://github.com/avocado-framework/avocado/blob/master/avocado/utils/process.py#L235 (search for kill_process_tree). Maybe you can re-use it here...