Running fence_scsi_check_hardreboot consumes CPU.

HideoYamauchi commented 4 years ago

Hi All,

We configured a cluster using fence_scsi in a virtual environment to which only one CPU core is allocated.

When fence_scsi_check_hardreboot is used together with the watchdog service in the pacemaker cluster, fence_scsi_check_hardreboot uses 20% of the CPU every second.

When this happens, pacemaker frequently logs the following messages.

(snip)
12:56:45 xx pacemaker-controld  [10137] (throttle_check_thresholds)    notice: High CPU load detected: 2.080000
12:57:15 xx pacemaker-controld  [10137] (throttle_check_thresholds)    notice: High CPU load detected: 1.930000
12:57:45 xx pacemaker-controld  [10137] (throttle_check_thresholds)    notice: High CPU load detected: 1.540000
12:58:15 xx pacemaker-controld  [10137] (throttle_check_thresholds)    notice: High CPU load detected: 1.470000
12:58:45 xx pacemaker-controld  [10137] (throttle_check_thresholds)    notice: High CPU load detected: 1.230000
12:59:15 xx pacemaker-controld  [10137] (throttle_check_thresholds)    notice: High CPU load detected: 1.650000
(snip)

Some improvement can be achieved by increasing the number of CPU cores or increasing the monitoring interval of the watchdog service. However, some users may not be able to change core assignments. Increasing the monitoring interval also affects the failover time when a failure occurs.
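As a rough way to reason about the trade-off described above, the steady-state CPU share of a periodic check is approximately its CPU time per run divided by the interval (and by the number of cores). The sketch below is illustrative only: the function name `cpu_share` is hypothetical, and the 0.2 s per run is an assumed figure chosen to match the 20% reported here, not a measurement.

```python
def cpu_share(cpu_seconds_per_run, interval_seconds, cores=1):
    """Approximate fraction of total CPU capacity used by a periodic check."""
    if interval_seconds <= 0 or cores <= 0:
        raise ValueError("interval_seconds and cores must be positive")
    return cpu_seconds_per_run / (interval_seconds * cores)


# Assumed cost of one check run (hypothetical, not measured).
ASSUMED_CPU_SECONDS_PER_RUN = 0.2

# Compare: 1 s interval on 1 core, a longer interval, and more cores.
for interval, cores in [(1, 1), (5, 1), (1, 4)]:
    share = cpu_share(ASSUMED_CPU_SECONDS_PER_RUN, interval, cores)
    print("interval=%ss cores=%d -> %.0f%% of CPU" % (interval, cores, share * 100))
```

Under these assumptions, a longer interval or more cores lowers the share, but a longer interval also delays detection of a failure.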

Is there any way to improve the fence_scsi_check_hardreboot script to solve the problem? (Can the processing of fence_scsi_check_hardreboot be made a little lighter?)

Best Regards, Hideo Yamauchi.

oalbrigt commented 4 years ago

You could try setting verbose=yes to see if you can track down what exactly causes the issue (will be explained in the agent's metadata if it's supported on your installed version of the agent).

HideoYamauchi commented 4 years ago

Hi Oyvind,

Our environment is RHEL8.0, and fence_scsi seems to support verbose=yes.

I tried setting verbose=yes in the fence_scsi parameters, but nothing in particular seems to be written to pacemaker.log. Is the information written somewhere else?

Best Regards, Hideo Yamauchi.

oalbrigt commented 4 years ago

It might also be in corosync.log or /var/log/messages.

If you try to run it manually though it should be shown on your screen immediately.

HideoYamauchi commented 4 years ago

Hi Oyvind,

Thanks for your comment. I'll give it a try.

Best Regards, Hideo Yamauchi.

HideoYamauchi commented 4 years ago

Hi Oyvind,

Since I could not get the verbose option to take effect, I modified the fence_scsi code to force verbose logging and ran it, but it did not seem to produce much useful information.

(snip)
def scsi_check(hardreboot=False):
        if len(sys.argv) >= 3 and sys.argv[1] == "repair":
                return int(sys.argv[2])
        options = {}
        options["--sg_turs-path"] = "/usr/bin/sg_turs"
        options["--sg_persist-path"] = "/usr/bin/sg_persist"
        options["--power-timeout"] = "5"
        options["retry"] = "0"
        options["retry-sleep"] = "1"
        options = scsi_check_get_options(options)
#       if "verbose" in options and options["verbose"] == "yes":
        logging.getLogger().setLevel(logging.DEBUG)
(snip)

[root@rh80-02 ~]#  /etc/watchdog.d/fence_scsi_check_hardreboot test 
INFO:root:Executing: /usr/bin/sg_turs /dev/sdb

DEBUG:root:0  

INFO:root:Executing: /usr/bin/sg_persist -n -i -k -d /dev/sdb

DEBUG:root:0   PR generation=0x5fb3, 8 registered reservation keys follow:
    0x5e2a0001
    0x5e2a0001
    0x5e2a0001
    0x5e2a0001
    0x5e2a0000
    0x5e2a0000
    0x5e2a0000
    0x5e2a0000

DEBUG:root:key 5e2a0001 registered with device /dev/sdb
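Judging from the debug output above, the check comes down to finding the node's key among the registered reservation keys printed by sg_persist. The following is a minimal sketch of that idea, not the agent's actual implementation: the function names are hypothetical, and the sample text is copied from the output above.

```python
import re

# Matches lines that contain only a key such as "    0x5e2a0001".
# The "PR generation=0x5fb3, ..." header line is intentionally not matched.
KEY_LINE = re.compile(r"^[ \t]*0x([0-9a-fA-F]+)[ \t]*$", re.MULTILINE)


def registered_keys(sg_persist_output):
    """Return the keys listed in the output, lowercased and without the 0x prefix."""
    return [m.group(1).lower() for m in KEY_LINE.finditer(sg_persist_output)]


def key_is_registered(key, sg_persist_output):
    """True if `key` (hex string without 0x) appears among the registered keys."""
    return key.lower() in registered_keys(sg_persist_output)


SAMPLE_OUTPUT = """PR generation=0x5fb3, 8 registered reservation keys follow:
    0x5e2a0001
    0x5e2a0001
    0x5e2a0001
    0x5e2a0001
    0x5e2a0000
    0x5e2a0000
    0x5e2a0000
    0x5e2a0000
"""

print(key_is_registered("5e2a0001", SAMPLE_OUTPUT))
```

The parsing itself is cheap, which fits the observation that the verbose output gives no obvious hint about where the CPU time goes.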

Also, it seems that the same high CPU load occurs when using the watchdog service with fence_mpath.

I will investigate the cause a little more.

Best Regards, Hideo Yamauchi.

oalbrigt commented 4 years ago

Maybe there's some watchdog setting for tuning priority of the process?

HideoYamauchi commented 4 years ago

Hi Oyvind,

Maybe there's some watchdog setting for tuning priority of the process?

Yes.

In the environment in question, the default settings in /etc/watchdog.conf are as follows:

(snip)
# This greatly decreases the chance that watchdog won't be scheduled before
# your machine is really loaded
realtime                = yes
priority                = 1
(snip)

Many thanks, Hideo Yamauchi.

oalbrigt commented 4 years ago

I would try changing the priority to see if that helps.

HideoYamauchi commented 4 years ago

Hi Oyvind,

I would try changing the priority to see if that helps.

I'll give it a try....

But...

I changed the priority to 50 and to 99 and restarted the watchdog service each time, but the CPU usage of fence_scsi_check_hardreboot does not seem to change.

The rise in CPU usage can be reproduced simply by running the following command line.

 /usr/libexec/platform-python -c 'import sys;sys.path.append("/usr/share/fence");import fencing'

Since the cost seems to come from Python's import processing, I think it will be difficult to improve.
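To put a number on the import cost, one option is to time a fresh interpreter with and without the import. The sketch below is a minimal example under stated assumptions: `time_import` is a hypothetical helper, it measures wall-clock time rather than CPU time, and it uses the standard `json` module as a stand-in so that it runs anywhere.

```python
import subprocess
import sys
import time


def time_import(module=None, extra_path=None, python=sys.executable):
    """Wall-clock seconds for a fresh interpreter to start and import `module`.

    With module=None only interpreter startup is measured (the baseline).
    """
    code = "import sys\n"
    if extra_path:
        code += "sys.path.append(%r)\n" % extra_path
    if module:
        code += "import %s\n" % module
    start = time.perf_counter()
    subprocess.run([python, "-c", code], check=True)
    return time.perf_counter() - start


baseline = time_import()          # interpreter startup only
with_json = time_import("json")   # stand-in module for illustration
print("startup: %.3fs, startup + import json: %.3fs" % (baseline, with_json))
```

On the affected node, the equivalent of the command line above would be `time_import("fencing", "/usr/share/fence", "/usr/libexec/platform-python")`. Comparing that with the baseline should show how much of each run is spent on the import alone.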

Best Regards, Hideo Yamauchi.

oalbrigt commented 4 years ago

Yeah. I don't know how we can improve that.

HideoYamauchi commented 4 years ago

Hi Oyvind,

I will think a little more about how to improve this.

It may be the right conclusion that this is difficult to improve in Python. In that case, users will need to dedicate more CPU resources to the virtual machines, and so on.

Best Regards, Hideo Yamauchi.

ClusterLabs / fence-agents

Running fence_scsi_check_hardreboot consumes CPU. #313