ClusterLabs / fence-agents

Fence agents

[Question] About disable-timeout. #417

Closed HideoYamauchi closed 3 years ago

HideoYamauchi commented 3 years ago

Hi All,

In the fence-agents package included in RHEL 8.4, disable-timeout is enabled by default. However, we have seen that with this default, fencing behavior in the absence of a topology changes significantly from RHEL 8.3. This issue has been reported to Bugzilla:

Wouldn't it be better for disable-timeout to default to false, so that it doesn't cause confusion?

If there has already been a discussion about enabling disable-timeout by default, please point me to it.

Best Regards, Hideo Yamauchi.

oalbrigt commented 3 years ago

You can set disable_timeout=0 if it doesn't behave as you expect it to in your setup.

HideoYamauchi commented 3 years ago

Hi Oyvind,

Thanks for your comment.

You can set disable_timeout=0 if it doesn't behave as you expect it to in your setup.

We understand that changing this setting restores the previous behavior.

However, users who upgrade may suddenly run into this problem. Making 0 the default would cause less confusion for users.

Has the default value of disable-timeout already been discussed and settled, so that it will not change?

Best Regards, Hideo Yamauchi.

HideoYamauchi commented 3 years ago

And...

The Bugzilla report describes the difference in fencing escalation as a problem. Another problem is the difference in failover time. Until now, fencing was retried after a timeout of about 20s.

However, if disable-timeout is enabled, fencing will not be retried until stonith-timeout (60s by default). As a result, if fencing is successful in the retry, there will be a difference of about 40s.

We think it's a problem to increase the failover time by enabling disable-timeout (default).
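The 40s figure above can be reproduced with a small model. This is only a sketch of the arithmetic in this comment, not Pacemaker's actual scheduling logic: the function name `delay_before_retry` is hypothetical, and the 20s and 60s values are the approximate numbers quoted in the thread.

```python
def delay_before_retry(stonith_timeout_s, agent_internal_timeout_s, disable_timeout):
    """Seconds until a hung first fencing attempt is abandoned (simplified model)."""
    if disable_timeout:
        # Agent-internal timeouts are off, so only Pacemaker's
        # stonith-timeout ends the hung attempt.
        return stonith_timeout_s
    # Otherwise, whichever timeout fires first ends the attempt.
    return min(agent_internal_timeout_s, stonith_timeout_s)


# Values quoted in the thread: ~20s internal timeout, 60s default stonith-timeout.
before = delay_before_retry(60, 20, disable_timeout=False)
after = delay_before_retry(60, 20, disable_timeout=True)
print(before, after, after - before)  # prints: 20 60 40
```

In this model, lowering stonith-timeout is the only way to shorten the delay while disable-timeout stays enabled.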

Best Regards, Hideo Yamauchi.

kgaillot commented 3 years ago

I think this is a situation where some users are going to be confused either way.

The original intent of the change was that fencing timeouts are fairly common, and users were confused by the number of timeout options available. By disabling agent-internal timeouts when Pacemaker is managing the agent, the user can modify just the Pacemaker timeout.

But that causes this issue, where if an agent previously could time out one internal step and return an error, Pacemaker had time to retry, whereas if Pacemaker times out the agent, there is no time left for a retry.

Comparing what happens in the two situations:

| Scenario | Original approach | Current approach |
| --- | --- | --- |
| Agent sometimes times out on an internal step but works the next attempt | "just works" | User must either set disable_timeout=0, or lower stonith-timeout to get quick retries |
| Internal steps sometimes trip internal timeouts | User has to research up to 3 different timeouts, and figure out which ones to change and to what | Either "just works," or user must increase stonith-timeout |

Considering both scenarios, the impact on the user seems less with the current approach.

HideoYamauchi commented 3 years ago

Hi Ken,

Thanks for your comment.

I now understand why disable-timeout was introduced.

Tomorrow we have a meeting with our members. We will review your answer and discuss it.

Many thanks, Hideo Yamauchi.

HideoYamauchi commented 3 years ago

Hi Ken, Hi Oyvind,

I talked to the members.

Either way, I understand that users will be confused.

Again, we are concerned about longer failover times with previous user settings. If the user has configured topology, the failover time will be even longer than before.

Do you have any plans to improve this disable-timeout function in the future in order to shorten the failover time?

Ultimately, if your decision is to keep disable-timeout enabled by default to reduce user confusion, we will follow that decision. (If necessary, users can set disable-timeout to false to get the same behavior as before.)

To Ken: There is something I'm more concerned about. It is internal to Pacemaker: stonith-timeout has to be set with the query that runs before the fencing operation in mind. Since this concerns Pacemaker, let me discuss it in the earlier Bugzilla report: https://bugs.clusterlabs.org/show_bug.cgi?id=5473

Best Regards, Hideo Yamauchi.

kgaillot commented 3 years ago

Hi Hideo,

Hi Ken, Hi Oyvind,

I talked to the members.

Either way, I understand that users will be confused.

Again, we are concerned about longer failover times with previous user settings. If the user has configured topology, the failover time will be even longer than before.

Do you have any plans to improve this disable-timeout function in the future in order to shorten the failover time?

I'm not sure there is anything more that could be done automatically (without user configuration). The goal of disable-timeout is to give users the simplicity of a single stonith-timeout by default, but they still have the option of more fine-grained control by setting disable-timeout=false and setting the agent-internal timeouts themselves, and/or setting the pcmk_*_timeout options.

I think the ideal timeout values vary by deployment, so it is difficult to automate. I would hope that in most cases, to shorten failover, it will be sufficient for users to lower stonith-timeout to whatever is appropriate for their devices. In more complicated cases, they can adjust all of the other timeout options.

Perhaps the documentation needs a guide to setting timeouts.

HideoYamauchi commented 3 years ago

Hi Ken,

Your comment on Bugzilla gave me an idea of how to improve the settings. I will investigate a little more.

Best Regards, Hideo Yamauchi.

HideoYamauchi commented 3 years ago

Hi Ken, Hi Oyvind,

Even in an environment where disable-timeout is enabled, using pcmk_status_timeout / pcmk_list_timeout solved our problem.
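A rough sketch of why per-action options help: Pacemaker documents the pcmk_*_timeout options as alternate timeouts used for a given action instead of stonith-timeout. The helper `effective_timeout` below is hypothetical, and the 10s values are illustrative only; the thread does not state the values that were actually used.

```python
def effective_timeout(action, device_params, stonith_timeout_s):
    """Timeout applied to one fence action in this simplified model.

    A per-action pcmk_<action>_timeout on the device overrides the
    cluster-wide stonith-timeout; otherwise stonith-timeout applies.
    """
    return device_params.get(f"pcmk_{action}_timeout", stonith_timeout_s)


# Illustrative device settings (hypothetical values, in seconds).
device = {"pcmk_status_timeout": 10, "pcmk_list_timeout": 10}

for action in ("list", "status", "reboot"):
    print(action, effective_timeout(action, device, stonith_timeout_s=60))
# prints:
# list 10
# status 10
# reboot 60
```

With settings like these, a hung list or status query is abandoned quickly, while the fencing action itself keeps the full stonith-timeout.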

I am closing this issue.

Many thanks, Hideo Yamauchi.

oalbrigt commented 3 years ago

Nice. I'm glad to hear it.