SUSE / SAPHanaSR

SAP HANA System Replication Resource Agent for Pacemaker Cluster
GNU General Public License v2.0
26 stars 21 forks source link

HANA_CALL - handle all timeout return codes, not only 124 #248

Open PeterPitterling opened 5 months ago

PeterPitterling commented 5 months ago

https://man7.org/linux/man-pages/man1/timeout.1.html

Exit status: 124 if COMMAND times out, and --preserve-status is not specified 125 if the timeout command itself fails 126 if COMMAND is found but cannot be invoked 127 if COMMAND cannot be found 137 if COMMAND (or timeout itself) is sent the KILL (9) signal (128+9)

  • the exit status of COMMAND otherwise

https://github.com/SUSE/SAPHanaSR/blob/8db3d754965348bb2808ce117ba539a468608c32/ra/SAPHana#L692-L697

fmherschel commented 5 months ago

Why?

fmherschel commented 5 months ago

What was the CONCRETE error condition?

fdanapfel commented 3 months ago

@fmherschel in the SAPHanaTopology RA there are already checks for other return values, for example https://github.com/SUSE/SAPHanaSR/blob/8db3d754965348bb2808ce117ba539a468608c32/ra/SAPHanaTopology#L981 https://github.com/SUSE/SAPHanaSR/blob/8db3d754965348bb2808ce117ba539a468608c32/ra/SAPHanaTopology#L1091 https://github.com/SUSE/SAPHanaSR/blob/8db3d754965348bb2808ce117ba539a468608c32/ra/SAPHanaTopology#L1095 https://github.com/SUSE/SAPHanaSR/blob/8db3d754965348bb2808ce117ba539a468608c32/ra/SAPHanaTopology#L1114 https://github.com/SUSE/SAPHanaSR/blob/8db3d754965348bb2808ce117ba539a468608c32/ra/SAPHanaTopology#L1118

So the question for me would be why the other error codes for the timeout command are takein ito account in the SAPHanaTopology RA but not in the SAPHana RA?

fmherschel commented 3 months ago

@fdanapfel RC 124 happens regularly. I have never seen 125, 126, 127. RC 124 is by plan. So again did we have had a concrete situation where we did catch other RCs?

ja9fuchs commented 3 months ago

@fmherschel Yes we do. :) Although that particular problem only affects scale-out due to a minor difference in the Topology code. But the new merged code seems to have that gap again.

We have a case of RC 134 in a scale-out environment which leads to a failover, due to the role being set to some rubbish from that error. Although the RC means an ABRT of the python command, it's similar to a timeout and should not cause an immediate HANA resource failure.

In the scale-up SAPHanaTopology RA this happens to be automatically covered by the -ge 124 conditionals, while the same cases in the scale-out SAPHanaTopology RA are strictly -eq 124 in the monitor function. It also looks like it is again restricted to exact matches of only 124 in the new combined NG code.

Do you have specific concerns about generally allowing this bit of tolerance of any RC >= 124? Keeping this in some places would reduce the impact of at least a bad role set, caused by issues that are not related to the HANA landscape. Very similar to a command timeout, this could just be a temporary problem on the system.

fmherschel commented 3 months ago

@fdanapfel , @ja9fuchs Ah, so you just like to change from "-eq 124" to "-ge 124", right? Then I did misunderstand the initial issue description. I had the impression the initial request was to differ the rc codes. This would be every test intensive. To change from "-eq 124" to "-ge 124" should be fine. However I would like to test that also with my semi-automated tester. Sorry for the misunderstanding. Good catch.

PeterPitterling commented 3 months ago

I'm fine with that change, but the RC should be part of the warn message

ja9fuchs commented 3 months ago

@fdanapfel , @ja9fuchs Ah, so you just like to change from "-eq 124" to "-ge 124", right? Then I did misunderstand the initial issue description. I had the impression the initial request was to differ the rc codes. This would be every test intensive. To change from "-eq 124" to "-ge 124" should be fine. However I would like to test that also with my semi-automated tester. Sorry for the misunderstanding. Good catch.

@fmherschel Correct, this is my suggestion. Although it is not exactly the same as @PeterPitterling originally requested. But I think it would be a good solution to cover these RCs in a consistent way in the different RAs. Maybe with a slight change of wording in the log output, to make it more generic and not refer to the result as a timeout only. For example call it "command error" instead of "timeout". By including the RC in the output this should be enough info for individual troubleshooting.

fmherschel commented 3 months ago

@ja9fuchs Yes I also want to change that. Also the angi code will have some changes on that part. It will take some time as this also needs some testing. I was afraid to differ e.g. rc==125 in the RAs reaction. Because that is really not easy. But we could of course treat a (e.g.) rc==125 in the same way as a rc==124 as a "failed answer". And following the comment of @PeterPitterling we will add the RC to the log messages. Please give me some time to do that.

ja9fuchs commented 3 months ago

@fmherschel If I understand it correctly, you will work on adjustments of the angi code. In the meanwhile, if that is ok, I would create a small PR in the separate scale-out repository to make the minimum of adjustment there to cover the agreed change. This way the work is distributed and we can clarify details in the PR to get it right.

fmherschel commented 3 months ago

@ja9fuchs That would be great!

fdanapfel commented 3 months ago

@fmherschel My main question actually was why the resource agents use "-eq 124" in some cases and "-ge 124" in others.

I agree with @ja9fuchs that it would be good to make it more consistent that only one type of check is used, preferably "-ge 124" to also cover other values that might be returned. But there is no need in my opinion to differentiate between the RCs.

And I also like the suggestion from @PeterPitterling to include the RC in the messages that get logged.

fmherschel commented 3 months ago

@fdanapfel Why does code has different handlings? It might be that we have learned during the "aging" of SAPHanaSR and did miss the other code sections. I currently do not expect that there is a good reason why we should differ between -eq and -ge. But I will also ask @angelabriel, if I miss something which explains the difference. If I look at the timeout-callling/using code I see that we have had the need to change the call from default (which uses SIGTERM) to SIGKILL (9). This unfortunately changes the timeout return code from 124 to 137 (following the man pages and my local tests). So in these case the -eq 124 seams to be wrong. Sorry for this error.

ja9fuchs commented 3 months ago

@fmherschel Thank you for accepting and merging the PR in scale-out! :+1:

@PeterPitterling Would you please check if that covers your request well enough, for classic scale-out? If yes, it might help aligning the angi code in a consistent way.

fmherschel commented 3 months ago

@ja9fuchs Do you also check the scale-up-classic code in the branch maintenence-classic or should I do this? Angi code will be adapted be me. Then we have all 3 projects (including smells) updated.

ja9fuchs commented 3 months ago

@ja9fuchs Do you also check the scale-up-classic code in the branch maintenence-classic or should I do this? Angi code will be adapted be me. Then we have all 3 projects (including smells) updated.

@fmherschel I actually compared to the classic scale-up code side-by-side while doing the scale-out changes. The scale-up sht_monitor_clone function uses -ge 124 only already.

Note: The HANA_CALL function still contains conditionals with -eq 124, but this seems really about timeout handling and it only adds log messages. That is similar in both Topology agents and I did not touch it. This function does not take a decision, but provides the RC to the parent function that called it, like the monitor.This is where actual decisions are made based on the conditionals, which only needed the adjustment in the classic scale-out Topology code.

ja9fuchs commented 3 months ago

I checked all conditionals around 124 in each of the classic RAs and those that are still -eq 124 are not taking decisions or influencing the health. They might be enhanced for cosmetical/logging reasons, but I considered those not part of issues that needed fixing.

PeterPitterling commented 3 months ago

Note: The HANA_CALL function still contains conditionals with -eq 124, but this seems really about timeout handling and it only adds log messages. That is similar in both Topology agents and I did not touch it. This function does not take a decision, but provides the RC to the parent function that called it, like the monitor.This is where actual decisions are made based on the conditionals, which only needed the adjustment in the classic scale-out Topology code.

That was actually my initial request. Also this section should handle ge 124 and provide the RC in the error log