freeipa / freeipa-healthcheck

Check the health of a freeIPA installation
GNU General Public License v3.0

Intermittent replication errors when running ipa-healthcheck #283

Open Kivernitas opened 1 year ago

Kivernitas commented 1 year ago

Issue

Intermittent replication errors when running ipa-healthcheck. Running ipa-healthcheck every x minutes gives unreliable ReplicationCheck results. From what I've read on https://access.redhat.com/solutions/359683, getting a "replica is busy" status is considered "normal". This makes it difficult to monitor for actual replication errors.

Actual behaviour

  {
    "source": "ipahealthcheck.ds.replication",
    "check": "ReplicationCheck",
    "result": "ERROR",
    "uuid": "94548c4b-ca49-4f8a-bd2e-1953fba9f767",
    "when": "20230103141508Z",
    "duration": "0.304435",
    "kw": {
      "key": "DSREPLLE0003",
      "items": [
        "Replication",
        "Agreement"
      ],
      "msg": "The replication agreement (ipa-2.test.io-to-ipa-3.test.io) under \"dc=test,dc=io\" is not in synchronization.\nStatus message: error (1) can't acquire busy replica (unable to acquire replica: the replica is currently being updated by another supplier.)"
    }
  }

Errors similar to the one above occur intermittently on every FreeIPA server in a 3-node cluster. Most of the time there are no replication errors.

Expected behavior

It should not report an error. A warning would be more suitable.

Version/Release/Distribution


Rocky Linux 8.6
Source : ipa-healthcheck-0.7-14.module+el8.7.0+1075+05db0c1d.src.rpm (latest available)
FreeIPA: 4.9

rcritten commented 1 year ago

This check is provided by 389 itself. I suppose we could consider reducing the severity to WARNING but I'd leave that as a call to them. @mreynolds389 what do you think?

mreynolds389 commented 1 year ago

Well it is a transient error. Replication is just busy at that time. If you run it again in a few seconds it will probably pass. For us we already set it to a "medium" severity.
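The "run it again in a few seconds" advice can be sketched as a small retry wrapper. This is only an illustration under stated assumptions: `run_check` is a hypothetical callable that returns the parsed ipa-healthcheck results as a list of dicts shaped like the one in the report, and the attempt count and delay are arbitrary.

```python
import time
from typing import Callable, Dict, List

# Substring taken from the status message quoted in the report.
BUSY_MARKER = "can't acquire busy replica"


def is_busy_replica(result: Dict) -> bool:
    """Return True for a ReplicationCheck ERROR caused by a busy replica."""
    kw = result.get("kw", {})
    return (
        result.get("check") == "ReplicationCheck"
        and result.get("result") == "ERROR"
        and BUSY_MARKER in kw.get("msg", "")
    )


def run_with_retry(
    run_check: Callable[[], List[Dict]],  # hypothetical: runs the check, returns parsed results
    attempts: int = 3,                    # assumed value, tune as needed
    delay: float = 5.0,                   # assumed seconds between attempts
    sleep: Callable[[float], None] = time.sleep,
) -> List[Dict]:
    """Re-run the check while the only problem seen is a busy replica."""
    results = run_check()
    for _ in range(attempts - 1):
        if not any(is_busy_replica(r) for r in results):
            break
        sleep(delay)
        results = run_check()
    return results
```

If the busy-replica error is still present after the last attempt, the results are returned unchanged, so a persistent problem is still reported.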

Kivernitas commented 1 year ago

Thanks both for replying!

Yes it's a transient error. We run ipahealthcheck_exporter which basically scrapes ipa-healthcheck logs every 5 minutes. Can you suggest an alternative way of verifying replication health?

@mreynolds389 you mentioned you set it to "medium" severity, could I ask how?

mreynolds389 commented 1 year ago

Well, IPA is using DS's lib389 library for the DS healthchecks. IPA does not use DS's healthcheck severity level; it is ignored because these are basically two tools that were merged.

rexberg commented 2 months ago

@rcritten Since IPA does not use DS's healthcheck severity level, could this check's severity level be lowered to WARNING in IPA?

rcritten commented 2 months ago

healthcheck doesn't ignore the DS severity. It converts it. See https://github.com/freeipa/freeipa-healthcheck/issues/283#issuecomment-2111803800

"medium" from DS is converted into an ipa-healthcheck ERROR severity.

rexberg commented 2 months ago

Thanks for clarifying. Do we want to set this specific check's severity to WARNING, bypassing the conversion? As mentioned, it is a transient error, but it still triggers an ERROR severity.

rcritten commented 2 months ago

I suppose it's possible but it would be an ugly one-off. healthcheck has a rather thin wrapper to call the 389 checks and then re-format the return value. It's very generic code. It would be invasive to put in a test for a specific check.

rexberg commented 2 months ago

I looked at the code and would assume as much and I tend to agree. Currently we exclude this specific check since we can't really "trust" the ERROR trigger.
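As an alternative to excluding the check entirely, the results could be post-processed on the monitoring side before alerting. The sketch below is not part of ipa-healthcheck or lib389; it assumes the results have already been parsed into a list of dicts shaped like the one in the original report, and `downgrade_busy_replica` is a hypothetical helper name.

```python
from typing import Dict, List

# Key and message substring taken from the result quoted in the report.
BUSY_KEY = "DSREPLLE0003"
BUSY_MARKER = "can't acquire busy replica"


def downgrade_busy_replica(results: List[Dict]) -> List[Dict]:
    """Return a copy of results with busy-replica ERRORs relabeled as WARNING.

    Other DSREPLLE0003 failures keep their ERROR severity, so genuine
    replication problems are still reported as errors.
    """
    adjusted = []
    for item in results:
        kw = item.get("kw", {})
        if (
            item.get("result") == "ERROR"
            and kw.get("key") == BUSY_KEY
            and BUSY_MARKER in kw.get("msg", "")
        ):
            # Shallow copy so the caller's original data is left untouched.
            item = {**item, "result": "WARNING"}
        adjusted.append(item)
    return adjusted
```

Matching on both the key and the message text keeps the one-off outside the generic wrapper discussed above, at the cost of depending on the exact wording of the 389 status message.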