Custom resource health checks randomly not evaluated when using glob pattern

duizabojul commented 7 months ago

Describe the bug

When utilizing custom resource health checks with glob patterns, Lua scripts are inconsistently invoked, resulting in the application controller randomly failing to report resource health.

Upon investigation of the codebase, I identified the underlying issue:

To locate the health Lua script associated with a resource, the controller iterates over all resource override keys until it finds a key matching the resource group/version.

The primary problem here lies in the fact that map keys are not ordered. Consequently, if a resource group/version matches multiple resource override glob keys, the controller randomly selects one of these keys based on the order of iteration. This leads to the random usage of one of the Lua scripts if they exist.

While it might seem improbable to define multiple globs matching the same group/kind resource, another issue arises. The controller adds default resource overrides under the */* key.

Due to this additional resource override, which matches all resources, the iteration over map keys may randomly match this key instead of the user-defined key. The */* resource override lacks a Lua script, causing the resource to have no health check evaluated.

To address this, I propose a fix that involves sorting keys by string length (from longer key to shorter):

func GetWildcardConfigMapKey(vm VM, gvk schema.GroupVersionKind) string {
    gvkKeyToMatch := GetConfigMapKey(gvk)

    resourceOverridesKeys := make([]string, len(vm.ResourceOverrides))

    i := 0
    for k := range vm.ResourceOverrides {
        resourceOverridesKeys[i] = k
        i++
    }

    // Sort the keys so that the longest key is first
    sort.Slice(resourceOverridesKeys, func(i, j int) bool {
        return len(resourceOverridesKeys[i]) > len(resourceOverridesKeys[j])
    })

    for _, key := range resourceOverridesKeys {
        if glob.Match(key, gvkKeyToMatch) {
            return key
        }
    }

    return ""
}

By sorting the keys, the */* key is consistently evaluated last. This not only addresses the issue with the random evaluation but also provides an added benefit: it allows the definition of a health check script with a glob and its subsequent override with a longer glob. For instance, defining *.upbound.io/* and *.aws.upbound.io/* health checks ensures that the health evaluation for the s3.aws.upbound.io group will use the *.aws.upbound.io/* Lua script.

Version

argocd: v2.9.3+6eba5be
  BuildDate: 2023-12-01T23:05:50Z
  GitCommit: 6eba5be864b7e031871ed7698f5233336dfe75c7
  GitTreeState: clean
  GoVersion: go1.21.3
  Compiler: gc
  Platform: linux/amd64

amanfredi commented 7 months ago

Wow, nice find! I was wondering why the custom crossplane health checks appeared to be randomly toggling on and off.

mbbush commented 6 months ago

I like this solution, especially as it provides a place to make "find the best match" smarter than just length-based. Length-checking is a clever shortcut, and might be entirely sufficient if argo only supports globs, but I wonder if it might be a good idea to treat the group and kind match sections separately, and/or use a metric that evaluates the length of non-glob sections of the pattern, in order.

mbbush commented 6 months ago

I think given the following globs, I'd want them applied in this order:

*/ProviderConfigUsage
*/ProviderConfig
s3.aws.upbound.io/*
*.aws.upbound.io/*
*.upbound.io/*
*/*

So maybe a good metric would be to sort first by the length of the kind pattern, then by the length of the group pattern?

It would probably be good for someone with a non-crossplane use case to also chime in on their preferences.

jan-mrm commented 5 months ago

nice find! There is no workaround or is there until the issue is fixed?

mbbush commented 4 months ago

I'm not aware of any workaround, and as I'm about to start relying on argocd healthchecks as part of my CD pipeline, I really want them to be more stable.

@jan-mrm @amanfredi @duizabojul does my suggestion of sorting first by the length of the kind pattern, then by the length of the group pattern seem like a good idea to you?

jan-mrm commented 4 months ago

@mbbush I personally don't think that I have a preference between "overall" length vs "kind-then-group" length. I don't see a downside for the "kind-then-group" length sorting at the moment. From a non-crossplane perspective I'm not sure how applicable/beneficial it is but I probably don't have a good overview here about how many equal kinds from different groups there are, that follow the same schema so that the health check can be equal so that this sorting benefits them. But I guess even if not and you then wanted to differentiate between them you could always define multiple checks again anyways. So no preference. I could work with both :)

llavaud commented 3 months ago

any updates ?

mbbush commented 3 months ago

I've got a local branch that does this, I just haven't written the tests yet.

agaudreault commented 3 months ago

I don't think the length by itself is very deterministic. You basically want to use the most specific configuration, but this is very subjective using the name.

I think the algorithm could be For GK1 and GK2

If the first glob (*) starting from the end occurs earlier in G1, G1 has lower priority
If G1 has more *, G1 is lower priority
If G1 is shorter, it is lower priority
If G1 is smaller lexicographically, it is lower priority
If G1 and G2 are equal, evaluate K1 and K2
If the first glob (*) occurs earlier in K1, K1 has lower priority
If K1 ends in * and K2 doesn't, K1 is lower priority
If K1 has more *, K1 is lower priority
If K1 is shorter, it is lower priority
If K1 is smaller lexicographically, it is lower priority

You would end up with this order. The logical reasoning behind this would be that if you can specify the resource Kind without wildcard, then you are able to specify the group.

s3.aws.upbound.io/Something
s3.aws.upbound.io/Some*
s3.aws.upbound.io/*
*.aws.upbound.io/Something
*.aws.upbound.io/Something*
*.aws.upbound.io/Some*
*.aws.upbound.io/*thing
*.aws.upbound.io/*m*th
*.aws.upbound.io/*
*.upbound.io/*
*/ProviderConfigUsage
*/ProviderConfig
*/*

Perhaps adding a priority integer would help create a generic deterministic order in a clearer and more objective way?

duizabojul commented 3 months ago

I don't think we should overcomplicate things. Sorting by the full pattern size or by kind size, followed by group size as suggested by @mbbush, is sufficient to fix this bug. This approach will add consistency across runs and remain easy for everyone to understand.

argoproj / argo-cd

Custom resource health checks randomly not evaluated when using glob pattern #16905