argoproj / argo-cd

Declarative Continuous Deployment for Kubernetes
https://argo-cd.readthedocs.io
Apache License 2.0

Sync can be "waiting for healthy state of /ConfigMap/<name>" for a while even though the config map has already been created #18500

Open · andrii-korotkov-verkada opened this issue 4 months ago

andrii-korotkov-verkada commented 4 months ago

Describe the bug

A sync can stay in progress waiting for the config map to become healthy, even while the sync details (shown when clicking on the sync) indicate the config map was already created. This unnecessarily slows down the sync, sometimes by 30+ minutes.

To Reproduce

Create a config map manifest and have an Argo CD app sync it. The waiting time may depend on the size of the cluster and/or the number of apps.

Expected behavior

The config map is reported healthy and the sync moves on as soon as the config map is created.

Screenshots

(Three screenshots attached: 2024-06-04 at 1:12:05 PM, 1:12:21 PM, and 1:12:33 PM.)

Version

Argo CD v2.10.1.

Logs

(Screenshot of logs attached: 2024-06-04 at 1:14:36 PM.)

tooptoop4 commented 4 months ago

How many apps do you have? What k8s version?

andrii-korotkov-verkada commented 4 months ago

There are about 200 apps; the architecture is one EKS cluster per AWS account. The EKS Kubernetes version is 1.26.

andrii-korotkov-verkada commented 4 months ago

One observation from the logs listing sync tasks is that the config map in question has a nil live object for a while, while its target object is set. I'm looking at this code, which should populate the live object but doesn't for some time: https://github.com/argoproj/gitops-engine/blob/8a3ce6d85caa4220cfcaa8aa8b6d6dff476909ec/pkg/sync/sync_context.go#L707-L712.
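
For context, here's a rough, hedged sketch of what that enrichment step appears to do (identifiers are assumed, not copied from gitops-engine): each sync task looks up its live object among the resources gathered during the last reconciliation, so liveObj stays nil until a refresh has actually observed the created resource.

package sync

import (
	"github.com/argoproj/gitops-engine/pkg/utils/kube"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
)

type syncTask struct {
	targetObj *unstructured.Unstructured // desired state from git
	liveObj   *unstructured.Unstructured // observed state, nil if not yet known
}

// enrichWithLiveObjects is a paraphrase of the linked code, not the real
// implementation: tasks are matched to managed resources by resource key.
func enrichWithLiveObjects(tasks []*syncTask, managed map[kube.ResourceKey]*unstructured.Unstructured) {
	for _, task := range tasks {
		if task.targetObj == nil || task.liveObj != nil {
			continue
		}
		if live, ok := managed[kube.GetResourceKey(task.targetObj)]; ok {
			task.liveObj = live // remains nil while the app's resource list is stale
		}
	}
}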

andrii-korotkov-verkada commented 4 months ago

Looking at the K8s audit logs for one such case, there are a couple of events with a 404 response, e.g. (filtered):

{
    "kind": "Event",
    "apiVersion": "audit.k8s.io/v1",
    "level": "Metadata",
    "stage": "ResponseComplete",
    "requestURI": "/api/v1/namespaces/default/configmaps/<name>",
    "verb": "get",
    "objectRef": {
        "resource": "configmaps",
        "namespace": "default",
        "name": "<name>",
        "apiVersion": "v1"
    },
    "responseStatus": {
        "metadata": {},
        "status": "Failure",
        "message": "configmaps \"<name>\" not found",
        "reason": "NotFound",
        "details": {
            "name": "<name>",
            "kind": "configmaps"
        },
        "code": 404
    }
}

Then there's a success with a 201 code, which means the object was created.

{
    "kind": "Event",
    "apiVersion": "audit.k8s.io/v1",
    "level": "Metadata",
    "stage": "ResponseComplete",
    "requestURI": "/api/v1/namespaces/default/configmaps?fieldManager=argocd-controller",
    "verb": "create",
    "objectRef": {
        "resource": "configmaps",
        "namespace": "default",
        "name": "<name>",
        "apiVersion": "v1"
    },
    "responseStatus": {
        "metadata": {},
        "code": 201
    }
}

Note that the latter URL doesn't include the config map name, since a create is a POST to the collection URL. Then, ~7 min later (it can be longer), there's an entry with a 200 code:

{
    "kind": "Event",
    "apiVersion": "audit.k8s.io/v1",
    "level": "Metadata",
    "stage": "ResponseComplete",
    "requestURI": "/api/v1/namespaces/default/configmaps/<name>?dryRun=All&fieldManager=argocd-controller&force=true",
    "verb": "patch",
    "objectRef": {
        "resource": "configmaps",
        "namespace": "default",
        "name": "<name>",
        "apiVersion": "v1"
    },
    "responseStatus": {
        "metadata": {},
        "code": 200
    }
}

The latter is followed by the Argo log Updating resource result, status: 'Synced' -> 'Synced', phase 'Running' -> 'Succeeded', message 'configmap/<name> created' -> 'configmap/<name> created'.

andrii-korotkov-verkada commented 4 months ago

Looks like it tries to get the config map (a couple of times), then very shortly after calls create, but doesn't observe the created object until some patch happens later.

andrii-korotkov-verkada commented 4 months ago

I'm curious about this code: https://github.com/argoproj/gitops-engine/blob/master/pkg/sync/sync_context.go#L1242-L1250. It seems like for config map creation we could also set the phase to completed if it was running, even in a wet run.
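
A minimal sketch of that suggestion, using assumed identifiers from gitops-engine's sync/common package (this is the idea, not an actual change): after a successful wet-run create, a running task could move straight to succeeded rather than waiting for a refresh to observe the live object.

package sync

import "github.com/argoproj/gitops-engine/pkg/sync/common"

// phaseAfterApply sketches the proposal: once a wet run (not a dry run) has
// applied the resource successfully, complete the running operation instead
// of leaving it pending until the next refresh reports a healthy live object.
func phaseAfterApply(dryRun bool, result common.ResultCode, phase common.OperationPhase) common.OperationPhase {
	if !dryRun && result == common.ResultCodeSynced && phase == common.OperationRunning {
		return common.OperationSucceeded
	}
	return phase
}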

andrii-korotkov-verkada commented 4 months ago

One more discovery. This code https://github.com/argoproj/gitops-engine/blob/8a3ce6d85caa4220cfcaa8aa8b6d6dff476909ec/pkg/sync/sync_context.go#L635-L654 creates tasks from sc.resources, the same collection used to enrich tasks with live objects. It provides the tasks in the "Tasks from managed resources" log, and they don't contain a task for the new config map.

Apparently the new config map is not a managed resource yet, so the task won't be enriched with a live object. One difference I found in our manifests is that app.kubernetes.io/instance is not set in the repo manifests' metadata labels (while app is set), but it's eventually present on the resource in the cluster. Maybe adding it to the manifests will solve this issue.
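
For reference, Argo CD's default tracking method identifies app resources by the app.kubernetes.io/instance label. A simplified sketch of that matching (not the actual implementation, which also supports annotation-based tracking):

package tracking

import "k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"

// isManagedBy is a simplified sketch of label-based resource tracking: with
// the default method, a live object belongs to an app when its
// app.kubernetes.io/instance label equals the application name.
func isManagedBy(obj *unstructured.Unstructured, appName string) bool {
	return obj.GetLabels()["app.kubernetes.io/instance"] == appName
}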

andrii-korotkov-verkada commented 4 months ago

Nope, adding that label doesn't help. I need to figure out some other way to get the target config map into the app's managed resources.

andrii-korotkov-verkada commented 4 months ago

Hm, the instance label seems to be the correct default way to mark a resource as belonging to an application: https://argo-cd.readthedocs.io/en/stable/user-guide/resource_tracking/.

andrii-korotkov-verkada commented 4 months ago

Seems like both app.kubernetes.io/instance and app.kubernetes.io/part-of put the corresponding task under the tasks for managed resources, but the tasks don't seem to be updated; at least the logs show just (,,) for their state.

andrii-korotkov-verkada commented 4 months ago

Apparently resource.Live is still nil in those cases for some time: https://github.com/argoproj/gitops-engine/blob/8a3ce6d85caa4220cfcaa8aa8b6d6dff476909ec/pkg/sync/sync_context.go#L913.

andrii-korotkov commented 3 months ago

I think I've finally figured it out. The root cause seems to be adding the app operation and refresh events to the queues at the same time, e.g. https://github.com/argoproj/argo-cd/blob/3e2cfb138795e24df98c9745d813e4b7f1720dbd/controller/appcontroller.go#L873. Sync state operates on resources gathered during reconciliation, so if the app operation event is processed before the refresh event (when triggered on a resource update/creation), the refresh doesn't help the sync progress, and it essentially has to wait for another app refresh.

The fix seems to be to schedule the app operation event after the refresh event has finished processing. There's one place where the operation event is scheduled without a refresh event (which can stay as is), and one place where the refresh event is scheduled without the operation one, during app deletion handling: https://github.com/argoproj/argo-cd/blob/3e2cfb138795e24df98c9745d813e4b7f1720dbd/controller/appcontroller.go#L2177. It's probably safe to schedule the operation event after that too, since there's code checking whether the app was deleted; if not, the refresh queue can be modified to store tuples whose second element is a bool indicating whether to schedule the operation event after processing the refresh event.
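
Here's a self-contained sketch of that tuple idea using client-go's workqueue (names like refreshRequest are assumed; this isn't the actual patch): only the refresh is enqueued, and the operation event is chained after the refresh finishes, so the operation always sees freshly reconciled resources.

package main

import (
	"fmt"

	"k8s.io/client-go/util/workqueue"
)

// refreshRequest sketches the proposed queue item: the app key plus a flag
// saying whether to chain an operation event once the refresh completes.
type refreshRequest struct {
	appKey            string
	scheduleOperation bool
}

func main() {
	refreshQueue := workqueue.New()
	operationQueue := workqueue.New()

	// Instead of enqueueing refresh and operation side by side (the racy
	// pattern), only the refresh is enqueued, carrying the chaining flag.
	refreshQueue.Add(refreshRequest{appKey: "default/my-app", scheduleOperation: true})

	// Refresh worker: reconcile first, then schedule the operation.
	item, _ := refreshQueue.Get()
	req := item.(refreshRequest)
	fmt.Println("reconciling", req.appKey) // stand-in for the real reconciliation
	if req.scheduleOperation {
		operationQueue.Add(req.appKey)
	}
	refreshQueue.Done(item)
}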

I’ll write a fix later this week.

andrii-korotkov-verkada commented 3 months ago

I've deployed the PR changes in a live environment and found that the config map is not being waited on at all; it's created, and the sync progresses quickly to the next steps!

Note, there have been some network errors, like lookup argocd-repo-server on <ip>:<port>: no such host, but those happen even on v2.11.3 and are probably related to network issues in our cluster.

andrii-korotkov-verkada commented 3 months ago

Continuing the conversation in the PR.

andrii-korotkov-verkada commented 3 months ago

With the fix, and with the watch resync timeout reduced from 10 to 5 minutes, it behaves better: observed waiting times are 1-3 minutes, sometimes 5. I haven't seen 10+ or 20+ minutes as before, at least according to the related logs (though they are not a perfect measure).

andrii-korotkov-verkada commented 3 months ago

Other than that, I'm looking into optimizing watch performance, e.g. by excluding some resource types from Argo CD. It may simply be that the clusters have many resources with frequent updates.

andrii-korotkov-verkada commented 3 months ago

There's still one remaining problem which I don't know how to solve. If a reconciliation is ongoing and a resource update triggers another one, the new request might get deduplicated by the queue, so the data may still remain stale. If reconciliations are generally fast, we can work around this by adding a delay before adding the app key to the refresh queue. But we definitely want some deduplication to avoid unnecessary reconciliations. @agaudreault, @crenshaw-dev, what do you think?
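
For illustration, client-go's delaying queue supports exactly that kind of postponed enqueue; a sketch of the workaround (the key name and delay are made up):

package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	// AddAfter postpones the enqueue, so a refresh requested while a
	// reconciliation is in flight would land after the current pass,
	// avoiding deduplication against it.
	q := workqueue.NewDelayingQueue()
	q.AddAfter("default/my-app", 2*time.Second)

	item, _ := q.Get() // blocks until the delay elapses
	fmt.Println("refreshing", item)
	q.Done(item)
}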

andrii-korotkov-verkada commented 3 months ago

Actually, never mind, this shouldn't be a problem; I misunderstood how the queue works:

Stingy: a single item will not be processed multiple times concurrently, and if an item is added multiple times before it can be processed, it will only be processed once.
Multiple consumers and producers. In particular, it is allowed for an item to be reenqueued while it is being processed.

Looks like if an item has been taken out of the queue and is being processed, the same item can be enqueued again. This is also consistent with some logs I see.
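
A small self-contained demonstration of that semantics with client-go's workqueue (the same package the controller uses):

package main

import (
	"fmt"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	q := workqueue.New()

	q.Add("default/my-app")
	q.Add("default/my-app") // deduplicated: still a single pending item

	item, _ := q.Get()      // "default/my-app" is now being processed
	q.Add("default/my-app") // re-enqueued while processing: delivered again later
	q.Done(item)            // finishing the first delivery releases the copy

	next, _ := q.Get() // the same key comes out a second time
	fmt.Println(next)
	q.Done(next)
}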

andrii-korotkov-verkada commented 3 months ago

Seems like https://github.com/argoproj/argo-cd/pull/18694 and https://github.com/argoproj/gitops-engine/pull/603 solve the problem! https://github.com/argoproj/gitops-engine/pull/603#issuecomment-2212620560