Open andrii-korotkov-verkada opened 4 months ago
How many apps do you have? What k8s version?
There are about 200 apps, the architecture is one EKS cluster per AWS account. The EKS k8s version is 1.26.
One observation from the task-list logs is that the config map in question has a nil live object for a while, while the target object is set. I'm looking at this code that should populate the live object, but it doesn't for some time: https://github.com/argoproj/gitops-engine/blob/8a3ce6d85caa4220cfcaa8aa8b6d6dff476909ec/pkg/sync/sync_context.go#L707-L712.
Looking at K8s audit logs for one such case, there are a couple of events with a 404 response, e.g. (filtered):
{
  "kind": "Event",
  "apiVersion": "audit.k8s.io/v1",
  "level": "Metadata",
  "stage": "ResponseComplete",
  "requestURI": "/api/v1/namespaces/default/configmaps/<name>",
  "verb": "get",
  "objectRef": {
    "resource": "configmaps",
    "namespace": "default",
    "name": "<name>",
    "apiVersion": "v1"
  },
  "responseStatus": {
    "metadata": {},
    "status": "Failure",
    "message": "configmaps \"<name>\" not found",
    "reason": "NotFound",
    "details": {
      "name": "<name>",
      "kind": "configmaps"
    },
    "code": 404
  }
}
Then there's a success with a 201 code, which means the config map was created:
{
  "kind": "Event",
  "apiVersion": "audit.k8s.io/v1",
  "level": "Metadata",
  "stage": "ResponseComplete",
  "requestURI": "/api/v1/namespaces/default/configmaps?fieldManager=argocd-controller",
  "verb": "create",
  "objectRef": {
    "resource": "configmaps",
    "namespace": "default",
    "name": "<name>",
    "apiVersion": "v1"
  },
  "responseStatus": {
    "metadata": {},
    "code": 201
  }
}
Note that the create URL doesn't include the config map name. Then, ~7 minutes later (it can be longer), there's an entry with a 200 code:
{
  "kind": "Event",
  "apiVersion": "audit.k8s.io/v1",
  "level": "Metadata",
  "stage": "ResponseComplete",
  "requestURI": "/api/v1/namespaces/default/configmaps/<name>?dryRun=All&fieldManager=argocd-controller&force=true",
  "verb": "patch",
  "objectRef": {
    "resource": "configmaps",
    "namespace": "default",
    "name": "<name>",
    "apiVersion": "v1"
  },
  "responseStatus": {
    "metadata": {},
    "code": 200
  }
}
The latter is followed by the Argo log: Updating resource result, status: 'Synced' -> 'Synced', phase 'Running' -> 'Succeeded', message 'configmap/<name> created' -> 'configmap/<name> created'.
It looks like it tries to get the config map (a couple of times), then very shortly after makes a call to create it, but doesn't observe the created object until some patch much later.
I'm curious about this code: https://github.com/argoproj/gitops-engine/blob/master/pkg/sync/sync_context.go#L1242-L1250. It seems like, for config map creation, we could also set the phase to completed if it was running, even in a wet run.
One more discovery. This code https://github.com/argoproj/gitops-engine/blob/8a3ce6d85caa4220cfcaa8aa8b6d6dff476909ec/pkg/sync/sync_context.go#L635-L654 creates tasks from sc.resources, the same map used to enrich tasks with live objects. It produces the tasks shown in the "Tasks from managed resources" log, and they don't contain a task for a new config map. Apparently the new config map is not a managed resource yet, so its task won't be enriched with a live object.
One difference I found in our manifests is that app.kubernetes.io/instance is not set in the repo manifests' metadata labels (while app is), but it is eventually present on the resource in the cluster. Maybe adding it to the manifests will solve this issue.
Nope, adding that label doesn't help. I have to figure out some other way to add a target config map to the managed app resources that actually works.
Hm, the instance label seems like the correct default way to specify that a resource belongs to an application: https://argo-cd.readthedocs.io/en/stable/user-guide/resource_tracking/.
Seems like both app.kubernetes.io/instance and app.kubernetes.io/part-of put the corresponding task under tasks for managed resources, but those tasks don't seem to be updated - at least the logs show just (,,) for their state.
Apparently resource.Live is still nil in those cases for some time: https://github.com/argoproj/gitops-engine/blob/8a3ce6d85caa4220cfcaa8aa8b6d6dff476909ec/pkg/sync/sync_context.go#L913.
I think I’ve finally figured it out. The root cause seems to be adding app operation and refresh events to the queues at the same time, e.g. https://github.com/argoproj/argo-cd/blob/3e2cfb138795e24df98c9745d813e4b7f1720dbd/controller/appcontroller.go#L873. Sync state operates on resources gathered from reconciliation, so if the app operation event is processed before the refresh one (when triggered on resource update/creation), the refresh doesn’t help the sync progress, and the sync essentially has to wait for another app refresh.
The fix seems to be to schedule the app operation event after the refresh event has finished processing. There’s one place where the operation event is scheduled without a refresh event (which can stay as is), and one place where the refresh event is scheduled without the operation one, during app deletion handling: https://github.com/argoproj/argo-cd/blob/3e2cfb138795e24df98c9745d813e4b7f1720dbd/controller/appcontroller.go#L2177. It’s probably safe to schedule the operation event after that too, since it has some code checking that the app was deleted; if not, the refresh queue can be modified to store tuples whose second element is a bool indicating whether to schedule the operation event after processing the refresh event.
I’ll write a fix later this week.
I've deployed the PR changes in a live environment and found that the config map is not being waited on at all - it's created, and the sync progresses quickly to the next steps!
Note, there have been some network errors, like lookup argocd-repo-server on <ip>:<port>: no such host, but that happens even on v2.11.3 and is probably related to some network issues in our cluster.
Continuing the conversation in the PR.
With the fix, and after reducing the watch resync time from 10 to 5 minutes, it behaves better, with observed waiting times of 1-3 minutes, sometimes 5 minutes. I haven't seen 10+ or 20+ minutes as before, at least according to the related logs (though they're not a perfect measure).
Other than that, I'm looking into optimizing watch performance, e.g. by excluding some resources from ArgoCD. It may simply be that there are many resources in the clusters with many updates.
There's still one remaining problem which I don't know how to solve. If a reconciliation is ongoing and a resource update triggers another one, the new one might get deduplicated by the queue, so the data may still remain stale. If reconciliations are generally fast, we could work around this by adding a delay before adding the app key to the refresh queue. But we definitely want some deduplication to avoid unnecessary reconciliations. @agaudreault, @crenshaw-dev, what do you think?
Actually nvm, this shouldn't be a problem, I misunderstood how the queue works:
Stingy: a single item will not be processed multiple times concurrently, and if an item is added multiple times before it can be processed, it will only be processed once.
Multiple consumers and producers. In particular, it is allowed for an item to be reenqueued while it is being processed.
Looks like if the item has been taken out of the queue and is being processed, the same item can be enqueued again. This is also consistent with some logs I see.
Checklist:
argocd version.
Describe the bug
The sync can be in progress waiting for the healthy state of the config map, while the sync state (accessible when clicking on the sync) can show that the config map was created. This unnecessarily slows down the sync and can sometimes take 30m+.
To Reproduce
Create a config map manifest and have an ArgoCD app sync it. The waiting time may depend on the size of the cluster and/or the number of apps.
Expected behavior
The config map is healthy and the sync moves on as soon as the config map is reported created.
Screenshots
Version
ArgoCD version 2.10.1.
Logs