argoproj / gitops-engine

Democratizing GitOps
https://pkg.go.dev/github.com/argoproj/gitops-engine?tab=subdirectories
Apache License 2.0
1.7k stars 260 forks source link

fix: handle stale BeforeHookCreation resources as progressing #461

Open nazarewk opened 2 years ago

nazarewk commented 2 years ago

TLDR; DELETE event during handling of BeforeHookCreation deletion policy can be delayed while etcd is running garbage collection:

  1. at the same time "finished" hook is present in the cache,
  2. gitops-engine does not verify creation timestamp of BeforeHookCreation objects against start time of sync operation,
  3. gitops-engine is unaware (old) object should be purged from cache and happily proceeds to next Sync Wave

The PR fixes this specific failure mode with minimal changes to the code.

The change is impossible to reproduce as etcd's GC usually does not take long enough to trigger the issue. At the same time it is very difficult to debug.

the EKS cluster I observed issues at had tens of thousands objects with >1k objects churned every few minutes and still GC triggered the issue only few times a week

fixes https://github.com/argoproj/gitops-engine/issues/446 closes https://github.com/argoproj/argo-cd/pull/10579 original issue https://github.com/argoproj/argo-cd/issues/10077

codecov[bot] commented 2 years ago

Codecov Report

Base: 55.38% // Head: 55.59% // Increases project coverage by +0.21% :tada:

Coverage data is based on head (e4d98aa) compared to base (517c1ff). Patch coverage: 100.00% of modified lines in pull request are covered.

Additional details and impacted files ```diff @@ Coverage Diff @@ ## master #461 +/- ## ========================================== + Coverage 55.38% 55.59% +0.21% ========================================== Files 41 41 Lines 4478 4493 +15 ========================================== + Hits 2480 2498 +18 + Misses 1807 1804 -3 Partials 191 191 ``` | [Impacted Files](https://codecov.io/gh/argoproj/gitops-engine/pull/461?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=argoproj) | Coverage Δ | | |---|---|---| | [pkg/sync/sync\_context.go](https://codecov.io/gh/argoproj/gitops-engine/pull/461/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=argoproj#diff-cGtnL3N5bmMvc3luY19jb250ZXh0Lmdv) | `73.41% <100.00%> (+0.79%)` | :arrow_up: | Help us with your feedback. Take ten seconds to tell us [how you rate us](https://about.codecov.io/nps?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=argoproj). Have a feature suggestion? [Share it here.](https://app.codecov.io/gh/feedback/?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=argoproj)

:umbrella: View full report at Codecov.
:loudspeaker: Do you have feedback about the report comment? Let us know in this issue.

nazarewk commented 2 years ago

seems like syncContext.getOperationPhase() is called only in a very specific case (not the one being tested). How should i proceed? Feels like I'm misunderstanding what happens step by step

I want to confirm following scenario:

  1. Previous sync has finished succesfully

    setting up `sc.syncRes` to imitate what ArgoCD would have passed ``` hookKey := kube.GetResourceKey(hook) syncCtx.syncRes[fmt.Sprintf("%s:%s", hookKey.String(), synccommon.HookTypePreSync)] = synccommon.ResourceSyncResult{ ResourceKey: hookKey, Version: hook.GroupVersionKind().Version, Status: synccommon.ResultCodeSynced, Message: "created", HookType: synccommon.HookTypePreSync, HookPhase: synccommon.OperationSucceeded, SyncPhase: synccommon.SyncPhasePreSync, Order: 1, } ```
  2. New (1st tick) Sync is started - Live hook was created before this

  3. old hook got deleted and recreated - Live hook is still old

  4. 2nd tick of Sync gets stuck on PreSync because Live hook is still old

  5. bump CreationTimestamp on Live object

  6. 3rd tick of Sync is waiting for hook to finish

sonarcloud[bot] commented 2 years ago

Kudos, SonarCloud Quality Gate passed!    Quality Gate passed

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 1 Code Smell

No Coverage information No Coverage information
0.0% 0.0% Duplication

nazarewk commented 2 years ago

managed to fully test expected flow, PR is ready for review