fluxcd / image-automation-controller

GitOps Toolkit controller that patches container image tags in Git
https://fluxcd.io
Apache License 2.0

Lack of context for failing reconciliation #498

Open tun0 opened 1 year ago

tun0 commented 1 year ago
{
  "level": "error",
  "ts": "2023-03-20T14:08:31.180Z",
  "msg": "Reconciler error",
  "controller": "imageupdateautomation",
  "controllerGroup": "image.toolkit.fluxcd.io",
  "controllerKind": "ImageUpdateAutomation",
  "ImageUpdateAutomation": {
    "name": "apps",
    "namespace": "flux-system"
  },
  "namespace": "flux-system",
  "name": "apps",
  "reconcileID": "081a105f-7672-4fc7-b532-26be91972eeb",
  "error": "object not found"
}

This doesn't provide enough context to determine what is actually going wrong here.

makkes commented 1 year ago

Take a look at the events of that object using flux events or kubectl describe.

tun0 commented 1 year ago

I have a pretty good idea what the underlying issue is here. But it would still be nice to see more info in the error itself.

Current idea: we have one Git repo for two clusters, both with webhooks for image updates. It's quite likely a race condition between the two Flux instances (one in each cluster). By the time one cluster tries to push, its HEAD is no longer up to date because it was already altered by the other cluster. We saw the same with Flux v1, but we didn't care about its logs as much as we do with v2 😉

Assuming the above is correct, it'd be nice for the controller to retry (basically rebase?) instead of failing. On the other hand, we should probably invest some time in proper monitoring and alerting instead of just dumping everything to Slack directly, since Kubernetes is kind of all about "it's okay to fail, sometimes", being (primarily) stateless and all that.
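
To make the retry idea concrete, here is a minimal Go sketch of a "re-sync and push again" loop. It is not the controller's actual code: pushWithRetry, fetchAndRebase and push are hypothetical helpers, and matching on a "non-fast-forward" substring is only an assumption about what the conflict error looks like.

package main

import (
    "errors"
    "fmt"
    "strings"
    "time"
)

// pushWithRetry sketches the "rebase and retry" idea: on a non-fast-forward
// failure, re-sync with the remote and push again instead of giving up.
// fetchAndRebase and push are hypothetical stand-ins for the real git plumbing.
func pushWithRetry(fetchAndRebase, push func() error, attempts int) error {
    var err error
    for i := 0; i < attempts; i++ {
        if err = push(); err == nil {
            return nil
        }
        // Only retry when another writer (e.g. the second cluster) moved the
        // branch under us; surface every other failure immediately.
        if !strings.Contains(err.Error(), "non-fast-forward") {
            return err
        }
        if rerr := fetchAndRebase(); rerr != nil {
            return fmt.Errorf("re-sync before retry failed: %w", rerr)
        }
        time.Sleep(time.Duration(i+1) * time.Second) // crude linear backoff
    }
    return fmt.Errorf("push still failing after %d attempts: %w", attempts, err)
}

func main() {
    // Simulate one conflicting push followed by a successful one.
    calls := 0
    push := func() error {
        calls++
        if calls == 1 {
            return errors.New("non-fast-forward update: refs/heads/main")
        }
        return nil
    }
    rebase := func() error { return nil }
    fmt.Println(pushWithRetry(rebase, push, 3)) // prints <nil>
}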

makkes commented 1 year ago

Oh I see. Agree that the error should contain more info on the root cause.

mantasaudickas commented 1 year ago

I have the same error: also no context, and I have no idea what's going on here :) Everything seems to be deployed, updated, etc. flux events does not show anything failing, and I have no idea what I could describe with kubectl; describing the image-automation-controller does not show any new events at the time the error is logged. The issue started with the upgrade to version 0.41.1.

kingdonb commented 1 year ago

There are a couple of reports of this type of failure (or potentially unrelated failures, e.g. git error code 128) showing up in the Slack channel. I haven't seen them filter down to reports against IAC yet, but it's something to be aware of.

I will load up some Image Update Automation controls today or tomorrow and try to reproduce one issue or the other; there is not much context to go on for what is causing the failure. I understand this report is not about one specific failure, but about the general case of failures not being reported very clearly, with an obvious link to a specific root cause.

{"level":"error","ts":"2023-04-03T19:27:38.561Z","msg":"Reconciler error","controller":"imageupdateautomation","controllerGroup":"image.toolkit.fluxcd.io","controllerKind":"ImageUpdateAutomation","ImageUpdateAutomation":{"name":"flux-system","namespace":"flux-system"},"namespace":"flux-system","name":"flux-system","reconcileID":"a43e903f-de19-4eaa-a7cd-e64a804d77fa","error":"malformed unpack status: \u00010069\u0001001dunpack index-pack failed\n0043ng refs/heads/main error processing packfiles: exit status 128\n0000"}

This is another example of that. This is the error returned from Git, and I'm not sure how much helpful parsing we can do. But to refocus: the subject of this report is making it clearer what has gone wrong when IAC fails. Maybe we can come up with some common failure scenarios and start classifying errors to raise them as conditions, based on pattern matching.
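
As a starting point for that classification idea, here is a rough Go sketch that pattern-matches raw git error text onto a small set of stable reasons which could then be surfaced as condition reasons or structured log fields. The categories and regular expressions are illustrative guesses based on the errors quoted in this thread, not the controller's actual behaviour.

package main

import (
    "fmt"
    "regexp"
)

// patterns maps recognisable fragments of git/transport errors to a stable
// reason string; everything here is an illustrative guess, not existing code.
var patterns = []struct {
    reason string
    re     *regexp.Regexp
}{
    {"PushConflict", regexp.MustCompile(`non-fast-forward|failed to update ref`)},
    {"RemotePackfileError", regexp.MustCompile(`unpack index-pack failed|error processing packfiles`)},
    {"ObjectMissing", regexp.MustCompile(`object not found`)},
    {"AuthFailure", regexp.MustCompile(`authentication required|authorization failed`)},
}

func classifyGitError(msg string) string {
    for _, p := range patterns {
        if p.re.MatchString(msg) {
            return p.reason
        }
    }
    return "Unknown"
}

func main() {
    fmt.Println(classifyGitError("malformed unpack status: unpack index-pack failed")) // RemotePackfileError
    fmt.Println(classifyGitError("object not found"))                                  // ObjectMissing
}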

mantasaudickas commented 1 year ago

We have migrated our repository to another provider (from bitbucket.org to GitLab), and it seems these errors are gone. Nothing else changed: cluster and Flux versions remained the same. All I did was flux uninstall and flux bootstrap with the new Git repository URL. So it seems bitbucket.org has some peculiarity which produces this error? For completeness: I also tried uninstalling and reinstalling against bitbucket.org, but it did not help.

PaulSxxxs commented 1 year ago

I'm having this same issue and went down the route of changing gitImplementation to libgit2, but we're using source.toolkit.fluxcd.io/v1beta2, where this field has been deprecated (https://github.com/fluxcd/source-controller/blob/main/docs/spec/v1beta1/gitrepositories.md#git-implementation).

Furthermore, v1beta2 recommends setting --feature-gates=OptimizedGitClones=false, which I don't know how to achieve. Any tips on how to enable this? (https://github.com/fluxcd/source-controller/blob/main/docs/spec/v1beta2/gitrepositories.md#optimized-git-clones)

@mantasaudickas, we're also using Bitbucket and have the scenario of multiple clusters using the same repo. Are you having good results so far?

mantasaudickas commented 1 year ago

@mantasaudickas, we're also using Bitbucket and have the scenario of multiple clusters using the same repo. Are you having good results so far?

I have switched one project to GitLab and another to GitHub (two independent clients). So far the "object not found" error message is gone in both of them. I did not try the options you mentioned.

The actual reason for the switch was a Bitbucket issue: once FluxCD makes a push, it's not possible to fetch that last push anymore (it is visible in the UI, but not fetchable into local copies and not visible in the git command-line history)... I don't know if it's a Flux or a Bitbucket issue, but it was solved by migrating to other providers.

dewe commented 1 year ago

... once FluxCD makes a push, it's not possible to fetch that last push anymore (it is visible in the UI, but not fetchable into local copies and not visible in the git command-line history)...

We've seen the same behaviour and have raised a support ticket with Bitbucket... still no solution though.

mantasaudickas commented 1 year ago

have raised a support ticket with Bitbucket... still no solution though

Same here. Since it was blocking us, we switched the manifest repository to another provider... and are now thinking about switching everything :)

tobiasjochheimenglund commented 1 year ago

Also getting the "object not found" error using Flux with Bitbucket. Image automation gets stuck, starting with an image-automation-controller error on refs/heads/master: failed to update ref. Manually pushing to the same repo seems to be a temporary fix.

PaulSxxxs commented 1 year ago

@tobiasjochheimenglund We're seeing the same behaviour; the image updater is returning non-fast-forward update: refs/heads/master even with --feature-gates=OptimizedGitClones=false, but images do get updated, so it does seem to be processing. I'm about to test this more thoroughly and will report back. For us too, the error disappears when someone makes a commit, which is thankfully quite frequent.

@dewe @mantasaudickas Do you have any more technical details you sent to Bitbucket to push the problem onto them, if it does seem to be Bitbucket-specific? Bitbucket didn't resolve it for us either, though they were helpful and pointed me towards git shallow clones potentially causing the issue. My support ticket was less technical and more of a query about shallow clones, repo health, and FluxCD.

I'll report back with any findings.

mantasaudickas commented 1 year ago

They did not ask for any technical details... all their communication sounded more like "please check this or that, and maybe we can run GC on your repo", and I have not heard from them since last Friday :) Not sure how a shallow clone would cause such an issue, but it sounds like just another "we don't know what is happening, and Git is not supposed to keep your history at all" :D

youest commented 1 year ago

Hello there, we have the same issue. Our configuration is multi-cluster, with different branches on the same Bitbucket repo. We are encountering this error only on one branch/cluster, but not in the others, at least until now. Is there any configuration we can add or change to get more details to investigate the problem? For example, does it make sense to increase the log level?

dewe commented 1 year ago

@PaulSxxxs At the same time we get object not found, the Bitbucket pipeline doesn't trigger automatically as expected. When trying to start the pipeline manually, it fails with "we couldn't clone the repository". There's definitely a correlation here. We have reported the pipeline-triggering problem, but got no actual response other than "Our engineering team is currently investigating this further".

PaulSxxxs commented 1 year ago

We had a similar issue committing from any git client for a time while "object not found" was occurring. It definitely felt like some sort of lock, but we couldn't figure it out, perhaps because it's a Bitbucket issue.

PaulSxxxs commented 1 year ago

I received this message from Bitbucket:

Syahrul commented:

G'day, Paul

A quick update on this issue.

Our development team noticed a pattern with the FluxCD issue with Bitbucket cloud. After thorough analysis, it has been determined that the issue is most likely caused by the go-git library being used by FluxCD. This library prematurely closes the connection before a push operation is completed.

To address this matter, we will release a fix tomorrow to mitigate the problem. Once the mitigation process is complete, we will provide you with an update.

We appreciate your patience and encourage you to reach out if you have any additional questions.

– Mohammad Syahrul Support Engineer APAC, Bitbucket cloud

mantasaudickas commented 1 year ago

Wondering if the "object not found" issue will be fixed by this, or whether it's related to something else :)

PaulSxxxs commented 1 year ago

I specifically spoke to them about "object not found" and gave some technical details... I'm fairly sure it will fix this.

dewe commented 1 year ago

Can't find any apparently related issue over at go-git... 🤔

hiddeco commented 1 year ago

As I happen to be a go-git maintainer as well, we would be really happy to see an issue being created in go-git with steps to reproduce (or any details they can share about how they determined the connection to be closed prematurely).

gregawoods commented 1 year ago

We use both Flux and Bitbucket and have been absolutely pulling our hair out over this issue. For what it's worth, we found that moving from https:// to ssh:// Git URLs seemed to make the behavior go away. That isn't always practical to do, however, so here's hoping that Bitbucket's fix works out.

PaulSxxxs commented 1 year ago

Bitbucket rolled out their fix, and for us everything has been working perfectly again.

mantasaudickas commented 1 year ago

Yeah... I moved my manifests back to Bitbucket as well, so it works, but I am again getting "object not found" messages :)

tun0 commented 12 months ago

Given that these object not found errors tend to be followed by a successful reconciliation, it seems there's already some retry mechanism in place. Depending on the details of that retry logic, it might make sense to just "ignore" the object not found error unless it persists after several retries?
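
A minimal Go sketch of that suggestion, assuming a per-object counter and a threshold (neither is an existing controller option): treat a failure as transient until it has occurred several reconciliations in a row.

package main

import "fmt"

// failureTracker counts consecutive failures per object and only reports once
// a threshold is reached; a success resets the counter. Purely illustrative.
type failureTracker struct {
    threshold int
    counts    map[string]int
}

func newFailureTracker(threshold int) *failureTracker {
    return &failureTracker{threshold: threshold, counts: map[string]int{}}
}

// observe records one reconciliation outcome and returns true when the failure
// has persisted long enough to be surfaced (e.g. alerted on).
func (t *failureTracker) observe(object string, failed bool) bool {
    if !failed {
        delete(t.counts, object)
        return false
    }
    t.counts[object]++
    return t.counts[object] >= t.threshold
}

func main() {
    t := newFailureTracker(3)
    obj := "flux-system/apps"
    fmt.Println(t.observe(obj, true))  // false: first transient failure, ignored
    fmt.Println(t.observe(obj, true))  // false
    fmt.Println(t.observe(obj, true))  // true: persisted, surface it
    fmt.Println(t.observe(obj, false)) // false: success resets the counter
}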

pjbgf commented 12 months ago

@tun0 Image Automation Controller would automatically retry on the next reconciliation, so yes it should be safe to disregard the "one-off" object not found error.

However, if you do find a pattern where you can reliably reproduce the issue, please report it upstream so it can be investigated and fixed.

mantasaudickas commented 12 months ago

However, if you do find a pattern where you can reliably reproduce the issue

It is still happening with Bitbucket Cloud :)

hiddeco commented 12 months ago

Then please report it upstream with more details on any patterns you observe while the error occurs (or, for example, information about the contents of your repository, its size, etc.).

There is little we can do from within the context of this repository, and it really has to be addressed there. Thanks for your cooperation.