Closed: dmarkey closed this issue 3 years ago.
Usually fluxd will end up skipping over problematic manifests and applying the rest, rather than stalling. Can you give an example which causes a timeout? Is it a specific kind of problem with manifests that makes it happen?
Here's an example:
ts=2019-01-02T21:34:23.725646505Z caller=loop.go:118 component=sync-loop jobID=9e304438-e613-cca2-cf8f-180559731a53 state=done success=false err="applying changes: Traceback (most recent call last):\n File \"kubeyaml.py\", line 226, in
It may be a more general problem. I have an EKS cluster in us-west-2 and the latency between there and Dublin may be causing a problem.
Ah OK, that's the update code complaining (at length) that it can't apply a change you made because the YAML is malformed (or there's a bug in that bit of the update code; but it looks more like the former).
You are absolutely right that the error message could be more accessible! Here it's relying on the error returned from a Python library -- which seems to be returning the whole stack trace. Perhaps a first step would be to fish out the substance of the problem (and the location).
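To illustrate the "fish out the substance" idea: this is not Flux code, just a hypothetical sketch of reducing a Python traceback (like the one kubeyaml.py returns) to its final, human-readable line.

```python
# Illustrative sketch (not Flux's implementation): a CPython traceback's
# last non-empty line is the exception type and message, which is usually
# the part an operator actually needs.
def summarize_traceback(stderr_text: str) -> str:
    lines = [l for l in stderr_text.strip().splitlines() if l.strip()]
    return lines[-1].strip() if lines else stderr_text

# Example input shaped like the log message above (the error line here
# is invented for illustration).
tb = (
    "Traceback (most recent call last):\n"
    '  File "kubeyaml.py", line 226, in apply\n'
    "    ...\n"
    "yaml.scanner.ScannerError: mapping values are not allowed here"
)
print(summarize_traceback(tb))
# -> yaml.scanner.ScannerError: mapping values are not allowed here
```

Surfacing just that line (perhaps with the file/line location from the frame above it) would make the log message far more accessible than the full stack trace.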
Oh well my problem isn't with the stack trace in the logs, it's the timeout error that I find most annoying :)
You didn't give an example of the timeout -- is it what you get when using fluxctl, with the example posted being the log message from around the time of the fluxctl run?
Got same error today:
fluxctl sync
Synchronizing with git@github.com:some/secret.git
Failed to complete sync job (ID "bdeb4312-9560-3b8b-8324-616a3cf5ff99")
Error: timeout
Run 'fluxctl sync --help' for usage.
It timed out after a minute -- should I set the --git-timeout flag on fluxd to fix this?
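For reference, there are two separate deadlines in play; the flag names below are taken from Flux v1's daemon and CLI, so verify them against the --help output of your version. A rough sketch of the daemon side, as container args in the flux Deployment:

```yaml
# Hypothetical excerpt of the flux Deployment -- repo URL and values
# are illustrative, not a recommendation.
spec:
  template:
    spec:
      containers:
        - name: flux
          args:
            - --git-url=git@github.com:some/secret.git
            - --git-timeout=60s   # deadline for git operations (default is much shorter)
            - --sync-timeout=3m   # deadline for the sync/apply cycle
```

Separately, fluxctl has its own client-side deadline (e.g. fluxctl sync --timeout=5m), which is why the CLI can report "timeout" after about a minute even when the daemon eventually completes the job.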
I get the same error with fluxctl sync, it seems to time out often.
I got this using Helm 3 and a cluster in Azure.
I got this by installing Flux from Helm templates on OpenShift in GCP.
I am having the same problem. I manually deleted the namespace where my workload was, and now Flux never picks it up and applies it. I have made lots of changes and committed them, but nothing happens. And now fluxctl sync times out.
EDIT
Actually, now I'm realizing that if I run fluxctl list-workloads, it does tell me what the error is. That's in my scenario, though.
fluxctl sync times out more and more often as my cluster grows larger. It's completely non-deterministic, however: it succeeds sometimes, so I have no idea how to remedy this.
I have the same issue; checking the logs, I found this:
ts=2020-08-06T14:32:46.220230199Z caller=loop.go:108 component=sync-loop err="loading resources from repo: duplicate definition of 'app:ingress/app' (in deployments/app/ingress.yml and deployments/test/ingress.yml)"
The error itself is fine, since there really was a duplicate definition, but why does Flux get stuck and return a timeout instead of reporting it?
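For anyone hitting the same sync-loop error: Flux v1 identifies a resource by namespace, kind, and name, so two manifest files that declare the same object trip the duplicate check regardless of which directory they live in. An illustrative sketch (paths and names invented to match the error above):

```yaml
# deployments/app/ingress.yml -- and an identical header block in
# deployments/test/ingress.yml. Both resolve to the same resource ID,
# 'app:ingress/app' (namespace:kind/name), triggering the
# "duplicate definition" error in the sync loop.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: app
  namespace: app
```

Renaming one of the objects (or moving it to a different namespace) clears the error; the open question in this issue is why the failure surfaces as a timeout to fluxctl rather than as the underlying message.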
Flux version 1.20.0.
Thank you for your reports. Flux v1 is in maintenance mode and can no longer accept breaking changes, including changes to logging or output behavior, since developers may have built integrations that depend on the specific structure and behavior of the log output and the fluxctl CLI. Maintenance mode only allows non-breaking changes, with a focus on security and critical bug fixes.
Development efforts are focused on Flux v2 which has made great inroads towards this goal of helping narrow errors to the source. Please consider upgrading to Flux v2 and continue to report any issues you may find related to usability or otherwise.
See https://toolkit.fluxcd.io/core-concepts/ -- the new architecture is constrained in ways that make surprising failures easier to report on, and make errors more straightforward to trace back to their source from a human operator's or cluster operator's perspective.
In the interest of reducing the number of open issues not directly related to supporting Flux v1 in maintenance mode, I will go ahead and close this issue for now. Thanks for using Flux!
I've come across annoying behaviour where, when there is a problem in the manifests, fluxctl will time out (usually after a prolonged amount of time). It should catch whatever error occurred and return it to the user instead.