fluxcd / flux

Successor: https://github.com/fluxcd/flux2
https://fluxcd.io
Apache License 2.0

Fluxctl sync should return a meaningful error instead of timing out #1624

Closed: dmarkey closed this issue 3 years ago

dmarkey commented 5 years ago

I've come across annoying behaviour: when there is a problem in the manifests, fluxctl will time out (usually after a prolonged wait). It should catch whatever error occurred and return it to the user instead.

squaremo commented 5 years ago

Usually fluxd will end up skipping over problematic manifests and applying the rest, rather than stalling. Can you give an example which causes a timeout? Is it a specific kind of problem with manifests that makes it happen?

dmarkey commented 5 years ago

Here's an example:

ts=2019-01-02T21:34:23.725646505Z caller=loop.go:118 component=sync-loop jobID=9e304438-e613-cca2-cf8f-180559731a53 state=done success=false err="applying changes: Traceback (most recent call last):\n File \"kubeyaml.py\", line 226, in \n File \"kubeyaml.py\", line 221, in main\n File \"kubeyaml.py\", line 53, in apply_to_yaml\n File \"kubeyaml.py\", line 59, in update_image\n File \"site-packages/ruamel/yaml/main.py\", line 363, in load_all\n File \"site-packages/ruamel/yaml/constructor.py\", line 101, in get_data\n File \"site-packages/ruamel/yaml/constructor.py\", line 118, in construct_document\n File \"site-packages/ruamel/yaml/constructor.py\", line 1508, in construct_yaml_map\n File \"site-packages/ruamel/yaml/constructor.py\", line 1414, in construct_mapping\n File \"site-packages/ruamel/yaml/constructor.py\", line 279, in check_mapping_key\nruamel.yaml.constructor.DuplicateKeyError: while constructing a mapping\n in \"\", line 9, column 5\nfound duplicate key \"flux.weave.works/automated\" with value \"true\" (original value: \"true\")\n in \"\", line 12, column 5\n\nTo suppress this check see:\n http://yaml.readthedocs.io/en/latest/api.html#duplicate-keys\n\nDuplicate keys will become an error in future releases, and are errors\nby default when using the new API.\n\nFailed to execute script kubeyaml"

dmarkey commented 5 years ago

It may be a more general problem. I have an EKS cluster in us-west-2 and the latency between there and Dublin may be causing a problem.

squaremo commented 5 years ago

Ah OK, that's the update code complaining (at length) that it can't apply a change you made because the YAML is malformed (or there's a bug in that bit of the update code; but it looks more like the former).

You are absolutely right that the error message could be more accessible! Here it's relying on the error returned from a Python library -- which seems to be returning the whole stack trace. Perhaps a first step would be to fish out the substance of the problem (and the location).
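
To illustrate (the workload name and tag filter below are made up), a manifest along these lines reproduces that DuplicateKeyError, because the automation annotation appears twice in the same annotations mapping:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app                                 # hypothetical workload name
  annotations:
    flux.weave.works/automated: "true"
    flux.weave.works/tag.my-app: semver:~1.0   # hypothetical tag filter
    flux.weave.works/automated: "true"         # duplicate key -- this is what ruamel.yaml rejects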

dmarkey commented 5 years ago

Oh well, my problem isn't with the stack trace in the logs; it's the timeout error that I find most annoying :)

squaremo commented 5 years ago

You didn't give an example of the timeout -- is it what you get when using fluxctl, with the posted log message being what fluxd recorded at the time of the fluxctl run?

yellowmegaman commented 5 years ago

Got the same error today:

fluxctl sync
Synchronizing with git@github.com:some/secret.git
Failed to complete sync job (ID "bdeb4312-9560-3b8b-8324-616a3cf5ff99")
Error: timeout
Run 'fluxctl sync --help' for usage.

It timed out after a minute. Should I set the --git-timeout flag for fluxd to fix this?
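
Something like the following is what I have in mind for the fluxd Deployment; the flag values and container details here are just guesses on my part, so check fluxd --help (and fluxctl sync --help for any client-side timeout) against your version:

# Illustrative fluxd Deployment fragment; values are assumptions, not documented defaults
spec:
  template:
    spec:
      containers:
      - name: flux
        image: docker.io/fluxcd/flux:<your-version>
        args:
        - --git-url=git@github.com:some/secret.git
        - --git-poll-interval=5m
        - --git-timeout=60s   # raise this if git operations are slow over high-latency links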

GODBS commented 5 years ago

I get the same error with fluxctl sync; it seems to time out often.

rbitia commented 5 years ago

I got this using Helm 3 and a cluster in Azure.

Dimss commented 5 years ago

I got this after installing Flux from Helm templates on OpenShift in GCP.

nerusnayleinad commented 4 years ago

I am having the same problem. I manually deleted the namespace where my workload was, and now there is no way for Flux to catch that and apply it again. I have made lots of changes and committed them, but nothing happens. And now fluxctl sync times out.

EDIT: Actually, now I'm realizing that if I run fluxctl list-workloads, it does tell me what the error is. That's in my scenario, though.

JVMartin commented 4 years ago

fluxctl sync times out more and more often as my cluster grows larger.

It's completely non-deterministic, however. I have no idea how to remedy this, because it succeeds sometimes.

dignajar commented 4 years ago

I have the same issue. Checking the logs, I got this:

ts=2020-08-06T14:32:46.220230199Z caller=loop.go:108 component=sync-loop err="loading resources from repo: duplicate definition of 'app:ingress/app' (in deployments/app/ingress.yml and deployments/test/ingress.yml)"

The error itself makes sense, since there was a duplicated definition, but why does Flux get stuck and return a timeout?

Flux version 1.20.0.
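
If I read the ID app:ingress/app as namespace:kind/name, then the two files boil down to something like this (trimmed, with illustrative contents):

# deployments/app/ingress.yml
apiVersion: networking.k8s.io/v1beta1
kind: Ingress
metadata:
  name: app
  namespace: app

# deployments/test/ingress.yml -- same namespace, kind and name, hence the duplicate definition
apiVersion: networking.k8s.io/v1beta1
kind: Ingress
metadata:
  name: app
  namespace: app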

kingdonb commented 3 years ago

Thank you for your reports. Flux v1 is in maintenance mode and can no longer accept breaking changes, including those that modify behavior surrounding logging or output, since developers may have built integrations that depend on the specific structure and behavior of the log output and the fluxctl CLI. Maintenance mode can only take in changes that are non-breaking, with a focus on security or critical bug fixes.

Development efforts are focused on Flux v2 which has made great inroads towards this goal of helping narrow errors to the source. Please consider upgrading to Flux v2 and continue to report any issues you may find related to usability or otherwise.

https://toolkit.fluxcd.io/core-concepts/ describes the new architecture, which is constrained in ways that make surprising failures easier to report on and tracing errors to their source more straightforward, from a human-operator or cluster-operator perspective.

In the interest of reducing the number of open issues not directly related to supporting Flux v1 in maintenance mode down to something manageable, I will go ahead and close out this issue for now. Thanks for using Flux!