aws / aws-application-networking-k8s

A Kubernetes controller for Amazon VPC Lattice
https://www.gateway-api-controller.eks.aws.dev/
Apache License 2.0
175 stars 50 forks source link

Log lines have inconsistent log levels and formats #602

Open shulin-sq opened 9 months ago

shulin-sq commented 9 months ago

error logs are logged with info level, making it difficult to search through log lines to see what failed vs what are just informational lines


{"level":"info","ts":"2024-02-14T21:49:12.387Z","logger":"controller.route","caller":"controllers/route_controller.go:155","msg":"reconcile er
ror","name":"traffic-exemplar","message":"error during service synthesis failed ServiceManager.Upsert REDACTED due to
 ValidationException: customDomainName is required when certificateArn is provided.\n{\n  RespMetadata: {\n    StatusCode: 400,\n    RequestID
: \"1e3f12ab-4038-4212-9d94-f2f186be5c1b\"\n  },\n  FieldList: [{\n      Message: \"customDomainName is required when certificateArn is provid
ed.\",\n      Name: \"certificateArn\"\n    }],\n  Message_: \"customDomainName is required when certificateArn is provided.\",\n  Reason: \"f
ieldValidationFailed\"\n}"}

using version 1.0.1 but this hasn't changed in main branch either: https://github.com/aws/aws-application-networking-k8s/blob/main/pkg/controllers/route_controller.go#L155

Can warn/error logs be used more consistently? eg warn for errors that are ignored and error for errors that actually cause a reconciliation failure?

Another idea is to use some consistent messaging in log lines to indicate that something errored. eg if "failed_reconcile" will always include the error message of why the reconciliation failed that would be easy to search.

erikfuller commented 8 months ago

Thanks for highlighting this @shulin-sq. I think the challenge is that the log levels don't easily translate well to the eventually consistent/eventually reconcile nature of the controller.

Definitively categorizing errors can be challenging, for example API calls to Lattice would need to be subdivided into 400/500 errors, but sometimes a 4XX can happen for a ResourceNotFound which would be fixed after another reconcile attempt. Other 4XXs will never succeed, such as an invalid parameter (like a malformed ARN).

The rough approach now is that "everything" is an info unless it's known to be an unrecoverable error, in which case we use error. I think it's fine to challenge this, or advocate for specific log lines to be error vs info. If there are other controllers out there which have a more standardized approach you prefer, or if you have a more specific proposal on how to change these, I'm happy to consider it.

shulin-sq commented 8 months ago

@erikfuller

I would suggest

I did look around to see if there are some other projects we can use as an example but am coming up short. These are projects I checked to see if we can follow some guidelines (but not very helpful). However they do mention the usage of contexted log messages, which related to my other opened issue.

erikfuller commented 8 months ago

@shulin-sq that sounds reasonable enough. We may want to put it in even simpler terms that make it easier to determine log level correctness just looking at the code rather than fully knowing the context. For example:

Otherwise, I've certainly seen DEBUG logs running the controller locally, I'll double check I can see them when running via a deployment. I wasn't using JSON formatting though, but I'll try it out.

shulin-sq commented 8 months ago

@erikfuller That sounds reasonable. What are next steps?

erikfuller commented 8 months ago

@shulin-sq I think we have a few options. We can start by publishing a version this guidance in the contributing/ docs, then reference it for any new changes. We could start working our way through the codebase, even a file at a time would be better than no progress. Or we could start with just getting error right at the top level reconciliation.

If you have a priority order of lines you care about most that might be another way to tackle it.

atm I don't have a date when someone here would be able to pick this up. For now, let me go ahead and re-tag it.