Open shulin-sq opened 9 months ago
Thanks for highlighting this @shulin-sq. I think the challenge is that the log levels don't easily translate well to the eventually consistent/eventually reconcile nature of the controller.
Definitively categorizing errors can be challenging, for example API calls to Lattice would need to be subdivided into 400/500 errors, but sometimes a 4XX can happen for a ResourceNotFound which would be fixed after another reconcile attempt. Other 4XXs will never succeed, such as an invalid parameter (like a malformed ARN).
The rough approach now is that "everything" is an info
unless it's known to be an unrecoverable error, in which case we use error
. I think it's fine to challenge this, or advocate for specific log lines to be error
vs info
. If there are other controllers out there which have a more standardized approach you prefer, or if you have a more specific proposal on how to change these, I'm happy to consider it.
@erikfuller
I would suggest
[reconcile_result]
to make the text easily searchable.I did look around to see if there are some other projects we can use as an example but am coming up short. These are projects I checked to see if we can follow some guidelines (but not very helpful). However they do mention the usage of contexted log messages, which related to my other opened issue.
@shulin-sq that sounds reasonable enough. We may want to put it in even simpler terms that make it easier to determine log level correctness just looking at the code rather than fully knowing the context. For example:
Otherwise, I've certainly seen DEBUG logs running the controller locally, I'll double check I can see them when running via a deployment. I wasn't using JSON formatting though, but I'll try it out.
@erikfuller That sounds reasonable. What are next steps?
@shulin-sq
I think we have a few options. We can start by publishing a version this guidance in the contributing/
docs, then reference it for any new changes. We could start working our way through the codebase, even a file at a time would be better than no progress. Or we could start with just getting error
right at the top level reconciliation.
If you have a priority order of lines you care about most that might be another way to tackle it.
atm I don't have a date when someone here would be able to pick this up. For now, let me go ahead and re-tag it.
error logs are logged with info level, making it difficult to search through log lines to see what failed vs what are just informational lines
using version 1.0.1 but this hasn't changed in main branch either: https://github.com/aws/aws-application-networking-k8s/blob/main/pkg/controllers/route_controller.go#L155
Can warn/error logs be used more consistently? eg warn for errors that are ignored and error for errors that actually cause a reconciliation failure?
Another idea is to use some consistent messaging in log lines to indicate that something errored. eg if "failed_reconcile" will always include the error message of why the reconciliation failed that would be easy to search.