elastic / apm-agent-go

https://www.elastic.co/guide/en/apm/agent/go/current/index.html
Apache License 2.0
418 stars 192 forks source link

Logrus hook causing performance issues #1422

Open renta opened 1 year ago

renta commented 1 year ago

Describe the bug It is not a bug report but an attempt to warn on harm from code of logrus logger hook. https://github.com/elastic/apm-agent-go/blob/main/module/apmlogrus/hook.go In our case, this code produce a 5 second performance penalty on API endpoints which produces errors (like http 404 error). Moreover, since this hook was a part of a custom legacy library, it was hard to find a reason of such behavior. But switching off a tracer integration removed the lag, so an Apm Agent and tracing was a first suspect.

To Reproduce Steps to reproduce the behavior:

  1. Switch on the hook.
  2. Make a http middleware or grpc interceptor which will create a critical log entry from an error of a main server function. I know, this is silly decision, but please take into account that I'm talking about the legacy code and library.
  3. Use a broken link to an APM hostname or a loaded ELK APM tracer on a loaded production service.
  4. Requests with errors would have a 5 seconds performance penalty.

Expected behavior I think the code of this hook should be deprecated and abandoned in the future. My points for this decision:

axw commented 1 year ago

apmlogrus attempts to flush for up to 5 seconds on fatal level log messages: https://github.com/elastic/apm-agent-go/blob/c953eed90088f712eb7ff04203d9359081912047/module/apmlogrus/hook.go#L118-L120

Given that the default behaviour for fatal level logging is for logrus to exit the process, it seems like a reasonable behaviour. You can override the timeout by setting Hook.FatalFlushTimeout: https://github.com/elastic/apm-agent-go/blob/c953eed90088f712eb7ff04203d9359081912047/module/apmlogrus/hook.go#L67

dmathieu commented 1 year ago

Thank you for this issue. I just renamed it to have a more meaningful (and less clickbait-y) title.

renta commented 1 year ago

I admit that we've faced performance issue because of some silly decisions in an old library and misconfigured logger. But I also want to point out that it's a bad system design to rule one infra subsystem from another with some hookish code. It's not obvious for people who would debug same problems that the root of such a lag is a logger but not the tracer or APM agent itself.