Logrus hook causing performance issues

renta commented 1 year ago

Describe the bug It is not a bug report but an attempt to warn on harm from code of logrus logger hook. https://github.com/elastic/apm-agent-go/blob/main/module/apmlogrus/hook.go In our case, this code produce a 5 second performance penalty on API endpoints which produces errors (like http 404 error). Moreover, since this hook was a part of a custom legacy library, it was hard to find a reason of such behavior. But switching off a tracer integration removed the lag, so an Apm Agent and tracing was a first suspect.

To Reproduce Steps to reproduce the behavior:

Switch on the hook.
Make a http middleware or grpc interceptor which will create a critical log entry from an error of a main server function. I know, this is silly decision, but please take into account that I'm talking about the legacy code and library.
Use a broken link to an APM hostname or a loaded ELK APM tracer on a loaded production service.
Requests with errors would have a 5 seconds performance penalty.

Expected behavior I think the code of this hook should be deprecated and abandoned in the future. My points for this decision:

the hook intended to work synchronously, so it's usage in the servers middlewares and interceptors will slow down responses (in case of busy APM Server)
since the unpredictable calls of this hook, the flushing load to the APM Server is also upredictable. It could nullify Elastic team attempts to optimise the a communication between the APM Server and the Go Agent.
5 seconds timeout context is too long
this hook breaks an isolation of different infrastructure subsystem of the service: logger for some reason decides to flush calls to an APM server. It's quite an unexpectable behavior.
if you want to flush a tracer for some reason then it's better to use Agent's tracer itself or handle panics via https://www.elastic.co/guide/en/apm/agent/go/current/custom-instrumentation.html#custom-instrumentation-errors instead of hooks.

axw commented 1 year ago

apmlogrus attempts to flush for up to 5 seconds on fatal level log messages: https://github.com/elastic/apm-agent-go/blob/c953eed90088f712eb7ff04203d9359081912047/module/apmlogrus/hook.go#L118-L120

Given that the default behaviour for fatal level logging is for logrus to exit the process, it seems like a reasonable behaviour. You can override the timeout by setting Hook.FatalFlushTimeout: https://github.com/elastic/apm-agent-go/blob/c953eed90088f712eb7ff04203d9359081912047/module/apmlogrus/hook.go#L67

dmathieu commented 1 year ago

Thank you for this issue. I just renamed it to have a more meaningful (and less clickbait-y) title.

renta commented 1 year ago

I admit that we've faced performance issue because of some silly decisions in an old library and misconfigured logger. But I also want to point out that it's a bad system design to rule one infra subsystem from another with some hookish code. It's not obvious for people who would debug same problems that the root of such a lag is a logger but not the tracer or APM agent itself.

elastic / apm-agent-go

Logrus hook causing performance issues #1422