erlang-lager / lager

A logging framework for Erlang/OTP
Apache License 2.0
1.13k stars 455 forks source link

Lager performance improvements [JIRA: RIAK-1883] #273

Closed robertkowalski closed 9 years ago

robertkowalski commented 9 years ago

Hi and first of all thanks for lager!

I observed that lager is one of the main bottleneck in our database http handlers (see [1]). The database operations at the end of the x axis seem to take a minority of the time it take us to log (see at the end of the flamegraph which is also provided as svg in [2])

It seems that if we try to let lager format our log message, it takes a huge amount of time (see [2]) - but if we pass in a simple list, it gets really fast again (~4% speed improvement)

Are there any ways to speed that up from the lager side which does not require us to handcraft the log message using a list?

  1. Maybe we can move the format call for log messages [3] into the gen_event_server [4] and by doing that decouple it from the client call?
  2. Maybe we can work on the formatter making it more performant?

What do you think?

[1] https://github.com/apache/couchdb-couch/pull/57 [2] http://robert-kowalski.de/assets/data/2015-06-01-erlang-perf-2/flame--second--unpatched.svg [3] https://github.com/basho/lager/blob/95f5a133af6b5b42635f7079b626ac1886eccde2/src/lager.erl#L109 [4] https://github.com/basho/lager/blob/95f5a133af6b5b42635f7079b626ac1886eccde2/src/lager.erl#L117

jadeallenx commented 9 years ago

Seems like the root of the perceived slowness is in lager_trunc_io - which is not necessarily built for performance but for safety. One of the reasons it exists is so can truncate long (>4096 bytes) log messages which get sent by OTP stuff (gen_event or gen_fsm for example) which dumps its ENTIRE state when it fails. That might be upwards of 10MB or more depending on what its processing. As @Vagabond explains in his tech talk about lager, those types of error messages are ticking time bombs for your application availability.

The forthcoming lager 3.0 branch we are preparing to release does add some performance benefits, most clearly seen when there is more than one sink configured for messages to drain. I know it may not directly address your performance concerns about message formatting, but our benchmarks do show a modest improvement in logging performance in the 3.x branch even when there's only 1 sink configured.

All of that not withstanding, I am definitely interested in wringing more performance out of lager. So I am all ears for ideas (or PRs) you have for the formatting code which retains its safety characteristics and is also speedier.

cc @macintux

jadeallenx commented 9 years ago

I should also say that in general, I am :-1: about moving formatting into the gen_event server. One of the reasons we're inlining that in the parse transform is to distribute the effort of formatting across many processes instead of one bottleneck server which is already probably busy with other stuff to manage.

Vagabond commented 9 years ago

Yeah, formatting in the caller can definitely save you when you're handling many events or using many backends.

trunc_io may not be as fast as it could be, what kind of thing are you logging?

macintux commented 9 years ago

Perhaps we could add an unsafe_format function for people who think they won't be sending stupidly-large lists. lager:info(..., [be_fast_not_safe])

robertkowalski commented 9 years ago

Thank you for your answers! I did not knew why the formatting is not part of the gen_event server, thanks for explaining.

@Vagabond we are just logging requests that the server gets:

    couch_log:notice("~s - - ~s ~s ~B", [
        Peer,
        MochiReq:get(method),
        MochiReq:get(raw_path),
        Code
    ]),

resulting in these log messages:

2015-06-22 18:23:10.313 [notice] node1@127.0.0.1 <0.2910.0> 127.0.0.1 - - GET /_nodes/_all_docs?limit=21 304
Vagabond commented 9 years ago

I would be ok with a 'unsafe format' option for people who know a particular log message should be bounded in size and not accidentally log giant amounts of data.

jadeallenx commented 9 years ago

@robertkowalski @Vagabond Please take the branch unsafe_format for a spin.

We intend for for end-users to invoke this behavior by adding _unsafe to the severity level which will be handled at parse transform time. If modifying your code in that way for benchmarking purposes is tedious, you can replace the usual FormatFun() here with the commented out function.

If things look good and y'all are :+1: with this, we will add it to the forthcoming 3.0 release.

robertkowalski commented 9 years ago

@mrallen1 wow this is exciting! I was able to measure a 4% performance improvement with your branch compared to the same branch in a version which did not have https://github.com/basho/lager/commit/512fac4a210f3a2646ff340590d365a91ef48417.

awesome thank you so much!

robertkowalski commented 9 years ago

I am :+1: - @Vagabond what do you think?

robertkowalski commented 9 years ago

oh and I tested it by pulling in that branch and modifying the calls to lager with appending an _unsafe

Vagabond commented 9 years ago

It looks good, although I have not tested it. The XXX Fixme in the comments looks a little dodgy though, although I don't understand the comment.

jadeallenx commented 9 years ago

@Vagabond We are going to remove the XXX FIXME before we merge into the release. :)

jadeallenx commented 9 years ago

Just opened #278 to get this into the 3.x branch.

jadeallenx commented 9 years ago

3.0 is merged on master now. Closing this.

robertkowalski commented 9 years ago

Awesome thank you!