Closed robertkowalski closed 9 years ago
Seems like the root of the perceived slowness is in lager_trunc_io - which is not necessarily built for performance but for safety. One of the reasons it exists is so can truncate long (>4096 bytes) log messages which get sent by OTP stuff (gen_event or gen_fsm for example) which dumps its ENTIRE state when it fails. That might be upwards of 10MB or more depending on what its processing. As @Vagabond explains in his tech talk about lager, those types of error messages are ticking time bombs for your application availability.
The forthcoming lager 3.0 branch we are preparing to release does add some performance benefits, most clearly seen when there is more than one sink configured for messages to drain. I know it may not directly address your performance concerns about message formatting, but our benchmarks do show a modest improvement in logging performance in the 3.x branch even when there's only 1 sink configured.
All of that not withstanding, I am definitely interested in wringing more performance out of lager. So I am all ears for ideas (or PRs) you have for the formatting code which retains its safety characteristics and is also speedier.
cc @macintux
I should also say that in general, I am :-1: about moving formatting into the gen_event server. One of the reasons we're inlining that in the parse transform is to distribute the effort of formatting across many processes instead of one bottleneck server which is already probably busy with other stuff to manage.
Yeah, formatting in the caller can definitely save you when you're handling many events or using many backends.
trunc_io may not be as fast as it could be, what kind of thing are you logging?
Perhaps we could add an unsafe_format
function for people who think they won't be sending stupidly-large lists. lager:info(..., [be_fast_not_safe])
Thank you for your answers! I did not knew why the formatting is not part of the gen_event server, thanks for explaining.
@Vagabond we are just logging requests that the server gets:
couch_log:notice("~s - - ~s ~s ~B", [
Peer,
MochiReq:get(method),
MochiReq:get(raw_path),
Code
]),
resulting in these log messages:
2015-06-22 18:23:10.313 [notice] node1@127.0.0.1 <0.2910.0> 127.0.0.1 - - GET /_nodes/_all_docs?limit=21 304
I would be ok with a 'unsafe format' option for people who know a particular log message should be bounded in size and not accidentally log giant amounts of data.
@robertkowalski @Vagabond Please take the branch unsafe_format for a spin.
We intend for for end-users to invoke this behavior by adding _unsafe
to the severity level which will be handled at parse transform time. If modifying your code in that way for benchmarking purposes is tedious, you can replace the usual FormatFun()
here with the commented out function.
If things look good and y'all are :+1: with this, we will add it to the forthcoming 3.0 release.
@mrallen1 wow this is exciting! I was able to measure a 4% performance improvement with your branch compared to the same branch in a version which did not have https://github.com/basho/lager/commit/512fac4a210f3a2646ff340590d365a91ef48417.
awesome thank you so much!
I am :+1: - @Vagabond what do you think?
oh and I tested it by pulling in that branch and modifying the calls to lager with appending an _unsafe
It looks good, although I have not tested it. The XXX Fixme in the comments looks a little dodgy though, although I don't understand the comment.
@Vagabond We are going to remove the XXX FIXME
before we merge into the release. :)
Just opened #278 to get this into the 3.x branch.
3.0 is merged on master now. Closing this.
Awesome thank you!
Hi and first of all thanks for lager!
I observed that lager is one of the main bottleneck in our database http handlers (see [1]). The database operations at the end of the x axis seem to take a minority of the time it take us to log (see at the end of the flamegraph which is also provided as svg in [2])
It seems that if we try to let lager format our log message, it takes a huge amount of time (see [2]) - but if we pass in a simple list, it gets really fast again (~4% speed improvement)
Are there any ways to speed that up from the lager side which does not require us to handcraft the log message using a list?
What do you think?
[1] https://github.com/apache/couchdb-couch/pull/57 [2] http://robert-kowalski.de/assets/data/2015-06-01-erlang-perf-2/flame--second--unpatched.svg [3] https://github.com/basho/lager/blob/95f5a133af6b5b42635f7079b626ac1886eccde2/src/lager.erl#L109 [4] https://github.com/basho/lager/blob/95f5a133af6b5b42635f7079b626ac1886eccde2/src/lager.erl#L117