PlatformLab / NanoLog

Nanolog is an extremely performant nanosecond scale logging system for C++ that exposes a simple printf-like API.
Other
3k stars 343 forks source link

How to tune NanoLog to better handle burst? #46

Closed wizwx closed 3 years ago

wizwx commented 4 years ago

I'm testing NanoLog in a program, and noticed that roughly half of the messages were dropped in the log.

Here is the setting of my testing program. The program is multithreaded, and one of the worker threads is reading some msg (roughly 0.5k bytes per msg) from socket and log it and then process it. During the test, I read roughly ~160 msgs within 2 min from the socket, and logged all of them. However, only ~80 of them showed up in the log file. One of the cases I noticed is that the program handled ~10 msgs within 2 mini second, and only the first and last of the 10 msgs were successfully logged.

If I tweak the program to sleep for 1 second or call NanoLog::sync before logging the msg, all ~160 messages will be logged successfully.

So I looked at the runtime/Config.h file, and tried the following two settings, but strangely enough I saw no improvement and still roughly half of the messages were dropped. STAGING_BUFFER_SIZE = 1<<25 RELEASE_THRESHOLD = 1<<19

STAGING_BUFFER_SIZE = 1<<25 RELEASE_THRESHOLD = 1<<20

Any suggestions on how I should tune the parameters to better handle burstiness? Thank you.

syang0 commented 4 years ago

Hm, that is very strange. The statistics you gave should be well within NanoLog's performance envelope. Even if the messages were 1KB (rounded up), that should only result in 160KB of space, and the default configuration allocates ~1MB of staging buffer space per thread. There should be enough space to handle your workload. Additionally, NanoLog is a guaranteed logger, so all statements should have executed before continuing to the next line of code in your application.

Is there some source code you could point me to that would reproduce this bug?

In the mean time, could you perform some sanity checks for me:

  1. Does the code path with the NanoLog statement execute for sure (i.e. if you add a regular printf("%d\r\n", counter++) next to the NanoLog statement, does it execute?).
  2. Is sync() executed before the application exits to dump the remaining log statements to disk?
  3. Is there any pattern to the log statements that made it to disk? For example, do all the log statements only come from one thread when multiple are logging, are all the log statements at the head/tail of the execution, or are they only dropped in a burst?
wizwx commented 4 years ago

I will have to dig further to understand why.

For your sanity check questions,

  1. the NanoLog is for sure executed. If I simply add NanoLog::sync or sleep(1) before the log statement, none of the logged message will be missing in the log file; if not then about half of the messages are missing; The reason I noticed the messages are missing is because I count the message processed and the count does not match with the messages extracted from the logs;

  2. I registered atexit to call NanoLog::sync. Also the messages are logged in the same thread and the missing messages can be in the middle, not always at the end.

  3. All the missing messages come from the same thread. the dropped messages seem to be related burstiness, as in one case i see 7 out of 9 messages are dropped, and the 9 messages are on the socket within 2 mini second. I just noticed the issue and I did not have a chance to drill down further yet to confirm any pattern. Also I actually play/logged another 80 messages with roughly the same size from another socket and thread before playing/recording the 160 messages (on another socket/thread). For the first 80 messages, I never see any missing messages in the log file.

Another strange thing is that it does not help if I add NanoLog::sync right after the NANO_LOG statement, and still about half of the messages are dropped in that case. I have to add NanoLog::sync right before the Log statement and then no more missing messages.

I will have to play with it a little bit more and see what I can get. Will let you know if any finding.

wizwx commented 3 years ago

There is no bug in NanoLog. In the same thread where I logged the received message, I had another log that has a typo in the format string, where "%s" is used for an integer value by mistake. So if several messages are encoded in one call of encodeLogMsgs in the compression thread, all messages after the log message with wrong format are lost in the decompressor.