gabime / spdlog

Fast C++ logging library.

Logger hangs with multi-threaded sinks #3104

Closed: Tinyik closed this issue 3 months ago

Tinyik commented 3 months ago

Description

Occasionally, we've noticed cases in our application where two adjacent calls to SPDLOG_INFO have a timestamp gap of several minutes, which suggests a hang somewhere in our usage of spdlog.

If I understand it correctly from the documentation, SPDLOG_INFO just calls into the default logger, which in our app is shared between two threads, and initialized with two multi-threaded sinks:

  ...
  // Various output sinks go here.
  std::vector<spdlog::sink_ptr> sinks;

  // Add file sink
  if (!logfile_path.empty()) {
    sinks.push_back(
        std::make_shared<spdlog::sinks::basic_file_sink_mt>(logfile_path, false));
  }

  // Add console sink
  if (with_console || sinks.empty()) {
    sinks.push_back(std::make_shared<spdlog::sinks::stdout_color_sink_mt>());
  }

  ...

  logger = std::make_shared<spdlog::logger>(name, sinks.begin(), sinks.end());  // <-------------------

  // Set logging level
  logger->set_level(static_cast<spdlog::level::level_enum>(PLN_LOG_LEVEL));
  logger->flush_on(static_cast<spdlog::level::level_enum>(PLN_LOG_LEVEL));

  // Replace default logger
  spdlog::set_default_logger(logger);  // <-------------------

The application is linked against a static library whose initializer takes a log function. In main(), the app passes a wrapper around SPDLOG_INFO to the library initializer so that the library can log from a dynamically created thread it owns, thus sharing the same logger with the app's main thread.
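The wrapper is roughly of this shape (a minimal sketch; library_init and the callback signature are illustrative, not our exact code):

#include <spdlog/spdlog.h>

// Illustrative callback type that the static library accepts at init time.
using log_fn = void (*)(const char* msg);

void log_wrapper(const char* msg) {
  SPDLOG_INFO("{}", msg);  // forwards to the shared default logger
}

// In main(), after the logger setup above:
//   library_init(&log_wrapper);  // the library may invoke this from its own thread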

In this setup, is there any obvious cause for such hangs? My understanding is that the logger is thread safe as long as its sinks are thread safe (the ones with the _mt suffix)?

Thank you; any help would be appreciated!

Version

We are using an older version of spdlog: v1.10.0.

tt4g commented 3 months ago

If spdlog is called many times in a short period, some time is expected to be spent in the critical sections used for thread safety. However, that would not normally explain a gap of several minutes between log timestamps.

Is there any reason to be certain that it is being caused by spdlog? Is there some basis for this, such as "the problem does not reproduce when using a single sink"?

tt4g commented 3 months ago

Assuming the hang is in spdlog, the most likely cause is that it is taking a long time to acquire the std::mutex it uses for thread safety. Another possibility is that the operation of writing the log to disk is taking a long time. However, these are OS-level operations, so spdlog cannot work around them.

These investigations are not easy, so first we need to be sure that it is indeed spdlog that is hanging.

Tinyik commented 3 months ago

@tt4g thanks for the response!

Is there any reason to be certain that it is being caused by spdlog?

I believe this is the only explanation, since the hang happens between back-to-back spdlog invocations with nothing in between.

Do you see anything obviously wrong in the way spdlog is initialized and used in the snippet above? I assume spdlog::logger is thread safe in this setup?

I wasn't able to find documentation about this; the only remark I found on the wiki is about basic_logger_mt, which does support multi-threading but only writes output to a file.

tt4g commented 3 months ago

Do you see anything obviously wrong in the way spdlog is initialized and used in the snippet above? I assume spdlog::logger is thread safe in this setup?

It is not possible to determine from the snippet whether static_cast<spdlog::level::level_enum>(PLN_LOG_LEVEL) is a safe operation. Otherwise, it seems to be calling the thread-safe API correctly.

If the log times are far apart on two back-to-back logging API calls, you need to find out what is hanging. You are using two sinks; can you reproduce the problem using only one of them? If the problem only occurs with one of the sinks, then there may be a problem such as slow writes to that sink's output destination.

Note that spdlog gets the current time from the system clock, so if the system clock changes, the timestamps printed in the log may also shift significantly.

tt4g commented 3 months ago

FYI.

If you are piping STDOUT (stdout_color_sink_mt) to another program, it is possible that the hang is caused by writing to stdout. For example, if you use a command-line pipe, UNIX-like OSes provide a buffer for the pipe, but if the pipe buffer is consumed slowly, the buffer will fill up and write(2) will block until it is drained.

https://man7.org/linux/man-pages/man7/pipe.7.html

A pipe has a limited capacity. If the pipe is full, then a write(2) will block or fail, depending on whether the O_NONBLOCK flag is set (see below).
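As a toy illustration, independent of spdlog, a writer like the following POSIX-only sketch will block once its stdout pipe fills and the reader stops consuming, e.g. when run as ./a.out | sleep 1000:

#include <cstdio>
#include <unistd.h>

int main() {
  const char line[] = "some log output\n";
  for (long i = 0;; ++i) {
    // Once the pipe buffer (typically 64 KiB on Linux) is full and the
    // reader is not consuming, this write(2) blocks indefinitely.
    write(STDOUT_FILENO, line, sizeof(line) - 1);
    fprintf(stderr, "wrote line %ld\n", i);  // progress goes to stderr, not the pipe
  }
}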

Tinyik commented 3 months ago

Hi @tt4g , we are not passing the sink output to another program (just to another thread in a dylib), and I doubt a pipe write could cause a hang of up to several minutes. In addition, we use PTP for clock sync on the machine, which rules out the gap being caused by massive timestamp drift; to clarify, we noticed the hang through other out-of-band channels, not just from the timestamps.

I have not been able to attach to the program at the exact time the hang happens, since this is fairly hard to reproduce and the hang often resolves itself within a few minutes. I am going to try to reproduce it; perhaps upgrading the spdlog version may resolve the issue as well?

Thank you :)

tt4g commented 3 months ago

I hope the upgrade makes the problem stop reproducing, although it is extremely difficult to determine the cause of a problem that does not reproduce.

Tinyik commented 3 months ago

Hi @tt4g , as you suggested above I looked more into the possibility of a blocking pipe write.

During some runs we put the application in a screen session, and it turns out that if the screen session is put into scroll mode (Ctrl+a q or Ctrl+a [), the output pipe that the console sink is writing to blocks once the pipe buffer is full. This effectively blocks spdlog for an indefinite period of time. I was able to reproduce this in a toy example, sketched below.
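The toy example is essentially the following (a sketch; the real code differs):

#include <spdlog/spdlog.h>
#include <spdlog/sinks/stdout_color_sinks.h>

int main() {
  // Console sink only, same as our app when with_console is set.
  spdlog::set_default_logger(spdlog::stdout_color_mt("console"));
  // Run inside screen, enter scroll mode (Ctrl+a [), and wait: once the
  // terminal stops draining stdout, the buffer fills and the next
  // SPDLOG_INFO blocks inside write(2).
  for (long i = 0;; ++i) {
    SPDLOG_INFO("message {}", i);
  }
}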

I'm guessing we can avoid this problem by using the async logger, which will probably still block the background thread in this case but at least won't hang the main application thread.
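For reference, a minimal sketch of what I mean, using the spdlog v1.x async API (the queue size and overflow policy here are placeholders, not settled choices):

#include <spdlog/async.h>
#include <spdlog/spdlog.h>

// ... build `sinks` as in the snippet above ...
spdlog::init_thread_pool(8192, 1);  // queue slots, background worker threads
auto async_logger = std::make_shared<spdlog::async_logger>(
    name, sinks.begin(), sinks.end(), spdlog::thread_pool(),
    // `block` stalls callers when the queue is full; `overrun_oldest`
    // drops the oldest messages instead, so the app itself never hangs.
    spdlog::async_overflow_policy::overrun_oldest);
spdlog::set_default_logger(async_logger);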

Do you think this is in line with your expectation/suggestion?

tt4g commented 3 months ago

I am not familiar with the GNU screen implementation, but I suspect that the output buffer fills up and write(2) blocks. I found a very similar question: https://unix.stackexchange.com/questions/325588/prevent-gnu-screen-from-blocking-standard-out

Tinyik commented 3 months ago

Thanks! I assume this blocking behavior will happen with the async logger as well (blocking the background thread)?

tt4g commented 3 months ago

This problem reproduces with the async logger as well. This is because the Linux kernel blocks the write unless the STDOUT pipe buffer is consumed.

Tinyik commented 3 months ago

@tt4g Yes, that's what I would expect. Since this is not an issue with spdlog, I am closing this ticket. Thank you very much for the quick turnaround and all the help!! All the best.