gabime / spdlog

Fast C++ logging library.
Other
24.11k stars 4.5k forks source link

Dead-lock of async_logger on some architectures? #1879

Closed TheQuantumPhysicist closed 3 years ago

TheQuantumPhysicist commented 3 years ago

There seems to be a deadlock bug in spdlog that happens randomly on some architectures. On my workstation (Ryzen 3900x), it's consistently fine. But on my public root server (AMD Opteron Processor 3365), sometimes it blocks and sometimes it doesn't. I attached gdb to the running process and it seems like it's blocking at the thread-pool queuing function. The following is the stack trace when it's blocked:

#0  futex_wait (private=<optimized out>, expected=11, futex_word=0x55f0c6f99d98)
    at ../sysdeps/unix/sysv/linux/futex-internal.h:61
#1  futex_wait_simple (private=<optimized out>, expected=11, futex_word=0x55f0c6f99d98)
    at ../sysdeps/nptl/futex-internal.h:135
#2  __condvar_quiesce_and_switch_g1 (private=<optimized out>, g1index=<synthetic pointer>, wseq=<optimized out>,
    cond=0x55f0c6f99d88) at pthread_cond_common.c:416
#3  __pthread_cond_signal (cond=0x55f0c6f99d88) at pthread_cond_signal.c:78
#4  0x00007fc260cb63d9 in std::condition_variable::notify_one() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#5  0x000055f0c4606bda in spdlog::details::mpmc_blocking_queue<spdlog::details::async_msg>::enqueue (
    this=0x55f0c6f99d60, item=...)
    at /home/user/neblio/neblio/wallet/spdlog/include/spdlog/details/mpmc_blocking_q.h:39
#6  0x000055f0c45edd54 in spdlog::details::thread_pool::post_async_msg_ (this=0x55f0c6f99d60, new_msg=...,
    overflow_policy=spdlog::async_overflow_policy::block)
    at /home/user/neblio/neblio/wallet/spdlog/include/spdlog/details/thread_pool-inl.h:80
#7  0x000055f0c45edbfe in spdlog::details::thread_pool::post_log (this=0x55f0c6f99d60, worker_ptr=..., msg=...,
    overflow_policy=spdlog::async_overflow_policy::block)
    at /home/user/neblio/neblio/wallet/spdlog/include/spdlog/details/thread_pool-inl.h:58
#8  0x000055f0c45ee58e in spdlog::async_logger::sink_it_ (this=0x55f0c6f03840, msg=...)
    at /home/user/neblio/neblio/wallet/spdlog/include/spdlog/async_logger-inl.h:31
#9  0x000055f0c45ec995 in spdlog::logger::log_it_ (this=0x55f0c6f03840, log_msg=..., log_enabled=true,
    traceback_enabled=false) at /home/user/neblio/neblio/wallet/spdlog/include/spdlog/logger-inl.h:170
#10 0x000055f0c46b91db in spdlog::logger::log_<fmt::v7::basic_string_view<char>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > (this=0x55f0c6f03840, loc=..., lvl=spdlog::level::info, fmt=..., args#0=..., args#1=...)
    at /home/user/neblio/neblio/wallet/spdlog/include/spdlog/logger.h:342
#11 0x000055f0c468e828 in spdlog::logger::log<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > (this=0x55f0c6f03840,
    loc=..., lvl=spdlog::level::info, fmt=..., args#0=..., args#1=...)
    at /home/user/neblio/neblio/wallet/spdlog/include/spdlog/logger.h:87
#12 0x000055f0c46625b0 in spdlog::logger::log<char [7], std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > (
    this=0x55f0c6f03840, lvl=spdlog::level::info, fmt=..., args#0=..., args#1=...)
    at /home/user/neblio/neblio/wallet/spdlog/include/spdlog/logger.h:93
#13 0x000055f0c4632526 in DefaultLogger::write<char [7], std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > (
    this=0x55f0c58007a0 <LoggerSingleton::get()::logger>, severity=spdlog::level::info, fmt=..., args#0=...,
    args#1=...) at /home/user/neblio/neblio/wallet/logging/defaultlogger.h:101
#14 0x000055f0c490aade in LogSourceForwarder::write<char [12], std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > (this=0x7ffc55e41870, severity=spdlog::level::info, fmtStr=..., args#0=...)
    at /home/user/neblio/neblio/wallet/logging/defaultlogger.h:179
#15 0x000055f0c48f6a91 in BindListenPort (addrBind=..., strError="")
    at /home/user/neblio/neblio/wallet/net.cpp:1743
#16 0x000055f0c476a942 in Bind (addr=..., fError=false) at /home/user/neblio/neblio/wallet/init.cpp:300
#17 0x000055f0c4772c81 in AppInit2 () at /home/user/neblio/neblio/wallet/init.cpp:766
#18 0x000055f0c476a329 in AppInit (argc=3, argv=0x7ffc55e42c48) at /home/user/neblio/neblio/wallet/init.cpp:252
#19 0x000055f0c476a78f in main (argc=3, argv=0x7ffc55e42c48) at /home/user/neblio/neblio/wallet/init.cpp:271

The code that does the call (the abstraction layer of logging) is this (in case I'm doing something wrong, though the usage is very simplistic): https://github.com/NeblioTeam/neblio/blob/VIUTests/wallet/logging/defaultlogger.h

This is the branch where I'm calling this code. So you can see exactly what's happening.

If you need any additional info, please let me know.

TheQuantumPhysicist commented 3 years ago

I tried many things in mpmc_blocking_queue:

  1. Locking at the constructor
  2. Using a recursive_mutex instead of a mutex
  3. Removing that mingw32 thing that unlocks before notify

None of that helped. Let me also add that thread sanitizer was screaming the whole time with many different issues. I thought these are false positives, but seems like there's a big synchronization issue there.

gabime commented 3 years ago

How this can be reproduced ? I never encountered this before. Also, the link you provided points to master branch which is obsolete. Please use v1.x branch

TheQuantumPhysicist commented 3 years ago

I am using the latest version from releases, not master. I just used master to link things here. As you can see in the repo I linked of neblio (the program that uses spdlog), I included the latest spdlog release version and I'm using it header only.

How to produce this? It's really difficult, because it happens only in a certain environment (my public server, and we have a few customers who complained about this). I may try to isolate everything in a minimal program, but the problem is that I'm traveling in two days, so I may not be available in full capacity.

The good news is that logging is isolated in neblio into one file (DefaultLogger.h) + the initialization (InitLogging function).

So, how this happens in neblio, is that it first runs the InitLogging function, then immediately, the first logging call after that freezes everything. I use Debian Buster, 10.8, and this is the compiler information: gcc (Debian 8.3.0-6) 8.3.0.

One important thing to keep in mind is that thread sanitizer showed many issues even before I faced this, and I think they all should be tackled. Today was the first day I looked into spdlog's code... besides me leaving soon, I don't believe I have the experience to solve all these issues.

If you want to attempt to reproduce the issue, clone the neblio repo, build with cmake, then run nebliod. Check the log file produced. If it stopped after

[2021-03-16 20:09:30.532] [info] [init.cpp:194] [void InitLogging()]: Initialized logging successfully!

and nothing happens, then the issue has happened.

Again, in my workstation, it works fine. In my public server, this happens. Please let me know if you need more info.

gabime commented 3 years ago

One important thing to keep in mind is that thread sanitizer showed many issues even before I faced this,

It is a known issue that TSan produces false positives with cpp11 (e.g. https://github.com/gabime/spdlog/pull/789 and https://stackoverflow.com/questions/37552866/why-does-threadsanitizer-report-a-race-with-this-lock-free-example)

Regarding the problem, you could try to replace std::thread::hardware_concurrency() with 1 and see if it helps (long shot, but worth a try).

Otherwise, If there is no way to reproduce I will close this issue.