ClickHouse / ClickHouse

ClickHouse® is a real-time analytics DBMS
https://clickhouse.com
Apache License 2.0

Unexpectedly stuck queries after upgrade from 24.1.2.5 to 24.8.4.13 #69904

Open ppavlov39 opened 1 month ago

ppavlov39 commented 1 month ago

Describe the issue

After upgrading ClickHouse from 24.1.2.5 to 24.7.3.42, we began experiencing periodic partial failures of cluster nodes. After some time we decided to upgrade to 24.8.4.13; this did not help. We also disabled the new query analyzer (allow_experimental_analyzer: 0), since it caused many problems when executing previously stable queries.

Most often, the failure does not occur immediately. It is preceded by growth in several metrics - RWLockActiveReaders and BackgroundMergesAndMutationsPoolTask (no mutations are running at that time) - and in the number of parallel queries. Essentially, the failure happens when one of the cluster nodes hits the limit set by the max_concurrent_queries parameter, and new queries stop being executed.

The stuck queries seem to have nothing in common: they can be selects from system tables or user data tables, DDL queries, etc., and the tables they touch are different. The only remedy we have found is restarting the server process.

In the metrics it looks like this (yellow and orange lines):

RWLockActiveReaders: (screenshot)

BackgroundMergesAndMutationsPoolTask: (screenshot)
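
As a quick spot check while the problem is developing, the current values of these counters can be read from the standard system.metrics table (metric names as above; what counts as abnormal depends on your baseline):

```sql
-- Current values of the counters that grow before the failure.
SELECT metric, value
FROM system.metrics
WHERE metric IN ('RWLockActiveReaders', 'BackgroundMergesAndMutationsPoolTask', 'Query');
```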

ClickHouse runs on CentOS 7 with the elrepo kernel 5.4.224-1.el7.elrepo.x86_64.

How to reproduce

We can't reproduce it; it happens unexpectedly.

How can we find out the root cause of this behavior?

Algunenano commented 1 month ago

You can use CH's introspection functions to dump where the threads are stuck:

SET allow_introspection_functions = 1;
SELECT
    thread_name,
    thread_id,
    query_id,
    arrayStringConcat(`all`, '\n') AS res
FROM system.stack_trace
FORMAT Vertical;
ppavlov39 commented 1 month ago

Hello! Today we encountered the problem again. I tried two queries and got very strange results - the output was empty:

SELECT
    thread_name,
    thread_id,
    query_id,
    arrayStringConcat(`trace`, '\n') AS res -- In our version the table doesn't have an 'all' column, just 'trace'
FROM system.stack_trace
SETTINGS allow_introspection_functions=1
FORMAT Vertical;

The output of the query was as follows:

Query result:

```
Row 1:  thread_name: clickhouse-serv  thread_id: 9861  query_id:  res:
Row 2:  thread_name: clickhouse-serv  thread_id: 9862  query_id:  res:
Row 3:  thread_name: clickhouse-serv  thread_id: 9863  query_id:  res:
Row 4:  thread_name: clickhouse-serv  thread_id: 9864  query_id:  res:
Row 5:  thread_name: TraceCollector   thread_id: 9865  query_id:  res:
Row 6:  thread_name: clickhouse-serv  thread_id: 9868  query_id:  res:
Row 7:  thread_name: clickhouse-serv  thread_id: 9870  query_id:  res:
Row 8:  thread_name: BgSchPool        thread_id: 9871  query_id:  res:
Row 9:  thread_name: BgSchPool        thread_id: 9872  query_id:  res:
Row 10: thread_name: BgBufSchPool     thread_id: 9873  query_id:  res:
... and two thousand similar rows
```

The second one:

SELECT
    arrayStringConcat(arrayMap(x -> demangle(addressToSymbol(x)), trace), '\n') AS trace_functions,
    count()
FROM system.stack_trace
GROUP BY trace_functions
ORDER BY count() DESC
SETTINGS allow_introspection_functions = 1
FORMAT Vertical;

That query returned this:

Query result:

```
Row 1:
──────
trace_functions:
count(): 2008
```

After restarting the server, these queries returned the following results.

The first one's results:

```
Row 1:
──────
thread_name: clickhouse-serv
thread_id:   5312
query_id:
res: 276029826
140550260069936
140550260054582
232734779
231501570
366862054
231406736
366898471
231395046
132474574
140550256158037
103325742

Row 2:
──────
thread_name: clickhouse-serv
thread_id:   5313
query_id:
res: 276029826
140550260069936
140550260066142
230045464
230046816
232743752
366514333
140550260039333
140550257060621

Row 3:
──────
thread_name: clickhouse-serv
thread_id:   5316
query_id:
res:

Row 4:
──────
thread_name: clickhouse-serv
thread_id:   5317
query_id:
res:

... and so on
```
And the second one's results:

```
Row 1:
──────
trace_functions: DB::(anonymous namespace)::signalHandler(int, siginfo_t*, void*)

Poco::Net::SocketImpl::pollImpl(Poco::Timespan&, int)
Poco::Net::SocketImpl::poll(Poco::Timespan const&, int)
DB::ReadBufferFromPocoSocket::poll(unsigned long) const
DB::TCPHandler::runImpl()
DB::TCPHandler::run()
Poco::Net::TCPServerConnection::start()
Poco::Net::TCPServerDispatcher::run()
Poco::PooledThread::run()
Poco::ThreadImpl::runnableEntry(void*)
start_thread
__clone
count(): 834

Row 2:
──────
trace_functions: DB::(anonymous namespace)::signalHandler(int, siginfo_t*, void*)

pthread_cond_wait@@GLIBC_2.3.2
DB::BackgroundSchedulePool::threadFunction()
void std::__1::__function::__policy_invoker::__call_impl::ThreadFromGlobalPoolImpl>, void ThreadPoolImpl::scheduleImpl(std::__1::function, Priorit
start_thread
__clone
count(): 560

Row 3:
──────
trace_functions: DB::(anonymous namespace)::signalHandler(int, siginfo_t*, void*)

pthread_cond_wait@@GLIBC_2.3.2
ThreadPoolImpl>::worker(std::__1::__list_iterator, void*>)
void std::__1::__function::__policy_invoker::__call_impl::ThreadFromGlobalPoolImpl>::scheduleImpl(std
void* std::__1::__thread_proxy[abi:v15007]>, void ThreadPoolImpl::scheduleImpl(std::__1::function, Priorit
start_thread
__clone
count(): 249

... and so on
```

When I ran these queries during the issue, they took minutes (up to 10), whereas after a restart they took less than 10 seconds.

den-crane commented 1 month ago

Check sudo dmesg -T for CPU lockup messages. I am 90% sure it's a Linux kernel bug and you need to upgrade the kernel.
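
Kernel lockups, when present, are easy to spot because the watchdog prints distinctive markers. A minimal sketch of the filtering, using a made-up sample line (not from this host) to show what a soft-lockup report looks like:

```shell
# Hypothetical sample of a soft-lockup report as it would appear in `dmesg -T`.
sample='[Sat Sep 28 14:49:58 2024] watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [clickhouse-serv:9861]'

# Count lines matching the usual lockup/stall markers (prints 1 for the sample).
printf '%s\n' "$sample" | grep -cE 'soft lockup|hard LOCKUP|rcu .*stall'

# On a live host, the equivalent check would be:
#   sudo dmesg -T | grep -iE 'soft lockup|hard lockup|rcu.*stall|hung task'
```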

ppavlov39 commented 1 month ago

The output of the sudo dmesg -T command contains many lines that look like this: [Sat Sep 28 14:49:58 2024] audit: type=1300 audit(1727524293.135:20815699): arch=c000003e syscall=42 success=no exit=-115 a0=3 a1=7ffdc79d9230 a2=10 a3=7ffdc79d84a0 items=0 ppid=2095 pid=23543 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="<cmd>" exe="<cmd_path>" key=(null)

I've checked the count of audit messages:

$ sudo dmesg -T | grep audit | wc -l
2263

$ sudo dmesg -T | grep -v audit | wc -l
0

There are no messages related to the CPU.

ppavlov39 commented 1 month ago

And during today's issue I tried to check the stack_trace table again and saw the same output as before - it was empty, as described in the message above.

ppavlov39 commented 1 month ago

I tried to check the system calls with strace. The stuck thread gave the following output:

$ sudo strace -p 31507
strace: Process 31507 attached
futex(0x7f63697020f8, FUTEX_WAIT_PRIVATE, 2, NULL) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
--- SIGUSR1 {si_signo=SIGUSR1, si_code=SI_TIMER, si_timerid=0x2468, si_overrun=0, si_value={int=0, ptr=NULL}} ---
rt_sigreturn({mask=[QUIT PIPE TERM]})   = 202
futex(0x7f63697020f8, FUTEX_WAIT_PRIVATE, 2, NULL) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
--- SIGUSR1 {si_signo=SIGUSR1, si_code=SI_TIMER, si_timerid=0x2468, si_overrun=0, si_value={int=0, ptr=NULL}} ---
rt_sigreturn({mask=[QUIT PIPE TERM]})   = 202
... (the same futex / SIGUSR1 / rt_sigreturn cycle repeated many times) ...
futex(0x7f63697020f8, FUTEX_WAIT_PRIVATE, 2, NULL^Cstrace: Process 31507 detached
 <detached ...>

And another thread:

$ sudo strace -p 1374
strace: Process 1374 attached
futex(0x187b78f8, FUTEX_WAIT_PRIVATE, 2, NULL) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
--- SIGUSR1 {si_signo=SIGUSR1, si_code=SI_TIMER, si_timerid=0x8ca, si_overrun=0, si_value={int=0, ptr=NULL}} ---
rt_sigreturn({mask=[PIPE]})             = 202
futex(0x187b78f8, FUTEX_WAIT_PRIVATE, 2, NULL) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
--- SIGUSR1 {si_signo=SIGUSR1, si_code=SI_TIMER, si_timerid=0x8ca, si_overrun=0, si_value={int=0, ptr=NULL}} ---
rt_sigreturn({mask=[PIPE]})             = 202
... (the same futex / SIGUSR1 / rt_sigreturn cycle repeated many times) ...
futex(0x187b78f8, FUTEX_WAIT_PRIVATE, 2, NULL^Cstrace: Process 1374 detached
 <detached ...>

ppavlov39 commented 3 weeks ago

I recently captured stack traces of two stuck threads with this command: gdb -ex "set pagination 0" -ex "thread apply all bt" --batch -p 15182.

Maybe you can see something useful there that will help figure out the root cause of these problems.

The first one:

Using host libthread_db library "/lib64/libthread_db.so.1".
0x00007f72c894254d in __lll_lock_wait () from /lib64/libpthread.so.0

Thread 1 (process 15182):
#0  0x00007f72c894254d in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f72c893f1b2 in pthread_rwlock_rdlock () from /lib64/libpthread.so.0
#2  0x0000000018136fcb in libunwind::UnwindCursor<libunwind::LocalAddressSpace, libunwind::Registers_x86_64>::getInfoFromDwarfSection(unsigned long, libunwind::UnwindInfoSections const&, unsigned int) ()
#3  0x0000000018134c57 in libunwind::UnwindCursor<libunwind::LocalAddressSpace, libunwind::Registers_x86_64>::setInfoBasedOnIPRegister(bool) ()
#4  0x00000000181342b2 in unw_step ()
#5  0x000000000db2d579 in StackTrace::StackTrace(ucontext_t const&) ()
#6  0x000000000db5290d in DB::(anonymous namespace)::writeTraceInfo(DB::TraceType, int, siginfo_t*, void*) ()
#7  <signal handler called>
#8  0x00007f72c894254b in __lll_lock_wait () from /lib64/libpthread.so.0
#9  0x00007f72c893f1b2 in pthread_rwlock_rdlock () from /lib64/libpthread.so.0
#10 0x0000000018136fcb in libunwind::UnwindCursor<libunwind::LocalAddressSpace, libunwind::Registers_x86_64>::getInfoFromDwarfSection(unsigned long, libunwind::UnwindInfoSections const&, unsigned int) ()
#11 0x0000000018134c57 in libunwind::UnwindCursor<libunwind::LocalAddressSpace, libunwind::Registers_x86_64>::setInfoBasedOnIPRegister(bool) ()
#12 0x00000000181342b2 in unw_step ()
#13 0x000000000db2d66b in StackTrace::tryCapture() ()
#14 0x000000000db1c238 in MemoryTracker::allocImpl(long, bool, MemoryTracker*, double) ()
#15 0x000000000db1c75d in MemoryTracker::allocImpl(long, bool, MemoryTracker*, double) ()
#16 0x000000000dad2a67 in Allocator<false, false>::realloc(void*, unsigned long, unsigned long, unsigned long) ()
#17 0x0000000007e5f10f in void DB::PODArrayBase<8ul, 4096ul, Allocator<false, false>, 63ul, 64ul>::realloc<>(unsigned long) ()
#18 0x0000000007a06835 in DB::SerializationDecimalBase<DB::Decimal<long> >::deserializeBinaryBulk(DB::IColumn&, DB::ReadBuffer&, unsigned long, double) const ()
#19 0x0000000011176cd4 in DB::ISerialization::deserializeBinaryBulkWithMultipleStreams(COW<DB::IColumn>::immutable_ptr<DB::IColumn>&, unsigned long, DB::ISerialization::DeserializeBinaryBulkSettings&, std::__1::shared_ptr<DB::ISerialization::DeserializeBinaryBulkState>&, std::__1::unordered_map<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, COW<DB::IColumn>::immutable_ptr<DB::IColumn>, std::__1::hash<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::equal_to<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::pair<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const, COW<DB::IColumn>::immutable_ptr<DB::IColumn> > > >*) const ()
#20 0x00000000111ba93d in DB::SerializationNullable::deserializeBinaryBulkWithMultipleStreams(COW<DB::IColumn>::immutable_ptr<DB::IColumn>&, unsigned long, DB::ISerialization::DeserializeBinaryBulkSettings&, std::__1::shared_ptr<DB::ISerialization::DeserializeBinaryBulkState>&, std::__1::unordered_map<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, COW<DB::IColumn>::immutable_ptr<DB::IColumn>, std::__1::hash<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::equal_to<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::pair<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const, COW<DB::IColumn>::immutable_ptr<DB::IColumn> > > >*) const ()
#21 0x0000000012ffbd54 in DB::NativeReader::read() ()
#22 0x0000000012eaa9d9 in DB::Connection::receiveDataImpl(DB::NativeReader&) ()
#23 0x0000000012ea9971 in DB::Connection::receivePacket() ()
#24 0x0000000012ee4de8 in DB::PacketReceiver::Task::run(std::__1::function<void (int, Poco::Timespan, DB::AsyncEventTimeoutType, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, unsigned int)>, std::__1::function<void ()>) ()
#25 0x00000000110ce963 in void boost::context::detail::fiber_entry<boost::context::detail::fiber_record<boost::context::fiber, FiberStack&, Fiber::RoutineImpl<DB::AsyncTaskExecutor::Routine> > >(boost::context::detail::transfer_t) ()
#26 0x000000000693d8ef in make_fcontext ()
#27 0x0000000000000000 in ?? ()
[Inferior 1 (process 15182) detached]

And another thread:

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
0x00007f72c894254d in __lll_lock_wait () from /lib64/libpthread.so.0

Thread 1 (process 2182):
#0  0x00007f72c894254d in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f72c893f1b2 in pthread_rwlock_rdlock () from /lib64/libpthread.so.0
#2  0x0000000018136fcb in libunwind::UnwindCursor<libunwind::LocalAddressSpace, libunwind::Registers_x86_64>::getInfoFromDwarfSection(unsigned long, libunwind::UnwindInfoSections const&, unsigned int) ()
#3  0x0000000018134c57 in libunwind::UnwindCursor<libunwind::LocalAddressSpace, libunwind::Registers_x86_64>::setInfoBasedOnIPRegister(bool) ()
#4  0x00000000181342b2 in unw_step ()
#5  0x000000000db2d579 in StackTrace::StackTrace(ucontext_t const&) ()
#6  0x000000000db5290d in DB::(anonymous namespace)::writeTraceInfo(DB::TraceType, int, siginfo_t*, void*) ()
#7  <signal handler called>
#8  0x00000000111e6210 in DB::deserializeBinarySSE2<1> ()
#9  0x0000000011176cd4 in DB::ISerialization::deserializeBinaryBulkWithMultipleStreams(COW<DB::IColumn>::immutable_ptr<DB::IColumn>&, unsigned long, DB::ISerialization::DeserializeBinaryBulkSettings&, std::__1::shared_ptr<DB::ISerialization::DeserializeBinaryBulkState>&, std::__1::unordered_map<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, COW<DB::IColumn>::immutable_ptr<DB::IColumn>, std::__1::hash<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::equal_to<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::pair<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const, COW<DB::IColumn>::immutable_ptr<DB::IColumn> > > >*) const ()
#10 0x0000000012ffbd54 in DB::NativeReader::read() ()
#11 0x0000000012eaa9d9 in DB::Connection::receiveDataImpl(DB::NativeReader&) ()
#12 0x0000000012ea9971 in DB::Connection::receivePacket() ()
#13 0x0000000012ee4de8 in DB::PacketReceiver::Task::run(std::__1::function<void (int, Poco::Timespan, DB::AsyncEventTimeoutType, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, unsigned int)>, std::__1::function<void ()>) ()
#14 0x00000000110ce963 in void boost::context::detail::fiber_entry<boost::context::detail::fiber_record<boost::context::fiber, FiberStack&, Fiber::RoutineImpl<DB::AsyncTaskExecutor::Routine> > >(boost::context::detail::transfer_t) ()
#15 0x000000000693d8ef in make_fcontext ()
#16 0x0000000000000000 in ?? ()
[Inferior 1 (process 2182) detached]
Algunenano commented 3 weeks ago

The traces seem to be stuck in libunwind code while writing traces. Try deactivating the query profiler.
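
The SIGUSR1 SI_TIMER deliveries in the strace output are consistent with the query profiler's sampling timers. Assuming the standard settings are meant, a sketch of disabling them for all users via a users.d override (the file name and path are illustrative):

```xml
<!-- e.g. /etc/clickhouse-server/users.d/disable-query-profiler.xml (illustrative path) -->
<clickhouse>
    <profiles>
        <default>
            <!-- 0 disables the per-query sampling timers that deliver profiler signals -->
            <query_profiler_real_time_period_ns>0</query_profiler_real_time_period_ns>
            <query_profiler_cpu_time_period_ns>0</query_profiler_cpu_time_period_ns>
        </default>
    </profiles>
</clickhouse>
```

The same settings can also be tried per session with SET before changing server config.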

ppavlov39 commented 2 weeks ago

We disabled the trace log 2 weeks ago and the cluster has been running without any issues since then.