apache / doris

Apache Doris is an easy-to-use, high performance and unified analytics database.
https://doris.apache.org
Apache License 2.0
12.66k stars 3.27k forks source link

Compaction producer hang and none compaction consumer works #5308

Closed acelyc111 closed 3 years ago

acelyc111 commented 3 years ago

Describe the bug I found some BE's compactiom score metric not update anymore, I doubt the compaction not works, and use pstack to show all threads, there is no "compaction" words except the compaction producer thread. And then use gcore to dump the process, the producer's stack:

(gdb) t 260
[Switching to thread 260 (Thread 0x7f499f6fd700 (LWP 5470))]
#0  0x00007f4a21bd36d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
(gdb) bt
#0  0x00007f4a21bd36d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x000000000263a23c in __gthread_cond_wait (__mutex=<optimized out>, __cond=__cond@entry=0x4cf3e88) at /var/local/gcc/x86_64-pc-linux-gnu/libstdc++-v3/include/x86_64-pc-linux-gnu/bits/gthr-default.h:864
#2  std::condition_variable::wait (this=this@entry=0x4cf3e88, __lock=...) at ../../../.././libstdc++-v3/src/c++11/condition_variable.cc:53
#3  0x0000000000e61bf3 in wait<doris::CompactionPermitLimiter::request(int64_t)::<lambda()> > (__p=..., __lock=..., this=0x4cf3e88) at /usr/include/c++/7.3.0/condition_variable:99
#4  doris::CompactionPermitLimiter::request (this=this@entry=0x4cf3e50, permits=permits@entry=2792) at /root/doris-env/doris/be/src/olap/compaction_permit_limiter.cpp:30
#5  0x0000000000dd28be in doris::StorageEngine::_compaction_tasks_producer_callback (this=0x4cf3c00) at /root/doris-env/doris/be/src/olap/olap_server.cpp:354
#6  0x0000000000dd31ac in operator() (__closure=<optimized out>) at /root/doris-env/doris/be/src/olap/olap_server.cpp:93
#7  std::_Function_handler<void(), doris::StorageEngine::start_bg_threads()::<lambda()> >::_M_invoke(const std::_Any_data &) (__functor=...) at /usr/include/c++/7.3.0/bits/std_function.h:316
#8  0x0000000001117718 in operator() (this=0x6ce84a68) at /usr/include/c++/7.3.0/bits/std_function.h:706
#9  doris::Thread::supervise_thread (arg=0x6ce84a50) at /root/doris-env/doris/be/src/util/thread.cpp:304
#10 0x00007f4a21bcfdc5 in start_thread () from /lib64/libpthread.so.0
#11 0x00007f4a21edb73d in clone () from /lib64/libc.so.6
(gdb) f 4
#4  doris::CompactionPermitLimiter::request (this=this@entry=0x4cf3e50, permits=permits@entry=2792) at /root/doris-env/doris/be/src/olap/compaction_permit_limiter.cpp:30
30  /root/doris-env/doris/be/src/olap/compaction_permit_limiter.cpp: 没有那个文件或目录.
(gdb) p _used_permits
$1 = {_value = 0}
(gdb) p permits
$2 = 2792
(gdb) p config::total_permits_for_compaction_score
$3 = 2000
(gdb)

That means the wait condition is satisfied but the cv is still wait, so there must be some multi-threads sync problems.

To Reproduce Steps to reproduce the behavior: Very occasional to happen.

Expected behavior All threads works well.

wangbo commented 3 years ago

code version: master

bool CompactionPermitLimiter::request(int64_t permits) {
    DorisMetrics::instance()->compaction_waitting_permits->set_value(permits);
    if (permits > config::total_permits_for_compaction_score) {
        // when tablet's compaction score is larger than "config::total_permits_for_compaction_score",
        // it's necessary to do compaction for this tablet because this tablet will not get "permits"
        // anyway. otherwise, compaction task for this tablet will not be executed forever.
        std::unique_lock<std::mutex> lock(_permits_mutex);
        _permits_cv.wait(lock, [=] {
            return _used_permits == 0 ||
                   _used_permits + permits <= config::total_permits_for_compaction_score;
        });
    } else {
        if (_used_permits + permits > config::total_permits_for_compaction_score) {
            std::unique_lock<std::mutex> lock(_permits_mutex);
            _permits_cv.wait(lock, [=] {
                return _used_permits + permits <= config::total_permits_for_compaction_score;
            });
        }
    }
    _used_permits += permits;  // here _used_permits is updated out of lock, Does this have concurrency issues ?
    DorisMetrics::instance()->compaction_waitting_permits->set_value(0);
    DorisMetrics::instance()->compaction_used_permits->set_value(_used_permits);
    return true;
}

I have a question as in the comment above.

weizuo93 commented 3 years ago

code version: master

bool CompactionPermitLimiter::request(int64_t permits) {
    DorisMetrics::instance()->compaction_waitting_permits->set_value(permits);
    if (permits > config::total_permits_for_compaction_score) {
        // when tablet's compaction score is larger than "config::total_permits_for_compaction_score",
        // it's necessary to do compaction for this tablet because this tablet will not get "permits"
        // anyway. otherwise, compaction task for this tablet will not be executed forever.
        std::unique_lock<std::mutex> lock(_permits_mutex);
        _permits_cv.wait(lock, [=] {
            return _used_permits == 0 ||
                   _used_permits + permits <= config::total_permits_for_compaction_score;
        });
    } else {
        if (_used_permits + permits > config::total_permits_for_compaction_score) {
            std::unique_lock<std::mutex> lock(_permits_mutex);
            _permits_cv.wait(lock, [=] {
                return _used_permits + permits <= config::total_permits_for_compaction_score;
            });
        }
    }
    _used_permits += permits;  // here _used_permits is updated out of lock, Does this have concurrency issues ?
    DorisMetrics::instance()->compaction_waitting_permits->set_value(0);
    DorisMetrics::instance()->compaction_used_permits->set_value(_used_permits);
    return true;
}

I have a question as in the comment above.

@wangbo _used_permits is defined as a type of AtomicInt64.

wangbo commented 3 years ago

code version: master

bool CompactionPermitLimiter::request(int64_t permits) {
    DorisMetrics::instance()->compaction_waitting_permits->set_value(permits);
    if (permits > config::total_permits_for_compaction_score) {
        // when tablet's compaction score is larger than "config::total_permits_for_compaction_score",
        // it's necessary to do compaction for this tablet because this tablet will not get "permits"
        // anyway. otherwise, compaction task for this tablet will not be executed forever.
        std::unique_lock<std::mutex> lock(_permits_mutex);
        _permits_cv.wait(lock, [=] {
            return _used_permits == 0 ||
                   _used_permits + permits <= config::total_permits_for_compaction_score;
        });
    } else {
        if (_used_permits + permits > config::total_permits_for_compaction_score) {
            std::unique_lock<std::mutex> lock(_permits_mutex);
            _permits_cv.wait(lock, [=] {
                return _used_permits + permits <= config::total_permits_for_compaction_score;
            });
        }
    }
    _used_permits += permits;  // here _used_permits is updated out of lock, Does this have concurrency issues ?
    DorisMetrics::instance()->compaction_waitting_permits->set_value(0);
    DorisMetrics::instance()->compaction_used_permits->set_value(_used_permits);
    return true;
}

I have a question as in the comment above.

@wangbo _used_permits is defined as a type of AtomicInt64.

I understand.