acl-dev / acl

C/C++ server and network library, including coroutine, redis client, http/https/websocket, mqtt, and mysql/postgresql/sqlite clients, for Linux, Android, iOS, MacOS, Windows, etc.
https://acl-dev.cn
GNU Lesser General Public License v3.0

redis pipeline occasionally crashes; suspect that mbox wait does not guarantee strict synchronization #293

Closed lymslive closed 1 year ago

lymslive commented 1 year ago

I have been running a pipeline stress test recently, and it frequently crashes in the redis_pipeline_channel thread:

#0  acl_atomic_set (self=0x0, value=0xffffffff00000000) at src/stdlib/acl_atomic.c:103
#1  0x00007f02ec1b6b32 in acl_ypipe_flush (self=0x7f02c80cdab0) at src/stdlib/common/acl_ypipe.c:74
#2  0x00007f02ec17c422 in acl_mbox_send (mbox=0x7f02c80cda70, msg=msg@entry=0x7f02c80cd9d0) at src/stdlib/acl_mbox.c:118
#3  0x00007f02ec0e5409 in acl::mbox_send (mbox=<optimized out>, o=o@entry=0x7f02c80cd9d0) at src/stdlib/mbox.cpp:23
#4  0x00007f02ec10d0c2 in acl::mbox<acl::redis_pipeline_message>::push (dummy=false, t=0x7f02c80cd9d0, this=<optimized out>) at ./include/acl_cpp/stdlib/mbox.hpp:82
#5  acl::redis_pipeline_message::push (result=<optimized out>, this=<optimized out>) at ./include/acl_cpp/redis/redis_client_pipeline.hpp:135
#6  acl::redis_pipeline_channel::wait_one (this=this@entry=0x7f02e00399e0, conn=..., msg=...) at src/redis/redis_client_pipeline.cpp:174
#7  0x00007f02ec10d47f in acl::redis_pipeline_channel::wait_results (this=0x7f02e00399e0) at src/redis/redis_client_pipeline.cpp:193
#8  0x00007f02ec10d5d0 in acl::redis_pipeline_channel::handle_messages (this=this@entry=0x7f02e00399e0) at src/redis/redis_client_pipeline.cpp:223
#9  0x00007f02ec10da62 in acl::redis_pipeline_channel::run (this=0x7f02e00399e0) at src/redis/redis_client_pipeline.cpp:271
#10 0x00007f02ec144954 in acl::thread::thread_run (arg=0x7f02e00399e0) at src/stdlib/thread.cpp:102
#11 0x00007f02ee1beea5 in start_thread () from /lib64/libpthread.so.0
#12 0x00007f02e9ac3b0d in clone () from /lib64/libc.so.6

Running it under valgrind suggests that the redis_pipeline_message may be freed prematurely:

```
==540== Invalid read of size 8
==540==    at 0x8E80AF1: acl_ypipe_flush (acl_ypipe.c:73)
==540==    by 0x8E46411: acl_mbox_send (acl_mbox.c:118)
==540==    by 0x8DAF408: acl::mbox_send(void, void) (mbox.cpp:23)
==540==    by 0x8DD70C1: push (mbox.hpp:82)
==540==    by 0x8DD70C1: push (redis_client_pipeline.hpp:135)
==540==    by 0x8DD70C1: acl::redis_pipeline_channel::wait_one(acl::socket_stream&, acl::redis_pipeline_message&) (redis_client_pipeline.cpp:174)
==540==    by 0x8DD747E: acl::redis_pipeline_channel::wait_results() (redis_client_pipeline.cpp:193)
==540==    by 0x8DD75CF: acl::redis_pipeline_channel::handle_messages() (redis_client_pipeline.cpp:223)
==540==    by 0x8DD7A61: acl::redis_pipeline_channel::run() (redis_client_pipeline.cpp:271)
==540==    by 0x8E0E943: acl::thread::thread_run(void) (thread.cpp:102)
==540==    by 0x6E10EA4: start_thread (in /usr/lib64/libpthread-2.17.so)
==540==    by 0xB547B0C: clone (in /usr/lib64/libc-2.17.so)
==540==  Address 0x21060a68 is 24 bytes inside a block of size 72 free'd
==540==    at 0x4C2ACBD: free (vg_replace_malloc.c:530)
==540==    by 0x8E455CE: acl_free_glue (acl_malloc_glue.c:298)
==540==    by 0x8E462EA: acl_mbox_free (acl_mbox.c:100)
==540==    by 0x8DDBCBB: ~mbox (mbox.hpp:70)
==540==    by 0x8DDBCBB: ~redis_pipeline_message (redis_client_pipeline.hpp:65)
==540==    by 0x8DDBCBB: acl::redis_command::~redis_command() (redis_command.cpp:110)
==540==    by 0x7E02596: acl::redis::~redis() (redis.hpp:75)
==540==    by 0x7E02641: acl::redis::~redis() (redis.hpp:75)
==540==    by 0x7E1B250: RedisPipelineCmd::~RedisPipelineCmd() (RedisPipeline.cpp:113)
==540==    by 0x4F00DD2: CarStatusRedisWriter::DealSaveRedisDataList(bool) (CarStatusRedisWriter.cpp:422)
==540==    by 0x4F00505: CarStatusRedisWriter::SaveRedisThreadFun(void) (CarStatusRedisWriter.cpp:285)
==540==    by 0x9A3CBC3: UtilThread::ThreadFun(void*) (UtilThread.cpp:229)
==540==    by 0x6E10EA4: start_thread (in /usr/lib64/libpthread-2.17.so)
==540==    by 0xB547B0C: clone (in /usr/lib64/libc-2.17.so)
==540==  Block was alloc'd at
==540==    at 0x4C29BC3: malloc (vg_replace_malloc.c:299)
==540==    by 0x8E343DA: acl_default_malloc (acl_default_malloc.c:230)
==540==    by 0x8E344FE: acl_default_calloc (acl_default_malloc.c:253)
==540==    by 0x8E80A10: acl_ypipe_new (acl_ypipe.c:25)
==540==    by 0x8E46244: acl_mbox_create2 (acl_mbox.c:77)
==540==    by 0x8DDC888: mbox (mbox.hpp:64)
==540==    by 0x8DDC888: redis_pipeline_message (redis_client_pipeline.hpp:51)
==540==    by 0x8DDC888: acl::redis_command::get_pipeline_message() (redis_command.cpp:234)
==540==    by 0x8DDD88D: acl::redis_command::build_request(unsigned long, char const*, unsigned long) (redis_command.cpp:965)
==540==    by 0x7E1B5BA: RedisPipelineCmd::BuildRequest(std::vector<std::string, std::allocator >&) (RedisPipeline.cpp:154)
==540==    by 0x7E1B151: RedisPipeline::PushCommand(std::vector<std::string, std::allocator >&) (RedisPipeline.cpp:91)
==540==    by 0x4F00BF3: CarStatusRedisWriter::DealSaveRedisDataList(bool) (CarStatusRedisWriter.cpp:403)
==540==    by 0x4F00505: CarStatusRedisWriter::SaveRedisThreadFun(void) (CarStatusRedisWriter.cpp:285)
==540==    by 0x9A3CBC3: UtilThread::ThreadFun(void) (UtilThread.cpp:229)
```

Most of the stack frames are inside the acl library. The test program I wrote on top of it is itself a bit complex, so it is not convenient to post all of it for now, but it generally follows the sample acl-3.5.3-15/lib_acl_cpp/samples/redis/redis_pipeline2/redis_pipeline.cpp. In the normal flow, the program deletes the redis object (and the redis_pipeline_message it contains) only after calling redis_pipeline_message::wait:

// current implementation of redis_pipeline_message::wait
    const redis_result* wait(void) {
        if (mbox_) {
            mbox_->pop();
        } else {
            tbox_->pop();
        }
        return result_;
    }

From my reading of the source, the mbox pop here should block synchronously, waiting for the redis result to come back and push to be called, right? The strange thing is that the crash above occasionally happens: wait returns and the pipeline message is destructed, yet a channel thread still tries to push a result into the already-destructed message, which causes the crash.

I have not yet dug into the mbox implementation, so I would like to ask: do its push and pop guarantee strict synchronization, or could there be a subtle bug under multithreading?
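To make the expectation concrete, here is a minimal model in standard C++ of the semantics I am assuming mbox provides (this is not the acl API; the box type and all names below are mine): pop() only returns after the matching push() has fully published the result, so destroying the container right after pop()/wait() returns would be safe.

```cpp
// Minimal model of the push/pop ordering I am assuming -- standard C++ only,
// not the acl mbox API; the type and names are purely illustrative.
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <thread>

struct one_shot_box {
    std::mutex              lock;
    std::condition_variable cond;
    const char*             result = nullptr;   // stands in for redis_result*

    // producer side: roughly what I expect the channel thread's push to do
    void push(const char* r) {
        std::lock_guard<std::mutex> g(lock);
        result = r;
        cond.notify_one();
    }

    // consumer side: roughly what I expect wait()/pop() to do -- it must not
    // return before push() has fully published the result
    const char* pop() {
        std::unique_lock<std::mutex> g(lock);
        cond.wait(g, [this] { return result != nullptr; });
        return result;
    }
};

int main() {
    one_shot_box* box = new one_shot_box;
    std::thread channel([box] { box->push("+OK"); });

    std::printf("%s\n", box->pop());

    channel.join();   // in my real code this "join" is implicit: I assume the
    delete box;       // channel thread is completely done with the box by now
    return 0;
}
```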

Test environment: CentOS (in a container), gcc 7.

I also found a small bug in the source that is probably unrelated to this crash, in redis_pipeline_message::set_request (line 110):

        argc_ = argc;
        for (size_t i = 0; i < size_; i++) {  // <-- here 
            argv_[i] = argv[i];
            lens_[i] = lens[i];
        }

The loop should iterate with argc_ rather than size_; otherwise, when the argument argc is smaller than size_, there is an out-of-bounds read (observed in the valgrind report)... although it does not affect building the request or executing the correct redis command afterwards.
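For clarity, the corrected loop would presumably look like this (same field names as in the snippet above):

        argc_ = argc;
        for (size_t i = 0; i < argc_; i++) {  // iterate over argc_, not size_
            argv_[i] = argv[i];
            lens_[i] = lens[i];
        }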

zhengshuxin commented 1 year ago

The redis_pipeline_message object is shared across threads through the mbox. If the wait call in redis_command returns early and the pipeline channel pushes the result afterwards, the msg object may indeed have been freed prematurely. As a next step I plan to add an atomic reference count (similar to std::shared_ptr) to avoid this situation. As for why wait returns early, it may be caused by a communication error with the redis server being hit at line 174 of redis_client_pipeline.cpp. Also, the loop variable in the for loop you mentioned is indeed wrong, thanks for pointing it out.
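For what it is worth, here is a minimal sketch of the kind of atomic reference count being described (purely illustrative; the class, members, and usage below are hypothetical, not the actual change in the library):

```cpp
// Illustrative sketch only: an intrusive atomic reference count so that
// neither the waiting thread nor the channel thread frees the message while
// the other side may still touch it. All names here are hypothetical.
#include <atomic>
#include <thread>

class refcounted_message {
public:
    refcounted_message() : refs_(1) {}   // the creator starts with one reference

    void addref() { refs_.fetch_add(1, std::memory_order_relaxed); }

    void release() {
        // acquire/release ordering makes the writes done by other holders
        // visible to whichever thread performs the final delete
        if (refs_.fetch_sub(1, std::memory_order_acq_rel) == 1) {
            delete this;
        }
    }

    const char* result = nullptr;        // stands in for the redis result

private:
    ~refcounted_message() {}             // destruction only through release()
    std::atomic<long> refs_;
};

int main() {
    refcounted_message* msg = new refcounted_message;
    msg->addref();                       // the channel thread takes its own reference

    std::thread channel([msg] {
        msg->result = "+OK";             // deliver the result
        msg->release();                  // channel side is done with the message
    });

    msg->release();                      // waiting side is done with the message
    channel.join();                      // whichever release() ran last freed it
    return 0;
}
```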

lymslive commented 1 year ago

I feel that line 174 of redis_client_pipeline.cpp is not the cause but the consequence. Calling wait_one again on an already-freed redis_command object means the data is completely wrong, which is why an unknown error is returned. In the normal case, even if the server returns an unknown error, line 174 still pushes a result; only after that can the mbox pop return, that is, let the msg's wait return. And once the result has been pushed, that msg is also cleaned out of the corresponding channel thread and will not be pushed a second time. So I still suspect that the mbox may pop and return without a push having happened, which is what leads the channel thread to push into an already-freed object.

Another question: how do I enable the acl library's internal logging? I tried calling acl::log::stdout_open(true); first, but I still did not see the logger_error on the previous line, redis_client_pipeline.cpp:173.

zhengshuxin commented 1 year ago

I analyzed the underlying C implementation of mbox again. In acl_ypipe_flush() in acl_ypipe.c, between line 73 succeeding and line 74 being executed, if the consumer has already received the message and released the mbox object early, then the assignment at line 74 may be an illegal access.
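Roughly speaking, the pattern is the following (illustrative sketch only, not the actual acl_ypipe code): the producer first makes the message visible to the consumer and only afterwards updates the queue's own bookkeeping, so a consumer that unblocks on step (1) and immediately frees the queue turns step (2) into a write to freed memory.

```cpp
// Illustrative only (not the acl_ypipe implementation): a lock-free hand-off
// where the producer publishes the message (1) and then performs one more
// write to the shared queue state (2). The consumer may unblock as soon as
// (1) is visible; if it freed the queue at that moment, (2) would write into
// freed memory. Joining the producer before the delete is what keeps this
// sketch safe.
#include <atomic>
#include <cstdio>
#include <thread>

struct queue_state {
    std::atomic<const char*> slot{nullptr};   // (1) the message hand-off
    std::atomic<int>         flushed{0};      // (2) trailing bookkeeping write
};

int main() {
    queue_state* q = new queue_state;

    std::thread producer([q] {
        q->slot.store("+OK", std::memory_order_release);   // (1) consumer can see it now
        q->flushed.store(1, std::memory_order_release);     // (2) still touches *q
    });

    const char* msg;
    while ((msg = q->slot.load(std::memory_order_acquire)) == nullptr) {
        std::this_thread::yield();            // consumer unblocks right after (1)
    }
    std::printf("%s\n", msg);

    producer.join();   // without this, deleting q here could race with (2)
    delete q;
    return 0;
}
```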

zhengshuxin commented 1 year ago

I have added a reference count to redis_pipeline_message; the object is released only when the reference count drops to 0, so sharing the object between the producer and consumer threads is now safe. You can stress-test it again with your test case.

lymslive commented 1 year ago

I updated to your fix, and the previous crash no longer reproduces. Thanks!