Tencent / TBase

TBase is an enterprise-level distributed HTAP database. It provides highly consistent distributed database services and high-performance data warehouse services through a single database cluster, forming an integrated enterprise-level solution.

OLAP leading to SIGSEGV-based crashes #149

Open Dan-RAI opened 4 months ago

Dan-RAI commented 4 months ago

With set prefer_olap = 'on', we observe process crashes when running TPC-H benchmark queries (for instance Q2), already at scale factor 10, in parallel with more than 10 clients on a single coordinator. The time until a crash occurs drops sharply as the number of clients grows: with more than 200 clients, crashes occur after only a few seconds. (If useful, we can provide scripts to reproduce this issue.)

It seems that memory gets corrupted. During a crash, the first element of the memory freelist always points to an inaccessible region (here 0x10):

freelist = {0x0, 
    0x10, 0x0, 0x0, 0x0, 0x7fca55abbfd0, 0x0, 0x0, 0x0, 0x23fae98, 0x0}

This results in a SIGSEGV during memory allocation.
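To illustrate why a bad freelist head crashes the allocator, here is a minimal sketch of a size-class freelist. It is a simplified model, not TBase's actual aset.c: the names ToyAllocSet, free_index, head_looks_corrupt and toy_alloc are hypothetical, and the constants (11 freelists, 8-byte minimum chunk) are taken from stock PostgreSQL on the assumption that TBase keeps them. Under that assumption, a 16-byte request (as in frame #0) maps to freelist index 1, which is the slot holding 0x10 in the dump above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed constants, mirroring stock PostgreSQL's allocation sets. */
#define NUM_FREELISTS 11
#define MIN_BITS 3              /* smallest size class is 8 bytes */

/* Toy free chunk: only the link to the next free chunk of the same class. */
typedef struct Chunk {
    struct Chunk *next;
} Chunk;

typedef struct {
    Chunk *freelist[NUM_FREELISTS];
} ToyAllocSet;

/* Map a request size to its size class: <=8 -> 0, <=16 -> 1, <=32 -> 2, ... */
static int free_index(size_t size)
{
    int idx = 0;
    size_t cap = (size_t) 1 << MIN_BITS;

    while (cap < size && idx < NUM_FREELISTS - 1) {
        cap <<= 1;
        idx++;
    }
    return idx;
}

/* Hypothetical diagnostic: a non-NULL head below the first page (4096)
 * cannot be a valid heap chunk. It only inspects the pointer value. */
static int head_looks_corrupt(const ToyAllocSet *set, size_t size)
{
    uintptr_t head = (uintptr_t) set->freelist[free_index(size)];

    return head != 0 && head < 4096;
}

/* Pop the head of the matching freelist. Reading chunk->next is the
 * dereference that faults when the head is a bogus value such as 0x10. */
static void *toy_alloc(ToyAllocSet *set, size_t size)
{
    int idx = free_index(size);
    Chunk *chunk;

    assert(idx >= 0 && idx < NUM_FREELISTS);
    chunk = set->freelist[idx];
    if (chunk == NULL)
        return NULL;            /* a real allocator would carve a new chunk */
    set->freelist[idx] = chunk->next;
    return chunk;
}
```

In this model, the bad value has to be written earlier by something else (for example a write to an already freed chunk); the allocation that crashes is only the first code to follow the damaged pointer. That would be consistent with the crash surfacing in an unrelated palloc call, but the sketch does not identify the actual cause.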

Stack trace:

#0  AllocSetAlloc (context=0x238ef18, size=16) at aset.c:707
#1  0x0000000000990f78 in palloc (size=size@entry=16) at mcxt.c:935
#2  0x0000000000724bb4 in new_list (type=type@entry=T_IntList) at list.c:68
#3  0x0000000000724d45 in lappend_int (list=list@entry=0x0, datum=4) at list.c:151
#4  0x0000000000677d56 in ExecInitQual (qual=<optimized out>, parent=parent@entry=0x24d0378) at execExpr.c:206
#5  0x000000000069d432 in ExecInitIndexScan (node=node@entry=0x24151a0, estate=estate@entry=0x23f65a0, eflags=eflags@entry=1) at nodeIndexscan.c:931
#6  0x0000000000684f76 in ExecInitNode (node=0x24151a0, estate=estate@entry=0x23f65a0, eflags=1) at execProcnode.c:225
#7  0x00000000006a6418 in ExecInitNestLoop (node=node@entry=0x2414620, estate=estate@entry=0x23f65a0, eflags=<optimized out>, eflags@entry=1)
    at nodeNestloop.c:338
#8  0x00000000006850aa in ExecInitNode (node=0x2414620, estate=estate@entry=0x23f65a0, eflags=eflags@entry=1) at execProcnode.c:298
#9  0x00000000006a63f6 in ExecInitNestLoop (node=node@entry=0x2414190, estate=estate@entry=0x23f65a0, eflags=eflags@entry=1) at nodeNestloop.c:333
#10 0x00000000006850aa in ExecInitNode (node=0x2414190, estate=estate@entry=0x23f65a0, eflags=eflags@entry=1) at execProcnode.c:298
#11 0x00000000006a63f6 in ExecInitNestLoop (node=node@entry=0x24132d8, estate=estate@entry=0x23f65a0, eflags=eflags@entry=1) at nodeNestloop.c:333
#12 0x00000000006850aa in ExecInitNode (node=0x24132d8, estate=estate@entry=0x23f65a0, eflags=eflags@entry=1) at execProcnode.c:298
#13 0x000000000069116b in ExecInitAgg (node=node@entry=0x24131c0, estate=estate@entry=0x23f65a0, eflags=eflags@entry=1) at nodeAgg.c:3911
#14 0x000000000068512e in ExecInitNode (node=0x24131c0, estate=estate@entry=0x23f65a0, eflags=eflags@entry=1) at execProcnode.c:331
#15 0x00000000006df01a in ExecShutdownRemoteSubplan (node=node@entry=0x23f71d0) at execRemote.c:11373
#16 0x0000000000684e11 in ExecShutdownNode (node=0x23f71d0) at execProcnode.c:873
#17 0x00000000007247cf in planstate_tree_walker (planstate=planstate@entry=0x23f6bc8, walker=walker@entry=0x684d9d <ExecShutdownNode>, 
    context=context@entry=0x0) at nodeFuncs.c:3784
#18 0x0000000000684dc5 in ExecShutdownNode (node=0x23f6bc8) at execProcnode.c:856
#19 0x00000000007205b6 in planstate_walk_subplans (plans=<optimized out>, walker=walker@entry=0x684d9d <ExecShutdownNode>, context=context@entry=0x0)
    at nodeFuncs.c:3864
#20 0x0000000000724837 in planstate_tree_walker (planstate=planstate@entry=0x245f370, walker=walker@entry=0x684d9d <ExecShutdownNode>, 
    context=context@entry=0x0) at nodeFuncs.c:3844
#21 0x0000000000684dc5 in ExecShutdownNode (node=0x245f370) at execProcnode.c:856
#22 0x00000000007247cf in planstate_tree_walker (planstate=planstate@entry=0x245eed8, walker=walker@entry=0x684d9d <ExecShutdownNode>, 
    context=context@entry=0x0) at nodeFuncs.c:3784
#23 0x0000000000684dc5 in ExecShutdownNode (node=node@entry=0x245eed8) at execProcnode.c:856
#24 0x000000000067ee42 in ExecutePlan (estate=estate@entry=0x23f65a0, planstate=0x245eed8, use_parallel_mode=<optimized out>, 
    operation=operation@entry=CMD_SELECT, sendTuples=sendTuples@entry=1 '\001', numberTuples=numberTuples@entry=0, direction=ForwardScanDirection, 
    dest=0x22c3948, execute_once=1 '\001') at execMain.c:2063
#25 0x000000000067f0a9 in standard_ExecutorRun (queryDesc=0x2313a50, direction=ForwardScanDirection, count=0, execute_once=<optimized out>) at execMain.c:466
#26 0x000000000067f163 in ExecutorRun (queryDesc=queryDesc@entry=0x2313a50, direction=direction@entry=ForwardScanDirection, count=count@entry=0, 
    execute_once=<optimized out>) at execMain.c:409
#27 0x0000000000861ed9 in PortalRunSelect (portal=portal@entry=0x2260510, forward=forward@entry=1 '\001', count=0, count@entry=9223372036854775807, 
    dest=dest@entry=0x22c3948) at pquery.c:1722
#28 0x000000000086438a in PortalRun (portal=portal@entry=0x2260510, count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=1 '\001', 
    run_once=<optimized out>, dest=dest@entry=0x22c3948, altdest=altdest@entry=0x22c3948, completionTag=0x7ffe6a2b6b50 "") at pquery.c:1362
#29 0x000000000085fb15 in exec_execute_message (portal_name=portal_name@entry=0x22c3530 "p_1_1dfd6c_2_79f38aea", max_rows=9223372036854775807, 
    max_rows@entry=0) at postgres.c:3065
#30 0x0000000000860c65 in PostgresMain (argc=<optimized out>, argv=argv@entry=0x20853d0, dbname=<optimized out>, username=<optimized out>) at postgres.c:5645
#31 0x00000000007d3a48 in BackendRun (port=port@entry=0x20fb6b0) at postmaster.c:5034
#32 0x00000000007d5b3f in BackendStartup (port=port@entry=0x20fb6b0) at postmaster.c:4706
#33 0x00000000007d5d41 in ServerLoop () at postmaster.c:1963
#34 0x00000000007d7058 in PostmasterMain (argc=argc@entry=5, argv=argv@entry=0x20835a0) at postmaster.c:1571
#35 0x000000000072052f in main (argc=5, argv=0x20835a0) at main.c:233

The database itself throws error messages like this:

ERROR:  Failed to receive more data from data node 16394
WARNING:  combiner is not prepared for instrumentation
WARNING:  pgxc_abort_connections dn node:dn6 invalid socket 4294967295!
ERROR:  node:dn2, backend_pid:4190542, nodename:dn1,backend_pid:3367739,message:Failed to receive more data from data node 16394
ERROR:  Failed to receive more data from data node 16394
WARNING:  pgxc_abort_connections dn node:dn6 invalid socket 4294967295!