Tencent / TBase

TBase is an enterprise-level distributed HTAP database. It provides highly consistent distributed database services and high-performance data warehouse services through a single database cluster, forming an integrated enterprise-level solution.

Memory corruption during CPU intensive work #93

Open yazun opened 3 years ago

yazun commented 3 years ago

After roughly 10 hours of fairly intensive, mostly in-memory data crunching (50-60% CPU load, near-zero IO load), we see a crash and a core dump as below:

Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `postgres: dr3_ops_cs36 surveys 192.168.168.154(34674) REMOTE SUBPLAN (coord4:1'.
Program terminated with signal 11, Segmentation fault.
#0  0x00000000008aff06 in CopyDataRowTupleToSlot (slot=slot@entry=0x1bcdda0, combiner=<optimized out>) at execRemote.c:1843
1843    execRemote.c: No such file or directory.
Missing separate debuginfos, use: debuginfo-install cyrus-sasl-lib-2.1.26-23.el7.x86_64 glibc-2.17-307.el7.1.x86_64 keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.15.1-46.el7.x86_64 libcom_err-1.42.9-17.el7.x86_64 libselinux-2.5-15.el7.x86_64 libxml2-2.9.1-6.el7.4.x86_64 nspr-4.21.0-1.el7.x86_64 nss-3.44.0-7.el7_7.x86_64 nss-softokn-freebl-3.44.0-8.el7_7.x86_64 nss-util-3.44.0-4.el7_7.x86_64 openldap-2.4.44-21.el7_6.x86_64 openssl-libs-1.0.2k-19.el7.x86_64 pcre-8.32-17.el7.x86_64 xz-libs-5.2.2-1.el7.x86_64 zlib-1.2.7-18.el7.x86_64
(gdb) bt
#0  0x00000000008aff06 in CopyDataRowTupleToSlot (slot=slot@entry=0x1bcdda0, combiner=<optimized out>) at execRemote.c:1843
#1  0x00000000008b3e72 in FetchTuple (combiner=combiner@entry=0x1bcd7e0) at execRemote.c:2144
#2  0x00000000008bd728 in ExecRemoteSubplan (pstate=0x1bcd7e0) at execRemote.c:10744
#3  0x0000000000904acc in ExecProcNode (node=0x1bcd7e0) at ../../../src/include/executor/executor.h:273
#4  fetch_input_tuple (aggstate=aggstate@entry=0x1bcd018) at nodeAgg.c:725
#5  0x000000000091354d in agg_retrieve_direct (aggstate=<optimized out>) at nodeAgg.c:3312
#6  ExecAgg (pstate=<optimized out>) at nodeAgg.c:3022
#7  0x0000000000906672 in ExecProcNode (node=0x1bcd018) at ../../../src/include/executor/executor.h:273
#8  ExecMaterial (pstate=0x1bccca8) at nodeMaterial.c:134
#9  0x000000000091cd7c in ExecProcNode (node=0x1bccca8) at ../../../src/include/executor/executor.h:273
#10 ExecNestLoop (pstate=0x1bbb020) at nodeNestloop.c:170
#11 0x00000000009480df in ExecProcNode (node=0x1bbb020) at ../../../src/include/executor/executor.h:273
#12 ExecutePlan (execute_once=<optimized out>, dest=0x1898f18, direction=<optimized out>, numberTuples=0, sendTuples=<optimized out>, operation=CMD_SELECT, use_parallel_mode=<optimized out>, planstate=0x1bbb020, estate=0x1bb9c08) at execMain.c:1955
#13 standard_ExecutorRun (queryDesc=0x19e6c18, direction=<optimized out>, count=0, execute_once=<optimized out>) at execMain.c:465
#14 0x00000000006d034e in AdvanceProducingPortal (portal=portal@entry=0x19e3398, can_wait=can_wait@entry=0 '\000') at pquery.c:2592
#15 0x00000000006d2f27 in PortalRun (portal=0x19e3398, count=<optimized out>, isTopLevel=<optimized out>, run_once=<optimized out>, dest=0x19968e8, altdest=0x19968e8, completionTag=0x7ffe096d9730 "") at pquery.c:1344
#16 0x0000000000705d53 in exec_execute_message (max_rows=9223372036854775807, portal_name=0x19964d8 "p_7_4a39_3_137b0456") at postgres.c:2958
#17 PostgresMain (argc=<optimized out>, argv=<optimized out>, dbname=<optimized out>, username=<optimized out>) at postgres.c:5507
#18 0x000000000079c4ed in BackendRun (port=0x18898b0) at postmaster.c:4979
#19 BackendStartup (port=0x18898b0) at postmaster.c:4651
#20 ServerLoop () at postmaster.c:1956
#21 0x000000000079d366 in PostmasterMain (argc=5, argv=<optimized out>) at postmaster.c:1564
#22 0x0000000000497c53 in main (argc=5, argv=0x1855680) at main.c:228
(gdb)

It has already happened twice, so it seems like a fairly probable scenario; it happens with no RAM pressure.

The offending part seems to come from a corrupted pointer (see the sketch after the gdb output below). The offending line:

datarow = (RemoteDataRow) palloc(sizeof(RemoteDataRowData) + combiner->currentRow->msglen);
(gdb) p combiner->currentRow->msglen
value has been optimized out
(gdb) up
#1  0x00000000008b3e72 in FetchTuple (combiner=combiner@entry=0x1bcd7e0) at execRemote.c:2144
2144    in execRemote.c
(gdb) p *combiner
$4 = {ss = {ps = {type = T_RemoteSubplanState, plan = 0x188d7d8, state = 0x1bb9c08, ExecProcNode = 0x8bd660 <ExecRemoteSubplan>, ExecProcNodeReal = 0x8bd660 <ExecRemoteSubplan>, instrument = 0x0, worker_instrument = 0x0, qual = 0x0, lefttree = 0x0, righttree = 0x0, initPlan = 0x0, subPlan = 0x0, chgParam = 0x0, ps_ResultTupleSlot = 0x1bcdda0, ps_ExprContext = 0x1c92218, ps_ProjInfo = 0x0, skip_data_mask_check = 0 '\000', audit_fga_qual = 0x0}, ss_currentRelation = 0x0,
    ss_currentScanDesc = 0x0, ss_ScanTupleSlot = 0x0, ss_currentMaskDesc = 0x0, inited = 0 '\000'}, node_count = 0, connections = 0x1bce528, conn_count = 1, current_conn = 0, current_conn_rows_consumed = 1, combine_type = COMBINE_TYPE_NONE, command_complete_count = 11, request_type = REQUEST_TYPE_QUERY, tuple_desc = 0x0, description_count = 0, copy_in_count = 0, copy_out_count = 0, copy_file = 0x0, processed = 0, errorCode = "\000\000\000\000", errorMessage = 0x0,
  errorDetail = 0x0, errorHint = 0x0, returning_node = 0, currentRow = 0xf3, rowBuffer = 0x7f7f988e05f8, tapenodes = 0x0, tapemarks = 0x7f7f988e07c8, prerowBuffers = 0x0, dataRowBuffer = 0x0, dataRowMemSize = 0x7f7f988e0898, nDataRows = 0x0, tmpslot = 0x0, errorNode = 0x0, backend_pid = 0, is_abort = 0 '\000', merge_sort = 0 '\000', extended_query = 1 '\001', probing_primary = 0 '\000', tuplesortstate = 0x0, remoteCopyType = REMOTE_COPY_NONE, tuplestorestate = 0x0,
  cursor = 0x7f7f98bc0fe8 "p_7_4a39_2_137b044d", update_cursor = 0x0, cursor_count = 12, cursor_connections = 0x7f7f988e01d8, recv_node_count = 12, recv_tuples = 0, recv_total_time = -1, DML_processed = 0, conns = 0x0, ccount = 0, recv_datarows = 0}
(gdb) p combiner->currentRow->msglen
Cannot access memory at address 0xf7
(gdb) p *combiner->currentRow
Cannot access memory at address 0xf3
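
The value 0xf3 is not a plausible heap address, so it looks like combiner->currentRow itself gets clobbered, not just msglen. As a minimal sketch, here is the kind of sanity check one could try around the quoted allocation in CopyDataRowTupleToSlot (field names are taken from the gdb output above; the near-NULL test and the MaxAllocSize cap are illustrative assumptions, not an actual TBase patch):

/*
 * Hypothetical guard before the palloc at execRemote.c:1843.
 * It would turn obviously bogus values like currentRow = 0xf3 into a
 * clean error instead of a segfault; it does not fix the underlying
 * corruption, which must happen earlier in the row-buffer handling.
 */
RemoteDataRow row = combiner->currentRow;

if (row == NULL || (uintptr_t) row < 4096)      /* wild near-NULL pointer, e.g. 0xf3 */
    elog(ERROR, "corrupted currentRow pointer %p in combiner", (void *) row);

if (row->msglen < 0 || row->msglen > MaxAllocSize)
    elog(ERROR, "corrupted msglen %d in remote data row", row->msglen);

datarow = (RemoteDataRow) palloc(sizeof(RemoteDataRowData) + row->msglen);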

Any idea if this could be fixed?

The queries are all similar: they involve index lookups, a q3c index, and a lateral join with aggregates inside the lateral subquery.
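
For reference, the rough shape of the queries is something like the following (table and column names are invented for illustration, and the search radius is arbitrary):

SELECT s.source_id, l.n_matches, l.mean_mag
FROM sources s
CROSS JOIN LATERAL (
    SELECT count(*) AS n_matches, avg(o.mag) AS mean_mag
    FROM   observations o
    WHERE  q3c_join(s.ra, s.dec, o.ra, o.dec, 0.0005)  -- q3c index lookup on (o.ra, o.dec)
) l
WHERE s.source_id BETWEEN 1000000 AND 2000000;          -- plain index lookup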

yazun commented 3 years ago

There are roughly 250 processes running, so around 700-800 active processes per datanode.

yazun commented 3 years ago

I should also mention that we use a nonstandard block size (16 KB) and have both sender_thread_batch_size and sender_thread_buffer_size set to 64.
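
Concretely, that looks roughly like this on our side (the block size is a compile-time PostgreSQL configure option; the GUC names are the ones mentioned above, and their placement in postgresql.conf reflects our setup, not a recommendation):

# at build time (PostgreSQL configure option; block size in kB)
./configure --with-blocksize=16 ...

# postgresql.conf
sender_thread_batch_size = 64
sender_thread_buffer_size = 64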

yazun commented 3 years ago

We had another crash under the same load, but in a different place:

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `postgres: dr3_ops_cs36 surveys 192.168.168.154(20672) REMOTE SUBPLAN (coord10:'.
Program terminated with signal 11, Segmentation fault.
#0  pfree (pointer=<optimized out>, pointer=<optimized out>, pointer=<optimized out>) at mcxt.c:1027
1027    mcxt.c: No such file or directory.
Missing separate debuginfos, use: debuginfo-install R-core-3.6.0-1.el7.x86_64 bzip2-libs-1.0.6-13.el7.x86_64 cyrus-sasl-lib-2.1.26-23.el7.x86_64 glibc-2.17-307.el7.1.x86_64 keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.15.1-46.el7.x86_64 libcom_err-1.42.9-17.el7.x86_64 libgcc-4.8.5-39.el7.x86_64 libgfortran-4.8.5-39.el7.x86_64 libgomp-4.8.5-39.el7.x86_64 libicu-50.2-4.el7_7.x86_64 libquadmath-4.8.5-39.el7.x86_64 libselinux-2.5-15.el7.x86_64 libstdc++-4.8.5-39.el7.x86_64 libxml2-2.9.1-6.el7.4.x86_64 ncurses-libs-5.9-14.20130511.el7_4.x86_64 nspr-4.21.0-1.el7.x86_64 nss-3.44.0-7.el7_7.x86_64 nss-softokn-freebl-3.44.0-8.el7_7.x86_64 nss-util-3.44.0-4.el7_7.x86_64 openblas-Rblas-0.3.3-2.el7.x86_64 openldap-2.4.44-21.el7_6.x86_64 openssl-libs-1.0.2k-19.el7.x86_64 pcre-8.32-17.el7.x86_64 pcre2-10.23-2.el7.x86_64 readline-6.2-11.el7.x86_64 tre-0.8.0-18.20140228gitc2f5d13.el7.x86_64 xz-libs-5.2.2-1.el7.x86_64 zlib-1.2.7-18.el7.x86_64
(gdb) bt
#0  pfree (pointer=<optimized out>, pointer=<optimized out>, pointer=<optimized out>) at mcxt.c:1027
#1  0x00000000009038d9 in heap_freetuple (htup=<optimized out>) at heaptuple.c:1827
#2  ExecClearTuple (slot=0x2450830) at execTuples.c:499
#3  0x0000000000931ced in ExecEndCteScan (node=0x2450320) at nodeCtescan.c:291
#4  ExecEndNode (node=<optimized out>) at execProcnode.c:741
#5  0x0000000000931dcf in ExecEndNode (node=<optimized out>) at execProcnode.c:632
#6  ExecEndNestLoop (node=0x2356620) at nodeNestloop.c:396
#7  ExecEndNode (node=<optimized out>) at execProcnode.c:764
#8  0x0000000000931dcf in ExecEndNode (node=<optimized out>) at execProcnode.c:632
#9  ExecEndNestLoop (node=0x23577d8) at nodeNestloop.c:396
#10 ExecEndNode (node=<optimized out>) at execProcnode.c:764
#11 0x0000000000939960 in ExecEndNode (node=<optimized out>) at execProcnode.c:632
#12 ExecEndPlan (estate=0x2355958, planstate=<optimized out>) at execMain.c:1823
#13 standard_ExecutorEnd (queryDesc=0x2117688) at execMain.c:597
#14 0x00000000009978fc in PortalCleanup (portal=0x2113e08) at portalcmds.c:398
#15 0x00000000005143be in MarkPortalFailed (portal=<optimized out>, portal=<optimized out>, portal=<optimized out>) at portalmem.c:542
#16 0x00000000006d3698 in PortalRun (portal=0x2113e08, count=0, isTopLevel=<optimized out>, run_once=<optimized out>, dest=0x22170e8, altdest=0x22170e8, completionTag=0x7fff24850a80 "") at pquery.c:1510
#17 0x0000000000705d53 in exec_execute_message (max_rows=9223372036854775807, portal_name=0x2216cd8 "p_2_9333_4_d80da54") at postgres.c:2958
#18 PostgresMain (argc=<optimized out>, argv=<optimized out>, dbname=<optimized out>, username=<optimized out>) at postgres.c:5507
#19 0x000000000079c4ed in BackendRun (port=0x2118700) at postmaster.c:4979
#20 BackendStartup (port=0x2118700) at postmaster.c:4651
#21 ServerLoop () at postmaster.c:1956
#22 0x000000000079d366 in PostmasterMain (argc=5, argv=<optimized out>) at postmaster.c:1564
#23 0x0000000000497c53 in main (argc=5, argv=0x20d36b0) at main.c:228