Open yazun opened 3 years ago
Roughly 250 processes are running concurrently, which translates to around 700-800 active processes per datanode.
I should also mention that we use a nonstandard block size (16KB), and both sender_thread_batch_size and sender_thread_buffer_size are set to 64.
We had another crash under the same load, but at a different place:
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `postgres: dr3_ops_cs36 surveys 192.168.168.154(20672) REMOTE SUBPLAN (coord10:'.
Program terminated with signal 11, Segmentation fault.
#0 pfree (pointer=<optimized out>, pointer=<optimized out>, pointer=<optimized out>) at mcxt.c:1027
1027 mcxt.c: No such file or directory.
Missing separate debuginfos, use: debuginfo-install R-core-3.6.0-1.el7.x86_64 bzip2-libs-1.0.6-13.el7.x86_64 cyrus-sasl-lib-2.1.26-23.el7.x86_64 glibc-2.17-307.el7.1.x86_64 keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.15.1-46.el7.x86_64 libcom_err-1.42.9-17.el7.x86_64 libgcc-4.8.5-39.el7.x86_64 libgfortran-4.8.5-39.el7.x86_64 libgomp-4.8.5-39.el7.x86_64 libicu-50.2-4.el7_7.x86_64 libquadmath-4.8.5-39.el7.x86_64 libselinux-2.5-15.el7.x86_64 libstdc++-4.8.5-39.el7.x86_64 libxml2-2.9.1-6.el7.4.x86_64 ncurses-libs-5.9-14.20130511.el7_4.x86_64 nspr-4.21.0-1.el7.x86_64 nss-3.44.0-7.el7_7.x86_64 nss-softokn-freebl-3.44.0-8.el7_7.x86_64 nss-util-3.44.0-4.el7_7.x86_64 openblas-Rblas-0.3.3-2.el7.x86_64 openldap-2.4.44-21.el7_6.x86_64 openssl-libs-1.0.2k-19.el7.x86_64 pcre-8.32-17.el7.x86_64 pcre2-10.23-2.el7.x86_64 readline-6.2-11.el7.x86_64 tre-0.8.0-18.20140228gitc2f5d13.el7.x86_64 xz-libs-5.2.2-1.el7.x86_64 zlib-1.2.7-18.el7.x86_64
(gdb) bt
#0 pfree (pointer=<optimized out>, pointer=<optimized out>, pointer=<optimized out>) at mcxt.c:1027
#1 0x00000000009038d9 in heap_freetuple (htup=<optimized out>) at heaptuple.c:1827
#2 ExecClearTuple (slot=0x2450830) at execTuples.c:499
#3 0x0000000000931ced in ExecEndCteScan (node=0x2450320) at nodeCtescan.c:291
#4 ExecEndNode (node=<optimized out>) at execProcnode.c:741
#5 0x0000000000931dcf in ExecEndNode (node=<optimized out>) at execProcnode.c:632
#6 ExecEndNestLoop (node=0x2356620) at nodeNestloop.c:396
#7 ExecEndNode (node=<optimized out>) at execProcnode.c:764
#8 0x0000000000931dcf in ExecEndNode (node=<optimized out>) at execProcnode.c:632
#9 ExecEndNestLoop (node=0x23577d8) at nodeNestloop.c:396
#10 ExecEndNode (node=<optimized out>) at execProcnode.c:764
#11 0x0000000000939960 in ExecEndNode (node=<optimized out>) at execProcnode.c:632
#12 ExecEndPlan (estate=0x2355958, planstate=<optimized out>) at execMain.c:1823
#13 standard_ExecutorEnd (queryDesc=0x2117688) at execMain.c:597
#14 0x00000000009978fc in PortalCleanup (portal=0x2113e08) at portalcmds.c:398
#15 0x00000000005143be in MarkPortalFailed (portal=<optimized out>, portal=<optimized out>, portal=<optimized out>) at portalmem.c:542
#16 0x00000000006d3698 in PortalRun (portal=0x2113e08, count=0, isTopLevel=<optimized out>, run_once=<optimized out>, dest=0x22170e8, altdest=0x22170e8, completionTag=0x7fff24850a80 "") at pquery.c:1510
#17 0x0000000000705d53 in exec_execute_message (max_rows=9223372036854775807, portal_name=0x2216cd8 "p_2_9333_4_d80da54") at postgres.c:2958
#18 PostgresMain (argc=<optimized out>, argv=<optimized out>, dbname=<optimized out>, username=<optimized out>) at postgres.c:5507
#19 0x000000000079c4ed in BackendRun (port=0x2118700) at postmaster.c:4979
#20 BackendStartup (port=0x2118700) at postmaster.c:4651
#21 ServerLoop () at postmaster.c:1956
#22 0x000000000079d366 in PostmasterMain (argc=5, argv=<optimized out>) at postmaster.c:1564
#23 0x0000000000497c53 in main (argc=5, argv=0x20d36b0) at main.c:228
After roughly 10 hours of fairly intensive, mostly in-memory data crunching (50-60% CPU load, near-zero I/O load), we see a crash and a core dump as below:
It has already happened twice, so it seems like a highly probable scenario. It happens with no memory pressure.
The offending part seems to come from a corrupted packet? (offending line)
Any idea if this could be fixed?
The queries are similar and involve index lookups, a q3c index, and a lateral join with aggregates inside the lateral subquery.
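For illustration only, the general shape of such a query might look like the sketch below. It is not one of the actual queries: the table and column names (targets, observations, id, ra, dec, mag), the index name, and the 1 arcsecond radius are all hypothetical. The CTE is included only because the backtrace above ends a CteScan under nested loops. q3c_ang2ipix and q3c_join are the standard q3c functions.

```sql
-- Hypothetical schema: targets(id, ra, dec), observations(ra, dec, mag).
-- Functional q3c index that q3c_join can use for the positional lookup.
CREATE INDEX observations_q3c_idx
    ON observations (q3c_ang2ipix(ra, dec));

WITH batch AS (
    -- Plain index lookup selecting the sources to process (assumed id range).
    SELECT id, ra, dec
    FROM targets
    WHERE id BETWEEN 1 AND 1000
)
SELECT b.id, agg.n_obs, agg.mean_mag
FROM batch AS b
CROSS JOIN LATERAL (
    -- Aggregates computed per source inside the lateral subquery.
    SELECT count(*) AS n_obs, avg(o.mag) AS mean_mag
    FROM observations AS o
    -- Radius is in degrees; 1.0 / 3600 is 1 arcsecond (assumed value).
    WHERE q3c_join(b.ra, b.dec, o.ra, o.dec, 1.0 / 3600)
) AS agg;
```

A query of this shape typically plans as a nested loop over the CTE scan with an index scan on the inner side, which matches the node types visible in the backtrace, though the real plans may differ.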