apache / cloudberry

One advanced and mature open-source MPP (Massively Parallel Processing) database. Open source alternative to Greenplum Database.
https://cloudberry.apache.org
Apache License 2.0
469 stars 108 forks source link

[Bug] count(*) query crashed and db in recovery mode #212

Open liyxbeijing opened 1 year ago

liyxbeijing commented 1 year ago

Cloudberry Database version

warehouse=# select version(); version

PostgreSQL 14.4 (Cloudberry Database 1.4.0 build commit:e83e3ffc22d538deb2dbceeeae0138ca2de064e6) on x86_64-pc-linux-gnu, compiled by gcc (GCC) 10.2.1 20210130 (Red Hat 10.2.1-11), 64-b it compiled on Aug 3 2023 10:15:47 (1 row)

What happened

count(*) query crashed:

# \d+ zzz_admin.aaa
                                         Table "zzz_admin.aaa"
    Column     |               Type                | Collation | Nullable | Default | Storage  | Compression | Stats target | Description 
---------------+-----------------------------------+-----------+----------+---------+----------+-------------+--------------+-------------
 table_catalog | information_schema.sql_identifier |           |          |         | plain    |             |              | 
 table_schema  | information_schema.sql_identifier |           |          |         | plain    |             |              | 
 table_name    | information_schema.sql_identifier |           |          |         | plain    |             |              | 
 dt            | text                              |           |          |         | extended |             |              | 
Distributed by: (table_catalog)
Access method: heap

=# select count(*) from zzz_admin.aaa;
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
!?> \q

logs:

2023-09-21 08:40:37.498183 CST,"gpadmin","",p433642,th-891975552,"[local]",,2023-09-21 08:38:56 CST,0,con21392,cmd29,seg-1,,,,sx1,"LOG","00000","statement: select count(*) from zzz_admin.aaa;",,,,,,,0,,"postgres.c",1708,
2023-09-21 08:40:37.551299 CST,,,p433642,th123456,,,2023-09-21 08:38:56 CST,0,con21392,cmd29,seg-1,,,,,"PANIC","XX000","Unexpected internal error: Master process received signal SIGSEGV",,,,,,,0,,,,"1    0x7fb5c8cc6630 libpthread.so.0 <symbol not found> + 0xc8cc6630
2    0xd5cc6c postgres DirectFunctionCall1Coll (fmgr.c:806)
"
2023-09-21 08:40:37.651110 CST,,,p370720,th-891975552,,,,0,,,seg-1,,,,,"LOG","00000","server process (PID 433642) was terminated by signal 11: Segmentation fault","Failed process was running: select count(*) from zzz_admin.aaa;",,,,,,0,,"postmaster.c",4229,
2023-09-21 08:40:37.651133 CST,,,p370720,th-891975552,,,,0,,,seg-1,,,,,"LOG","00000","terminating any other active server processes",,,,,,,0,,"postmaster.c",3959,
2023-09-21 08:40:37.652496 CST,,,p75086,th-891975552,,,,0,,,seg-1,,,,,"WARNING","01000","ic-proxy-server: received signal 3",,,,,,,0,,"ic_proxy_main.c",474,
2023-09-21 08:40:37.652519 CST,"gpadmin","",p434360,th-891975552,"[local]",,2023-09-21 08:40:37 CST,0,,,seg-1,,,,,"FATAL","57P03","the database system is in recovery mode","last replayed record at 1C/27F1B788",,,,,,0,,"postmaster.c",2746,
2023-09-21 08:40:37.652569 CST,,,p370720,th-891975552,,,,0,,,seg-1,,,,,"LOG","00000","background worker ""ic proxy process"" (PID 75086) exited with exit code 1",,,,,,,0,,"postmaster.c",4208,
2023-09-21 08:40:37.656272 CST,"gpadmin",,p434361,th-891975552,"","18994",2023-09-21 08:40:37 CST,0,,,seg-1,,,,,"FATAL","57P03","the database system is in recovery mode","last replayed record at 1C/27F1B788",,,,,,0,,"postmaster.c",2746,
2023-09-21 08:40:37.665056 CST,,,p370720,th-891975552,,,,0,,,seg-1,,,,,"LOG","00000","all server processes terminated; reinitializing",,,,,,,0,,"postmaster.c",4515,
2023-09-21 08:40:37.831100 CST,,,p434363,th-891975552,,,,0,,,seg-1,,,,,"LOG","00000","database system was interrupted; last known up at 2023-09-21 08:38:41 CST",,,,,,,0,,"xlog.c",6816,
2023-09-21 08:40:37.831129 CST,,,p434363,th-891975552,,,,0,,,seg-1,,,,,"LOG","00000","Synchronization of the wal directory starts.",,,,,,,0,,"fd.c",3446,
2023-09-21 08:40:37.831191 CST,,,p434363,th-891975552,,,,0,,,seg-1,,,,,"LOG","00000","synchronization of the wal directory finishes.",,,,,,,0,,"fd.c",3448,
2023-09-21 08:40:37.831701 CST,,,p434363,th-891975552,,,,0,,,seg-1,,,,,"LOG","00000","database system was not properly shut down; automatic recovery in progress",,,,,,,0,,"xlog.c",7385,
2023-09-21 08:40:37.831701 CST,,,p434363,th-891975552,,,,0,,,seg-1,,,,,"LOG","00000","database system was not properly shut down; automatic recovery in progress",,,,,,,0,,"xlog.c",7385,
2023-09-21 08:40:37.974973 CST,,,p434363,th-891975552,,,,0,,,seg-1,,,,,"LOG","00000","redo starts at 1C/4035D810",,,,,,,0,,"xlog.c",7674,
2023-09-21 08:40:37.980902 CST,,,p434363,th-891975552,,,,0,,,seg-1,,,,,"LOG","00000","invalid record length at 1C/40918430: wanted 24, got 0",,,,,,,0,,"xlog.c",4482,
2023-09-21 08:40:37.980911 CST,,,p434363,th-891975552,,,,0,,,seg-1,,,,,"LOG","00000","redo done at 1C/40918260 system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s",,,,,,,0,,"xlog.c",7962,
2023-09-21 08:40:38.265962 CST,,,p434363,th-891975552,,,,0,,,seg-1,,,,,"LOG","00000","end of transaction log location is 1C/40918430",,,,,,,0,,"xlog.c",8051,
2023-09-21 08:40:38.281916 CST,,,p434363,th-891975552,,,,0,,,seg-1,,,,,"LOG","00000","latest completed transaction id is 1719112 and next transaction id is 1719113",,,,,,,0,,"xlog.c",8449,
2023-09-21 08:40:38.281968 CST,,,p434363,th-891975552,,,,0,,,seg-1,,,,,"LOG","00000","database system is ready",,,,,,,0,,"xlog.c",8476,
2023-09-21 08:40:38.285848 CST,,,p370720,th-891975552,,,,0,,,seg-1,,,,,"LOG","00000","PostgreSQL 14.4 (Cloudberry Database 1.4.0 build commit:e83e3ffc22d538deb2dbceeeae0138ca2de064e6) on x86_64-pc-linux-gnu, compiled by gcc (GCC) 10.2.1 20210130 (Red Hat 10.2.1-11), 64-bit compiled on Aug  3 2023 10:15:10",,,,,,,0,,"postmaster.c",3538,
2023-09-21 08:40:38.285861 CST,,,p370720,th-891975552,,,,0,,,seg-1,,,,,"LOG","00000","database system is ready to accept connections","PostgreSQL 14.4 (Cloudberry Database 1.4.0 build commit:e83e3ffc22d538deb2dbceeeae0138ca2de064e6) on x86_64-pc-linux-gnu, compiled by gcc (GCC) 10.2.1 20210130 (Red Hat 10.2.1-11), 64-bit compiled on Aug  3 2023 10:15:10",,,,,,0,,"postmaster.c",3540,
2023-09-21 08:40:38.291444 CST,,,p434368,th-891975552,,,,0,con1,,seg-1,,,,sx1,"LOG","00000","initialized 1 resource queues",,,,,,,0,,"resscheduler.c",267,
2023-09-21 08:40:38.362261 CST,,,p434368,th-891975552,,,,0,con1,,seg-1,,,,sx1,"LOG","00000","Crash recovery broadcast of the distributed transaction 'Commit Prepared' broadcast succeeded for gid = 2015839.",,,,,,,0,,"cdbdtxrecovery.c",84,
2023-09-21 08:40:38.364541 CST,,,p434368,th-891975552,,,,0,con1,,seg-1,,,,sx1,"LOG","00000","DTM Started",,,,,,,0,,"cdbdtxrecovery.c",155,
2023-09-21 08:40:38.368227 CST,,,p434374,th-891975552,,,,0,con4,,seg-1,,,,,"LOG","00000","pg_cron scheduler started",,,,,,,0,,"pg_cron.c",376,
2023-09-21 08:40:42.656687 CST,"gpadmin",,p434423,th-891975552,"","19002",2023-09-21 08:40:42 CST,0,,,seg-1,,,,sx1,"LOG","00000","rejecting TCP connection to master using internalconnection protocol",,,,,,,0,,"auth.c",570,
2023-09-21 08:40:42.697845 CST,"gpadmin",,p434423,th-891975552,"","19002",2023-09-21 08:40:42 CST,0,,,seg-1,,,,,"LOG","00000","standby ""gp_walreceiver"" is now a synchronous standby with priority 1",,,,,,"START_REPLICATION 1C/40000000 TIMELINE 1",0,,"syncrep.c",671,

What you think should happen instead

No response

How to reproduce

rerun the query.

Operating System

centos 7.9

Anything else

No response

Are you willing to submit PR?

Code of Conduct

congxuebin commented 12 months ago

Just a reminder, this issue was marked as priority high by Max.

gfphoenix78 commented 3 months ago

@liyxbeijing Could you provide the full step on repro? including creating table.

congxuebin commented 2 weeks ago

@liyxbeijing Max is paying attention to this issue. Would you please provide recreating steps?