apache / cloudberry

One advanced and mature open-source MPP (Massively Parallel Processing) database. Open source alternative to Greenplum Database.
https://cloudberry.apache.org
Apache License 2.0

[Bug] orca test cases failed due to server closed the connection unexpectedly #669

Open congxuebin opened 1 month ago

congxuebin commented 1 month ago

Cloudberry Database version

Cloudberry Database 1.7.0+dev.23.g200e3561 build 88554 commit:200e3561

What happened

+WARNING: terminating connection because of crash of another server process
+server closed the connection unexpectedly

parallel group (8 tests): qp_executor qp_with_clause qp_olap_window qp_misc_jiras qp_olap_windowerr qp_bitmapscan qp_derived_table qp_dropped_cols
qp_misc_jiras ... FAILED (test process exited with exit code 2) 903 ms (diff 714 ms)
qp_with_clause ... FAILED (test process exited with exit code 2) 897 ms (diff 1164 ms)
qp_executor ... ok 206 ms (diff 85 ms)
qp_olap_windowerr ... FAILED (test process exited with exit code 2) 905 ms (diff 796 ms)
qp_olap_window ... FAILED (test process exited with exit code 2) 902 ms (diff 8355 ms)
qp_derived_table ... FAILED (test process exited with exit code 2) 913 ms (diff 13612 ms)
qp_bitmapscan ... FAILED (test process exited with exit code 2) 911 ms (diff 1912 ms)
qp_dropped_cols ... FAILED (test process exited with exit code 2) 916 ms (diff 1505 ms)
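
When a parallel group fails like this, it can help to pull out just the names of the failing tests before digging into the diffs. The following is a minimal sketch, not part of the Cloudberry tooling: `failed_tests` is a hypothetical helper name, and it assumes the summary lines keep the `name ... FAILED` shape shown in this report.

```shell
#!/bin/sh
# Print the name of each test that the regression summary marks as FAILED.
# Reads the summary on stdin; the test name is the field right before "...".
failed_tests() {
    awk '{
        for (i = 2; i < NF; i++)
            if ($i == "..." && $(i + 1) == "FAILED") { print $(i - 1); break }
    }'
}

# Sample input copied from the report above (three of the eight tests).
summary='qp_misc_jiras ... FAILED (test process exited with exit code 2) 903 ms (diff 714 ms)
qp_executor ... ok 206 ms (diff 85 ms)
qp_with_clause ... FAILED (test process exited with exit code 2) 897 ms (diff 1164 ms)'

failed=$(printf '%s\n' "$summary" | failed_tests)
printf '%s\n' "$failed"
```

For the sample input this prints `qp_misc_jiras` and `qp_with_clause`, one per line. The same function can be fed a saved copy of the test run's output instead of the inline sample.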

What you think should happen instead

No response

How to reproduce

make -k PGOPTIONS='-c optimizer=on' installcheck-good

Operating System

centos7

Anything else

No response

Are you willing to submit PR?

Code of Conduct

gfphoenix78 commented 4 weeks ago

@congxuebin could you provide more details about the crash?

congxuebin commented 3 weeks ago

@gfphoenix78 Hi Hao,

The crash occurred while creating a table. Simply running the qp_misc_jiras test case on its own does not reproduce the problem, but the following steps do.

export PGOPTIONS='-c optimizer=on'
cd /code/cbdb_src/src/test/regress/results
make installcheck-good
CREATE TABLE qp_misc_jiras.tbl1544_child_depth_1_y_2000_year (
CONSTRAINT tbl1544_child_depth_1_y_2000_year_pdate_check
CHECK (pdate >= '2000-01-01'::date AND pdate < '2001-01-01'::date))
INHERITS (qp_misc_jiras.tbl1544)
;
NOTICE:  table has parent, setting distribution columns to match parent table
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and repeat your command.
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
connection to server was lost


edespino commented 2 weeks ago

I also ran into this issue on Rocky Linux 8 & 9.

make installcheck PGOPTIONS='-c optimizer=on'

@my-ship-it - Do we know what change introduced this issue? Do we know when this will be fixed? We need to get this fixed as soon as possible.

my-ship-it commented 2 weeks ago

@gfphoenix78 Could you please help on it, thanks!

Smyatkin-Maxim commented 2 weeks ago

Just for the record:

(gdb) bt
#0  __pthread_kill_implementation (no_tid=0, signo=11, threadid=133989073607040) at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (signo=11, threadid=133989073607040) at ./nptl/pthread_kill.c:78
#2  __GI___pthread_kill (threadid=133989073607040, signo=signo@entry=11) at ./nptl/pthread_kill.c:89
#3  0x000079dcc3a42476 in __GI_raise (sig=11) at ../sysdeps/posix/raise.c:26
#4  0x000079dcc4b75c5f in StandardHandlerForSigillSigsegvSigbus_OnMainThread (processName=0x79dcc510ea43 "Master process", postgres_signal_arg=11) at elog.c:5377
#5  0x000079dcc49842ef in CdbProgramErrorHandler (postgres_signal_arg=11) at postgres.c:3812
#6  <signal handler called>
#7  0x000079dcc4dfc7d8 in gpopt::CExpression::DeriveHasNonScalarFunction (this=0x0) at CExpression.cpp:1512
#8  0x000079dcc4e0b99d in gpopt::CExpressionPreprocessor::PexprTransposeSelectAndProject (mp=0x59258040c620, pexpr=0x5925815de2b0) at CExpressionPreprocessor.cpp:2847
#9  0x000079dcc4e0b7c1 in gpopt::CExpressionPreprocessor::PexprTransposeSelectAndProject (mp=0x59258040c620, pexpr=0x5925815bc770) at CExpressionPreprocessor.cpp:2892
#10 0x000079dcc4e10b92 in gpopt::CExpressionPreprocessor::PexprPreprocess (mp=mp@entry=0x59258040c620, pexpr=pexpr@entry=0x592580a15cf0, pcrsOutputAndOrderCols=pcrsOutputAndOrderCols@entry=0x5925815b6760)
    at CExpressionPreprocessor.cpp:3179
#11 0x000079dcc4dc780f in gpopt::CQueryContext::CQueryContext (this=0x5925815b6640, mp=0x59258040c620, pexpr=0x592580a15cf0, prpp=<optimized out>, colref_array=0x5925815b5f60, 
    pdrgpmdname=<optimized out>, fDeriveStats=true) at CQueryContext.cpp:65
#12 0x000079dcc4dc7d9d in gpopt::CQueryContext::PqcGenerate (mp=mp@entry=0x59258040c620, pexpr=pexpr@entry=0x592580a15cf0, pdrgpulQueryOutputColRefId=<optimized out>, 
    pdrgpmdname=pdrgpmdname@entry=0x59258065c7b0, fDeriveStats=fDeriveStats@entry=true) at CQueryContext.cpp:259
#13 0x000079dcc4e6e25c in gpopt::COptimizer::PdxlnOptimize (mp=mp@entry=0x59258040c620, md_accessor=md_accessor@entry=0x7fffbee5ce30, query=query@entry=0x592580614f00, 
    query_output_dxlnode_array=query_output_dxlnode_array@entry=0x592580614c70, cte_producers=cte_producers@entry=0x592580734b00, pceeval=pceeval@entry=0x592580831d60, ulHosts=3, ulSessionId=315, 
    ulCmdId=34, search_stage_array=0x0, optimizer_config=0x592580615bb8, szMinidumpFileName=0x0) at COptimizer.cpp:297
#14 0x000079dcc4f6b016 in COptTasks::OptimizeTask (ptr=<optimized out>) at COptTasks.cpp:573
#15 0x000079dcc4c7930d in gpos::CTask::Execute (this=this@entry=0x5925805706c0) at CTask.cpp:130
#16 0x000079dcc4c7a2da in gpos::CWorker::Execute (this=0x7fffbee5d470, task=task@entry=0x5925805706c0) at CWorker.cpp:80
#17 0x000079dcc4c78a60 in gpos::CAutoTaskProxy::Execute (this=this@entry=0x7fffbee5d4a0, task=task@entry=0x5925805706c0) at CAutoTaskProxy.cpp:286
#18 0x000079dcc4c7aeb2 in gpos_exec (params=0x7fffbee5d530) at _api.cpp:237
#19 0x000079dcc4f692c7 in COptTasks::Execute (func=0x79dcc4f6ac90 <COptTasks::OptimizeTask(void*)>, func_arg=0x7fffbee5d5b0) at COptTasks.cpp:234
#20 0x000079dcc4f6a2cc in COptTasks::GPOPTOptimizedPlan (query=query@entry=0x5925805a8f90, gpopt_context=gpopt_context@entry=0x7fffbee5d5b0) at COptTasks.cpp:770
#21 0x000079dcc4f6c25f in CGPOptimizer::GPOPTOptimizedPlan (query=0x5925805a8f90, had_unexpected_failure=0x7fffbee5d647) at CGPOptimizer.cpp:58
#22 0x000079dcc4839340 in optimize_query (parse=0x592580839180, cursorOptions=2048, boundParams=0x0) at orca.c:160
#23 0x000079dcc481a488 in standard_planner (parse=0x592580839180, 
    query_string=0x5925801f7320 "with diversecountries as\n(select country.code,country.name,country.capital,d.CNT\n from country, \n (select countrylanguage.countrycode,count(*) as CNT from countrylanguage group by countrycode\n  HAVING"..., cursorOptions=2048, boundParams=0x0) at planner.c:392
#24 0x000079dcc481a33f in planner (parse=0x592580839180, 
    query_string=0x5925801f7320 "with diversecountries as\n(select country.code,country.name,country.capital,d.CNT\n from country, \n (select countrylanguage.countrycode,count(*) as CNT from countrylanguage group by countrycode\n  HAVING"..., cursorOptions=2048, boundParams=0x0) at planner.c:333
#25 0x000079dcc497f0c2 in pg_plan_query (querytree=0x592580839180, 
    query_string=0x5925801f7320 "with diversecountries as\n(select country.code,country.name,country.capital,d.CNT\n from country, \n (select countrylanguage.countrycode,count(*) as CNT from countrylanguage group by countrycode\n  HAVING"..., cursorOptions=2048, boundParams=0x0) at postgres.c:995
#26 0x000079dcc497f21e in pg_plan_queries (querytrees=0x59258048ff28, 
    query_string=0x5925801f7320 "with diversecountries as\n(select country.code,country.name,country.capital,d.CNT\n from country, \n (select countrylanguage.countrycode,count(*) as CNT from countrylanguage group by countrycode\n  HAVING"..., cursorOptions=2048, boundParams=0x0) at postgres.c:1087
#27 0x000079dcc4980d17 in exec_simple_query (
    query_string=0x5925801f7320 "with diversecountries as\n(select country.code,country.name,country.capital,d.CNT\n from country, \n (select countrylanguage.countrycode,count(*) as CNT from countrylanguage group by countrycode\n  HAVING"...) at postgres.c:1854
#28 0x000079dcc4986db6 in PostgresMain (argc=1, argv=0x7fffbee5dcb0, dbname=0x592580225ba0 "regression", username=0x592580225b80 "smiatkin") at postgres.c:5595
#29 0x000079dcc48a7a3f in BackendRun (port=0x592580218720) at postmaster.c:5126
#30 0x000079dcc48a719f in BackendStartup (port=0x592580218720) at postmaster.c:4830
#31 0x000079dcc48a26aa in ServerLoop () at postmaster.c:2051
#32 0x000079dcc48a1c06 in PostmasterMain (argc=7, argv=0x5925801f1b10) at postmaster.c:1676
#33 0x000059258001baaa in main (argc=7, argv=0x5925801f1b10) at main/main.c:270
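
The interesting frame in a backtrace like this is the first one below `<signal handler called>`, since that is the code that was running when SIGSEGV arrived. Here it is frame #7, where `this=0x0` shows a member function being called on a null pointer. A small sketch for extracting that frame from a saved backtrace follows; `crash_frame` is a hypothetical helper name, and it assumes gdb's usual one-frame-per-line layout.

```shell
# Print the first frame below "<signal handler called>" in a gdb backtrace,
# i.e. the frame that was executing when the signal arrived.
crash_frame() {
    awk 'seen { print; exit } /<signal handler called>/ { seen = 1 }'
}

# Sample input: frames #5 to #8 copied from the backtrace above.
bt='#5  0x000079dcc49842ef in CdbProgramErrorHandler (postgres_signal_arg=11) at postgres.c:3812
#6  <signal handler called>
#7  0x000079dcc4dfc7d8 in gpopt::CExpression::DeriveHasNonScalarFunction (this=0x0) at CExpression.cpp:1512
#8  0x000079dcc4e0b99d in gpopt::CExpressionPreprocessor::PexprTransposeSelectAndProject (mp=0x59258040c620, pexpr=0x5925815de2b0) at CExpressionPreprocessor.cpp:2847'

frame=$(printf '%s\n' "$bt" | crash_frame)
printf '%s\n' "$frame"
```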

leborchuk commented 1 week ago

I cannot check whether this helps, because the issue does not reproduce in my dev environment (I am still trying). Looking at the recent commits, though, I see that we cherry-picked "Fix predicate pushdown using cast'd column (#13770)"

but did not take [Fix qp_with_clause testcase without asserts (#13878)](https://github.com/open-gpdb/gpdb/commit/fad65d796f7d2b7c17884d67ed7b79b216f11a71),

which fixed a bug at line 2846 of CExpressionPreprocessor.cpp (see the backtrace Maxim provided).

It looks like we should cherry-pick #13878 too.
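
A rough sketch of what picking up that commit could look like in a local checkout. This is an illustration under assumptions, not the procedure used for the actual PR: the remote name `open-gpdb` and the function name `cherry_pick_fix` are made up for this example, the repository URL is derived from the commit link above, and the pick may well need manual conflict resolution in the Cloudberry tree.

```shell
# Assumptions: run from inside a Cloudberry checkout; "open-gpdb" is a
# hypothetical remote name chosen for this sketch.
UPSTREAM_URL='https://github.com/open-gpdb/gpdb.git'
# Commit "Fix qp_with_clause testcase without asserts (#13878)"
FIX_COMMIT='fad65d796f7d2b7c17884d67ed7b79b216f11a71'

# Fetch the upstream history and apply the fix on the current branch.
# The -x flag records the original commit hash in the new commit message.
cherry_pick_fix() {
    git remote add open-gpdb "$UPSTREAM_URL" &&
    git fetch open-gpdb &&
    git cherry-pick -x "$FIX_COMMIT"
}

# Not invoked here; call cherry_pick_fix from inside the checkout.
```

Keeping the function uninvoked means the snippet can be sourced safely. Nothing touches the repository until `cherry_pick_fix` is called explicitly.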

leborchuk commented 1 week ago

Added https://github.com/apache/cloudberry/pull/708 to launch tests while checking it in my env

gfphoenix78 commented 1 week ago

Thank you @leborchuk, I'll check whether this patch fixes it. I can't reproduce the issue in my current environment, so I will test on other environments.

gfphoenix78 commented 1 week ago

Hi Ed, I couldn't reproduce this issue on my Rocky Linux 9 machine. Can you still reproduce the crash in your environment? If so, please try @leborchuk's PR https://github.com/apache/cloudberry/pull/708

edespino commented 1 week ago

@gfphoenix78

You should be able to reproduce the issue by building HEAD of main with the following configure options. You will need to update it for your environment. FYI: I build xerces-c from source instead of pulling from epel and that is why my configure command is the way it is.

        cd ~/cloudberry
        export LD_LIBRARY_PATH=/usr/local/cloudberry-db/lib:$LD_LIBRARY_PATH
        ./configure --prefix=/usr/local/cloudberry-db \
                    --disable-external-fts \
                    --enable-gpcloud \
                    --enable-ic-proxy \
                    --enable-mapreduce \
                    --enable-orafce \
                    --enable-orca \
                    --enable-pxf \
                    --enable-tap-tests \
                    --with-gssapi \
                    --with-ldap \
                    --with-libxml \
                    --with-lz4 \
                    --with-openssl \
                    --with-pam \
                    --with-perl \
                    --with-pgport=5432 \
                    --with-python \
                    --with-pythonsrc-ext \
                    --with-ssl=openssl \
                    --with-uuid=e2fs \
                    --with-includes=/usr/local/xerces-c/include \
                    --with-libraries=/usr/local/cloudberry-db/lib | tee configure-$(date "+%Y.%m.%d-%H.%M.%S").log

Here is the command I use to execute installcheck:

make installcheck PGOPTIONS='-c optimizer=on' --directory="$HOME/cloudberry"