Tencent / TBase

TBase is an enterprise-level distributed HTAP database. Through a single database cluster to provide users with highly consistent distributed database services and high-performance data warehouse services, a set of integrated enterprise-level solutions is formed.
Other
1.38k stars 262 forks source link

"relation ... deleted while still in use" error when using immutable lookup functions in create table as #83

Closed yazun closed 3 years ago

yazun commented 3 years ago

Continuing #82

We must rely on

create table tbl as select ... 

in an important scenario.

There is regression in comparison to XL which causes some immutable lookup functions to cause the error:

ERROR: XX000 - node:datanode1, backend_pid:39522, nodename:datanode1,backend_pid:39522,message:relation 523409537 deleted while still in use

Ie. this works:

create table r distribute by replication as select i i, i || '_val'::text itxt from generate_series(1,5) i;
create table r2 distribute by replication as select i i, i || '_val'::text itxt from generate_series(1,5) i;

CREATE OR REPLACE FUNCTION getr_text(id integer)
 RETURNS text
 LANGUAGE sql
 IMMUTABLE
 parallel safe
AS $function$
    select itxt from r2 where i = id;
$function$;

create table s (id int primary key, val text) distribute by shard(id) ;
insert into s  select id, id|| '_val' idval from generate_series(1,10000000) id;

-- this works
create  table mv_r_s_1 as
select s.*,r.*, getr_text(r.i),  getr_text(s.id) d2 from
r right outer join s on (getr_text(r.i) = s.val 
limit 0;

--- However if we use a function that does lookup on some well established table, we get the above error:

create  table mv_r_s_2 as
select s.*,r.*, getr_text(r.i),  getr_text(s.id) d2 from
r right outer join s on (getr_text(r.i) = s.val or 
dr3_ops_cs36_tmp.getmaincatalog_tst() = i % 4) --<---- this immutable function causes error
;

The function is very similar to getr_text - except is run on an older, replicated table.

You can see immutable function query to be shipped to datanode in the log at DEBUG5 level attached here.

bug_xz_create_as.txt

q2683252 commented 3 years ago

I run your code on our enviroument which works fine. Can you provide the definition of dr3_ops_cs36_tmp.getmaincatalog_tst and the query explained?

q2683252 commented 3 years ago

relation 523409537 deleted while still in use Is dr3_ops_cs36_tmp.getmaincatalog_tst using a table that dropped during execution phase?

q2683252 commented 3 years ago

To avoid the problem,you can try split create and select , such as create table a ( a int); insert into a select generate_series(1,1000);

yazun commented 3 years ago

I run your code on our enviroument which works fine. Can you provide the definition of dr3_ops_cs36_tmp.getmaincatalog_tst and the query explained?

Yes, The query also work fine without create table as .. this is why it's puzzling. The function is using a table like r in the above example, but the table is much older - this would be the main difference I guess.

create  table mv_r_s_1 as
surveys-# select s.*,r.*, getr_text(r.i),  getr_text(s.id) d2 from
surveys-# r right outer join s on (getr_text(r.i) = s.val or dr3_ops_cs36_tmp.getmaincatalog_tst() = i % 4)
surveys-# ;
                                                                                                QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop Left Join  (cost=0.00..715488.47 rows=34139 width=136)
   Join Filter: ((getr_text(r.i) = s.val) OR ('2'::smallint = (r.i % 4)))
   ->  Remote Subquery Scan on all (datanode1,datanode10,datanode11,datanode12,datanode2,datanode3,datanode4,datanode5,datanode6,datanode7,datanode8,datanode9)  (cost=100.00..276.65 rows=1850 width=36)
         ->  Seq Scan on s  (cost=0.00..104.50 rows=1850 width=36)
   ->  Materialize  (cost=100.00..282.20 rows=1850 width=36)
         ->  Remote Subquery Scan on all (datanode1)  (cost=100.00..276.65 rows=1850 width=36)
               ->  Seq Scan on r  (cost=0.00..104.50 rows=1850 width=36)
(7 rows)

the function causing problems is as simple as

CREATE OR REPLACE FUNCTION dr3_ops_cs36_tmp.getmaincatalog_tst()
 RETURNS smallint
 LANGUAGE sql
 immutable PARALLEL SAFE
AS $function$  select catalogid  from catalog where fowningcatalog and size > 10^7;  $function$
;

as long as there's reference to catalog table in this function, it breaks. It happens for other old tables...

explain  select catalogid  from catalog where fowningcatalog and size > 10^7;
                                           QUERY PLAN
------------------------------------------------------------------------------------------------
 Remote Fast Query Execution  (cost=0.00..0.00 rows=0 width=0)
   Node/s: datanode1
   ->  Seq Scan on catalog  (cost=0.00..13.93 rows=1 width=2)
         Filter: (fowningcatalog AND ((size)::double precision > '10000000'::double precision))
(4 rows)

The catalog table is there untouched. It's a static table, replicated. Query without create as ... works ok.

Unfortunately it is impossible for us to create the table by hand as we do not know the table definition: this technique is used to bootstrap a table by scientists with various queries, doing joins on huge partitioned tables and we use heavy parallel querying to populate it.

There's a function in psql from PG11 to get the definition of query buffer, but not sure if we could just patch it https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=49ca462eb165dea297f1f110e8eac064308e9d51#patch2 what do you think? We could then parse it to create a table as you suggest.

Of course, fixing the core problem would be the best..

Thanks a lot for help!

yazun commented 3 years ago

btw. vacuum freeze of catalog table did not help.

q2683252 commented 3 years ago

I run your code on our enviroument which works fine. Can you provide the definition of dr3_ops_cs36_tmp.getmaincatalog_tst and the query explained?

Yes, The query also work fine without create table as .. this is why it's puzzling. The function is using a table like r in the above example, but the table is much older - this would be the main difference I guess.

create  table mv_r_s_1 as
surveys-# select s.*,r.*, getr_text(r.i),  getr_text(s.id) d2 from
surveys-# r right outer join s on (getr_text(r.i) = s.val or dr3_ops_cs36_tmp.getmaincatalog_tst() = i % 4)
surveys-# ;
                                                                                                QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop Left Join  (cost=0.00..715488.47 rows=34139 width=136)
   Join Filter: ((getr_text(r.i) = s.val) OR ('2'::smallint = (r.i % 4)))
   ->  Remote Subquery Scan on all (datanode1,datanode10,datanode11,datanode12,datanode2,datanode3,datanode4,datanode5,datanode6,datanode7,datanode8,datanode9)  (cost=100.00..276.65 rows=1850 width=36)
         ->  Seq Scan on s  (cost=0.00..104.50 rows=1850 width=36)
   ->  Materialize  (cost=100.00..282.20 rows=1850 width=36)
         ->  Remote Subquery Scan on all (datanode1)  (cost=100.00..276.65 rows=1850 width=36)
               ->  Seq Scan on r  (cost=0.00..104.50 rows=1850 width=36)
(7 rows)

the function causing problems is as simple as

CREATE OR REPLACE FUNCTION dr3_ops_cs36_tmp.getmaincatalog_tst()
 RETURNS smallint
 LANGUAGE sql
 immutable PARALLEL SAFE
AS $function$  select catalogid  from catalog where fowningcatalog and size > 10^7;  $function$
;

as long as there's reference to catalog table in this function, it breaks. It happens for other old tables...

explain  select catalogid  from catalog where fowningcatalog and size > 10^7;
                                           QUERY PLAN
------------------------------------------------------------------------------------------------
 Remote Fast Query Execution  (cost=0.00..0.00 rows=0 width=0)
   Node/s: datanode1
   ->  Seq Scan on catalog  (cost=0.00..13.93 rows=1 width=2)
         Filter: (fowningcatalog AND ((size)::double precision > '10000000'::double precision))
(4 rows)

The catalog table is there untouched. It's a static table, replicated. Query without create as ... works ok.

Unfortunately it is impossible for us to create the table by hand as we do not know the table definition: this technique is used to bootstrap a table by scientists with various queries, doing joins on huge partitioned tables and we use heavy parallel querying to populate it.

There's a function in psql from PG11 to get the definition of query buffer, but not sure if we could just patch it https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=49ca462eb165dea297f1f110e8eac064308e9d51#patch2 what do you think? We could then parse it to create a table as you suggest.

Of course, fixing the core problem would be the best..

Thanks a lot for help!

Thanks for your detailed instruction. The issue occured in our database , We will try to solve it as soon as possible.

Btw, When your database will go online ? I wonder we have enough time to solve it.

yazun commented 3 years ago

Good news it is repeatable on your side!! Thanks a lot!

In principle we are ready to switch at any moment, we have two systems running in parallel (XL and TBase). We are to start the final test to make a stress test over this weekend and if successful it would be end of the next week.

For now, for this particular problem we patched psql with the above-mentioned \gdesc switch and finishing amending scripts, so hopefully we have a dirty workaround soon..

q2683252 commented 3 years ago

Good News,Problem solved,it's on master branch.

yazun commented 3 years ago

Much appreciated! Will test asap and will close the issue once validated.

yazun commented 3 years ago

Confirming it works ok! perfect!