Tencent / TBase

TBase is an enterprise-level distributed HTAP database. A single database cluster provides users with highly consistent distributed database services and high-performance data warehouse services, forming an integrated enterprise-level solution.

CRITICAL ERROR: cache lookup failed for opclass 0 with 2.3.0 branch. #121

Open yazun opened 2 years ago

yazun commented 2 years ago

Hello, we deployed new binaries built from the latest v2.3.0-release branch.

Unfortunately, as the subject says, the new binaries fail to execute basic operations on partitioned tables. Has anything changed at the binary/catalog level with the latest big commits? We skimmed them but did not spot any system table changes. This is of course a blocker for us; could you please help?

 \d source
ERROR:  cache lookup failed for opclass 0
(dr3_ops_cs36@gaiadb12i:55431) [surveys] > explain select s.* from source s limit 10;
ERROR:  cache lookup failed for opclass 0
LINE 1: explain select s.* from source s limit 10;
                                  ^

Non-partitioned tables seem to work ok.

yazun commented 2 years ago

It fails at relcache.c:1089:

(gdb) bt
#0  RelationBuildPartitionKey (relation=relation@entry=0x7f135e954e88) at utils/cache/relcache.c:1086
#1  0x0000000000b1f2c0 in RelationBuildDesc (targetRelId=<optimized out>, insertIt=insertIt@entry=1 '\001') at utils/cache/relcache.c:1635
#2  0x0000000000b20c72 in RelationIdGetRelation (relationId=<optimized out>) at utils/cache/relcache.c:2402
#3  0x0000000000beb910 in relation_open.constprop.1 (relationId=206355, lockmode=0) at access/heap/heapam.c:1322
#4  0x0000000000544b9a in relation_openrv_extended (missing_ok=1 '\001', lockmode=1, relation=0x2902d80) at access/heap/heapam.c:1424
#5  heap_openrv_extended (relation=0x2902d80, lockmode=1, missing_ok=1 '\001') at access/heap/heapam.c:1545
#6  0x0000000000668c57 in parserOpenTable (pstate=0x2903090, relation=0x2902d80, lockmode=<optimized out>) at parser/parse_relation.c:1185
#7  0x0000000000668f6a in addRangeTableEntry (pstate=0x2903090, relation=0x2902d80, alias=0x0, inh=<optimized out>, inFromCl=<optimized out>) at parser/parse_relation.c:1319
#8  0x0000000000651464 in transformTableEntry (r=0x2902d80, pstate=0x2903090) at parser/parse_clause.c:1240
#9  transformFromClauseItem (pstate=0x2903090, n=0x2902d80, top_rte=0x7ffd0c44a258, top_rti=0x7ffd0c44a24c, namespace=0x7ffd0c44a250) at parser/parse_clause.c:1240
#10 0x00000000006533a7 in transformFromClause (pstate=0x2903090, frmList=<optimized out>) at parser/parse_clause.c:195
#11 0x000000000062d054 in transformSelectStmt (pstate=0x2903090, stmt=0x2902e60) at parser/analyze.c:1670
#12 0x0000000000630ce5 in transformStmt (pstate=0x2903090, parseTree=0x2902e60) at parser/analyze.c:404
#13 0x0000000000c03e7e in transformOptionalSelectInto (parseTree=0x2902e60, pstate=0x2903090) at parser/analyze.c:349
#14 transformTopLevelStmt (parseTree=0x2903010, pstate=0x2903090) at parser/analyze.c:299
#15 parse_analyze (queryEnv=0x0, numParams=0, paramTypes=0x0, sourceText=0x2902228 "select * from source limit 1;", parseTree=0x2903010) at parser/analyze.c:219
#16 pg_analyze_and_rewrite.constprop.0 (parsetree=0x2903010, query_string=0x2902228 "select * from source limit 1;", queryEnv=0x0, numParams=0, paramTypes=0x0) at tcop/postgres.c:881
#17 0x00000000009aa95f in exec_simple_query (query_string=0x2902228 "select * from source limit 1;") at tcop/postgres.c:1377
#18 0x00000000009ba1df in PostgresMain (argc=<optimized out>, argv=<optimized out>, dbname=<optimized out>, username=<optimized out>) at tcop/postgres.c:5456
#19 0x00000000008f9ca8 in BackendRun (port=0x2849ae0) at postmaster/postmaster.c:4982
#20 BackendStartup (port=0x2849ae0) at postmaster/postmaster.c:4654
#21 ServerLoop () at postmaster/postmaster.c:1959
#22 0x00000000008fab9c in PostmasterMain (argc=<optimized out>, argv=0x281e990) at postmas

...

p *opclass
$5 = {vl_len_ = 112, ndim = 1, dataoffset = 0, elemtype = 26, dim1 = 1, lbound1 = 0, values = 0x7f135e9514e0}
(gdb) p *opclass-> values
$6 = 0

p *relation
$7 = {rd_node = {spcNode = 0, dbNode = 0, relNode = 0}, rd_smgr = 0x0, rd_refcnt = 0, rd_backend = -1, rd_islocaltemp = 0 '\000', rd_isnailed = 0 '\000', rd_isvalid = 0 '\000', rd_indexvalid = 0 '\000', rd_statvalid = 0 '\000', rd_createSubid = 0, rd_newRelfilenodeSubid = 0, rd_rel = 0x7f135e954298, rd_att = 0x2a37ee8, rd_id = 206355, rd_lockInfo = {lockRelId = {relId = 0, dbId = 0}}, rd_rules = 0x0, rd_rulescxt = 0x0, trigdesc = 0x0, rd_rsdesc = 0x0, rd_cls_struct = 0x0, rd_fkeylist = 0x0,
  rd_fkeyvalid = 0 '\000', rd_partkeycxt = 0x0, rd_partkey = 0x0, rd_pdcxt = 0x0, rd_partdesc = 0x0, rd_partcheck = 0x0, rd_indexlist = 0x0, rd_oidindex = 0, rd_pkindex = 0, rd_replidindex = 0, rd_statlist = 0x0, rd_indexattr = 0x0, rd_keyattr = 0x0, rd_pkattr = 0x0, rd_idattr = 0x0, rd_pubactions = 0x0, rd_options = 0x0, rd_index = 0x0, rd_indextuple = 0x0, rd_amhandler = 0, rd_indexcxt = 0x0, rd_amroutine = 0x0, rd_opfamily = 0x0, rd_opcintype = 0x0, rd_support = 0x0, rd_supportinfo = 0x0,
  rd_indoption = 0x0, rd_indexprs = 0x0, rd_indpred = 0x0, rd_exclops = 0x0, rd_exclprocs = 0x0, rd_exclstrats = 0x0, rd_amcache = 0x0, rd_indcollation = 0x0, rd_fdwroutine = 0x0, rd_toastoid = 0, pgstat_info = 0x0, rd_locator_info = 0x7f135e954b78, rd_partitions_info = 0x0, rd_lru_list_elem = {prev = 0x0, next = 0x0}}

yazun commented 2 years ago

None of the tables from pg_partitioned_table can be used. Leaf partitions are queryable though. I hope no migration is needed to get them to work?

yazun commented 2 years ago

 select relname,p.* from pg_partitioned_table p  join pg_Class c  on (partrelid = c.oid) where relname = 'source';
 relname | partrelid | partstrat | partnatts | partattrs | partclass | partcollation | partexprs
---------+-----------+-----------+-----------+-----------+-----------+---------------+-----------
 source  |    206355 | r         |         1 | 2         | 3124      | 0             | [null]

JennyJennyChen commented 2 years ago

I'm very sorry, I didn't express it clearly. To use the new partitioned-table features in v2.3.0, you cannot simply replace the binary package; you must reinstall. The metadata of many system tables has changed: for example, pg_proc.h adds many functions for partitioned-table operations, and pg_partitioned_table.h adds the partdefid attribute.
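
One way to picture why new binaries over an old catalog can break partitioned tables is the following minimal Python sketch (not TBase code). A reader that expects the new column order decodes, by position, a row that was stored in the old order. The old column order and the row values are taken from the pg_partitioned_table output earlier in this thread. The position of partdefid (directly after partnatts, as in upstream PostgreSQL 11) is an assumption, and so is the idea that this positional shift is what produced "opclass 0".

```python
# Column order of pg_partitioned_table as printed earlier in this thread (old catalog).
OLD_COLUMNS = ["partrelid", "partstrat", "partnatts",
               "partattrs", "partclass", "partcollation", "partexprs"]

# ASSUMPTION: the new layout places partdefid right after partnatts
# (as upstream PostgreSQL 11 does); the thread itself does not state the position.
NEW_COLUMNS = ["partrelid", "partstrat", "partnatts", "partdefid",
               "partattrs", "partclass", "partcollation", "partexprs"]

# The row for table "source" as stored by the old binaries.
old_row = [206355, "r", 1, "2", "3124", "0", None]


def read_by_position(row, columns, name):
    """Return the value a reader using `columns` would see for `name`."""
    index = columns.index(name)
    # A row written under the old layout is one column short for the new reader.
    return row[index] if index < len(row) else None


seen_by_old = read_by_position(old_row, OLD_COLUMNS, "partclass")
seen_by_new = read_by_position(old_row, NEW_COLUMNS, "partclass")
print("old layout reads partclass =", seen_by_old)
print("new layout reads partclass =", seen_by_new)
```

In this toy model the new reader picks up the old partcollation value, 0, where it expects an operator class, which is at least consistent with the reported error message.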

The new PG partitioning features only arrive with a major-version upgrade, and a major-version upgrade inevitably changes the metadata, which requires a reinstall. TBase's new partitioning feature therefore requires a reinstall as well. I am very sorry that we can't solve this.

If you want to verify the fixes for https://github.com/Tencent/TBase/issues/108 and https://github.com/Tencent/TBase/issues/106 , you can just use the latest v2.2.0 code for a binary replacement.

yazun commented 2 years ago

Thank you for the confirmation! I noticed the new partdefid attribute added to pg_partitioned_table and was wondering whether adding it would be enough, but since pg_proc.h has also changed, that is obviously not possible.

We will have to plan the migration more carefully, then.

yazun commented 2 years ago

As a side question: could pg_upgrade work eventually?

yazun commented 2 years ago

And could we use logical replication to publish from v2.2 to v2.3?

JennyJennyChen commented 2 years ago

1. pg_upgrade cannot be used for upgrades from 2.2.0 to 2.3.0.
2. Logical replication can be used to publish from v2.2 to v2.3. However, it can only synchronize user data in hash-sharded tables, and DDL is not supported.

yazun commented 2 years ago

Thank you for the clarifications.

Thinking about the pg_proc.h changes: if pg_proc.h were changed only incrementally (so that no old/new OID conflicts exist), should binary replacement be possible? Of course, system table changes would still be needed, at least for pg_partitioned_table as discussed.

JennyJennyChen commented 2 years ago

Yes. If pg_proc.h were changed only incrementally (so that no old/new OID conflicts exist) and there were no other metadata modifications, then in theory you could upgrade by replacing the binary package, but anything involving the newly added procs would not work normally after the upgrade.
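
The "incremental" condition discussed here can be written down as a small check. The sketch below is illustrative Python, not a TBase tool. It treats each catalog as a mapping from OID to function name and calls a change incremental when every old OID is still present with the same name, so that new entries only use previously unused OIDs. The OIDs and function names are invented for the example.

```python
def is_incremental(old_procs, new_procs):
    """True if every old OID survives unchanged in the new catalog."""
    return all(new_procs.get(oid) == name for oid, name in old_procs.items())


def added_oids(old_procs, new_procs):
    """OIDs that exist only in the new catalog, in ascending order."""
    return sorted(set(new_procs) - set(old_procs))


# Hypothetical data: these OIDs and names are invented for illustration.
old_procs = {9001: "func_a", 9002: "func_b"}
appended = {9001: "func_a", 9002: "func_b", 9003: "func_c"}
reassigned = {9001: "func_a", 9002: "func_c", 9003: "func_b"}

print(is_incremental(old_procs, appended))    # True
print(is_incremental(old_procs, reassigned))  # False
print(added_oids(old_procs, appended))        # [9003]
```

Even when the check passes, the functions listed by added_oids would be absent from a catalog created by the old version, which matches the caveat that the newly added procs cannot work after a binary-only upgrade.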

yazun commented 2 years ago

Maybe we would consider this, as it is a much easier path than a full migration: rewrite the patch so that the new OIDs are incremental, and change the position of the new column in pg_partitioned_table by appending it instead of inserting it in the middle. Could you confirm that these should be the only changes needed to bring binary compatibility back?
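
The column-layout half of this proposal can be sketched as a simple property check. The Python below is a hypothetical helper, not part of TBase. It reports whether a new column list merely appends to the old one, which is the property that keeps existing column positions stable. The old column order comes from the query output earlier in the thread; the exact position of partdefid in the "inserted" variant is an assumption.

```python
def is_append_only(old_columns, new_columns):
    """True if new_columns starts with old_columns, i.e. columns were only appended."""
    return new_columns[:len(old_columns)] == old_columns


OLD = ["partrelid", "partstrat", "partnatts",
       "partattrs", "partclass", "partcollation", "partexprs"]

# ASSUMPTION: where partdefid sits in the released layout is not stated in the thread.
INSERTED = OLD[:3] + ["partdefid"] + OLD[3:]
APPENDED = OLD + ["partdefid"]

print(is_append_only(OLD, INSERTED))  # False
print(is_append_only(OLD, APPENDED))  # True
```

Passing this check only says that old positions are preserved; it does not by itself establish that old catalog rows are safe to read with new binaries.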

JennyJennyChen commented 2 years ago

pg_amproc.h was also changed in an incremental way. We have not verified the method you mentioned, and there may be unpredictable risks. You can test whether it works with a small amount of data.