Compatible with Postgres system catalog

tisonkun commented 5 months ago

What type of enhancement is this?

User experience

What does the enhancement do?

Since we want to support users to access GreptimeDB with psql, we'd better be more compatible when possible.

Currently, when you run psql -l against GreptimeDB, it would fail with:

psql: ERROR:  Failed to plan SQL: Error during planning: Table not found: greptime.pg_catalog.pg_database

Implementation challenges

This should be similar to MySQL's information

tisonkun commented 5 months ago

Not quite easy. Seems related to some internal functions also:

Failed to execute query: SELECT d.datname as "Name",
       pg_catalog.pg_get_userbyid(d.datdba) as "Owner",
       pg_catalog.pg_encoding_to_char(d.encoding) as "Encoding",
       pg_catalog.array_to_string(d.datacl, '\n') AS "Access privileges"
FROM pg_catalog.pg_database d
ORDER BY 1; err=0: Table operation error, at src/frontend/src/instance.rs:272:14
1: Failed to plan statement, at /Users/tison/GreptimeWorkspace/greptimedb/src/operator/src/statement.rs:253:14
2: Failed to plan SQL, at src/query/src/planner.rs:90:14
...

tisonkun commented 5 months ago

And this is the query for "show tables" (\dt):

Failed to execute query: SELECT n.nspname as "Schema",
  c.relname as "Name",
  CASE c.relkind WHEN 'r' THEN 'table' WHEN 'v' THEN 'view' WHEN 'm' THEN 'materialized view' WHEN 'i' THEN 'index' WHEN 'S' THEN 'sequence' WHEN 's' THEN 'special' WHEN 't' THEN 'TOAST table' WHEN 'f' THEN 'foreign table' WHEN 'p' THEN 'partitioned table' WHEN 'I' THEN 'partitioned index' END as "Type",
  pg_catalog.pg_get_userbyid(c.relowner) as "Owner"
FROM pg_catalog.pg_class c
     LEFT JOIN pg_catalog.pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind IN ('r','p','')
      AND n.nspname <> 'pg_catalog'
      AND n.nspname !~ '^pg_toast'
      AND n.nspname <> 'information_schema'
  AND pg_catalog.pg_table_is_visible(c.oid)
ORDER BY 1,2; err=0: Table operation error, at src/frontend/src/instance.rs:272:14
1: Failed to plan statement, at /Users/tison/GreptimeWorkspace/greptimedb/src/operator/src/statement.rs:253:14
2: Failed to plan SQL, at src/query/src/planner.rs:90:14

yihong0618 commented 5 months ago

I did some research about this before, it needs a lot work, and most of them had been done by risingwave, maybe you can use some code from them, and make less effort for this.

FYI: you can refer this issue. some of them can easily add in greptime others maybe need to add in pgwire

https://github.com/risingwavelabs/risingwave/issues/2954

killme2008 commented 5 months ago

Yes, we are doing a lot of work to be compatible with MySQL, pg is our next goal.

yihong0618 commented 5 months ago

Yes, we are doing a lot of work to be compatible with MySQL, pg is our next goal.

One thing came to my mind that databend, risgingwave and greptime all using rust and all maybe have the need to implement protocols risingwave mainly for pg, and databend mainly for mysql, maybe you can extract something common and help each other, maybe its a triple-win thing.

sundy-li commented 5 months ago

Let's add pgwrite into databend!

J0HN50N133 commented 5 months ago

I would like to get involved in this work.😊

tisonkun commented 5 months ago

@J0HN50N133 Welcome! You can take a look at how we implement the "information_schema" tables as a reference.

But I'm also still unaware of how to add the necessary functions or work it around.

J0HN50N133 commented 4 months ago

@tisonkun BELOW IS ALSO THE PROPOSAL SUMITTED TO OSPP To support pg_catalog related features. The main challenge is to support the System Information Functions and Operators.

I will try to add a new schema called pg_catalog and related table for trivial cases like \d under src/catalog/pg_catalog and decide how to implement the system function according to what happen in DFLogicalPlanner. Related system functions may be implemented via udf.

Although SHOW TABLES and SHOW DATABASES can also work even under psql, pg users may still depend on pg_catalog

Goals

support normal system catalog related query(trivial statements first)
pg_catalog should be optional, users shouldn't pay for this if they don't need pg. Data should be able to build from information_schema.

System Catalogs implementation plan

In the current version, tables following are possible to support:

The first three tables will be implemented first since they are basis.

Implementation

pg_catalog management

add a new schema called pg_catalog and related table for trivial cases like \d under src/catalog/pg_catalog. Create a new SchemaProvider called PgCatalogProvider and mount it under SystemCatalog. Then while finding table for query, catalog_manager will first check system_catalog. https://github.com/GreptimeTeam/greptimedb/blob/65d47bab5607b60aa25c50f57e53b762fe979c68/src/catalog/src/kvbackend/manager.rs#L252
Each System table is implemented as a struct under pg_catalog/. To make them work as a Table, we just need to do some adaption between those implementation of pg_catalog.tables and Table.
Since view is supported recently in #3807 , view related catalog function should also be supported. The id of view is also implemented as TableId in #3807. So we could reuse TableId as oid for view as we will do for table.
There is currently no concept of Namespace in greptime; to workaround, it can be considered equivalent to a database, i.e. each database contains only one namespace. To provide a comparison, the data organization in greptime and postgres is as follows: Pg: Database -> Namespace(Schema) -> Table Gt: Catalog -> Database(Schema) -> Table (we can't create new Catalog currently)

pg_catalog related system function

Implement them with datafusion udf. Like:

https://github.com/GreptimeTeam/greptimedb/blob/578dd8f87a2897fa586e5833fead18cc24d1912a/src/common/function/src/function_registry.rs#L90

J0HN50N133 commented 2 months ago

I'll manage to provide similar features in #2931.

Tasks

[ ] pg_database
[x] pg_namespace
[x] pg_class
[x] pg_collation, not supported yet, just like the mysql ver. did
[x] pg_trigger, we don't support it.
[x] pg_type
[ ] pg_partitioned_table
[ ] pg_statistic

J0HN50N133 commented 1 month ago

While supporting \dt, i.e. the following sql statement, I met some problems:

SELECT n.nspname as "Schema",
  c.relname as "Name",
  CASE c.relkind WHEN 'r' THEN 'table' WHEN 'v' THEN 'view' WHEN 'm' THEN 'materialized view' WHEN 'i' THEN 'index' WHEN 'S' THEN 'sequence' WHEN 't' THEN 'TOAST table' WHEN 'f' THEN 'foreign table' WHEN 'p' THEN 'partitioned table' WHEN 'I' THEN 'partitioned index' END as "Type",
  pg_catalog.pg_get_userbyid(c.relowner) as "Owner"
FROM pg_catalog.pg_class c
     LEFT JOIN pg_catalog.pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind IN ('r','p','')
      AND n.nspname <> 'pg_catalog'
      AND n.nspname !~ '^pg_toast'
      AND n.nspname <> 'information_schema'
  AND pg_catalog.pg_table_is_visible(c.oid)
ORDER BY 1,2;

n.oid = c.relnamespace, this predicate need a namespace.oid: u32 to check equality, which can be seen as schema's id in greptime. However, there is no such thing like schema_id in current SchemaManager. https://github.com/J0HN50N133/greptimedb/blob/906abd61de5e6da4e2ae5bba3f13fea567509166/src/common/meta/src/key/schema_name.rs#L54 Should we add this field in SchemaManager? This may break backward compatibility.
pg_get_userbyid: it seems that user authentication is not supported in the open source codebase. Should I just return empty string?

@tisonkun WDYT?

tisonkun commented 1 month ago

@J0HN50N133 Thanks for pushing forward this ticket!

pg_get_userbyid: it seems that user authentication is not supported in the open source codebase. Should I just return empty string?

Yeah. This is reasonable as a first implementation. cc @shuiyisong @zyy17 later we may consider how to integrate it with the plugin system to allow populate user info in this case.

n.oid = c.relnamespace

Adding ID for schema can be a breaking change and introduce overload burden to overcome in this issue.

cc @killme2008 @waynexia please take a look at this information schema related logics. A workaround can be using the schema name as "oid" in this place.

J0HN50N133 commented 1 month ago

@tisonkun

A workaround can be using the schema name as "oid" in this place.

The shortcomings is the potential incompatibility with existing pg eco.

tisonkun commented 1 month ago

The shortcomings is the potential incompatibility with existing pg eco.

Yeah. Some how the type is different, but comparing should work.

cc @killme2008 @waynexia what's the affect if we try to add id to each schema?

J0HN50N133 commented 1 month ago

Next I will try to support dbeaver connection to greptime, via pg protocol. The following sql is the statements fail.

SELECT typcategory FROM pg_catalog.pg_type WHERE 1<>1 LIMIT 1

SELECT t.oid,t.*,c.relkind,NULL as base_type_name, d.description
FROM pg_catalog.pg_type t
LEFT OUTER JOIN pg_catalog.pg_class c ON c.oid=t.typrelid
LEFT OUTER JOIN pg_catalog.pg_description d ON t.oid=d.objoid
WHERE t.typname IS NOT NULL
AND (c.relkind IS NULL OR c.relkind = 'c')

sunng87 commented 3 weeks ago

@J0HN50N133 I just found that tables like pg_class, pg_namespace are hidden tables that available under all schemas, which means some client can run sql like:

SELECT c.relname FROM pg_class c JOIN pg_namespace n ON n.oid = c.relnamespace WHERE n.nspname = 'public' AND c.relkind in ('r', 'p')

without calling pg_catalog.pg_class.

One idea is to resolve those table like pg_class, pg_namespace and etc. under any schema to pg_catalog.*

J0HN50N133 commented 3 weeks ago

@J0HN50N133 I just found that tables like pg_class, pg_namespace are hidden tables that available under all schemas, which means some client can run sql like:
SELECT c.relname FROM pg_class c JOIN pg_namespace n ON n.oid = c.relnamespace WHERE n.nspname = 'public' AND c.relkind in ('r', 'p')
without calling pg_catalog.pg_class.

One idea is to resolve those table like pg_class, pg_namespace and etc. under any schema to pg_catalog.*

@sunng87 Yeah, #4543 maybe also relate to this, for example, this hidden table should not appear in MySQL client. Maybe we modify DfTableSourceProvider to solve both problems.

GreptimeTeam / greptimedb