apache / cloudberry

One advanced and mature open-source MPP (Massively Parallel Processing) database. Open source alternative to Greenplum Database.
https://cloudberry.apache.org
Apache License 2.0
430 stars 104 forks source link

Adding the ability to create a non-distributed table (can be used for postgres extensions) #645

Open visill opened 1 month ago

visill commented 1 month ago

Change logs

Add: create table table_name DISTRIBUTED BY LOCAL Creates a table that is queried only on the coordinator

Why are the changes needed?

Some postgres extensions (sr_plan or yezzey) needs to create local table. Before we do this by C-functions. Now we can do using SQL syntax

Does this PR introduce any user-facing change?

No

How was this patch tested?

It has regress test

Contributor's Checklist

Here are some reminders and checklists before/when submitting your pull request, please check them:

CLAassistant commented 1 month ago

CLA assistant check
All committers have signed the CLA.

wenchaozhang-123 commented 1 month ago

Should add test such as explain and execution for join with other distribution table.

reshke commented 1 month ago

Should add test for join with other distribution table.

Maybe, but main reason for having LOCAL relation is that they act exactly like pg_catalog relations (so, used only on coordinator node).

reshke commented 1 month ago

Should add test for join with other distribution table.

Maybe, but main reason for having LOCAL relation is that they act exactly like pg_catalog relations (so, used only on coordinator node).

for example, save-restore plan (sr_plan) extension uses relation to store query plans. Designed originally for postgres, this extension does not work well in Greenplum (or its forks), because it tries to perform distributed query each time we access sr_plans table.

https://github.com/yezzey-gp/ygp/blob/YGP_6.27_STABLE/gpcontrib/sr_plan/sr_plan--1.1--1.2.sql#L5-L15

wenchaozhang-123 commented 1 month ago

Squash commit message?

gfphoenix78 commented 1 month ago

Hi, @visill , thank you for your interesting about cloudberry. Could you give more details about why we should change the current data model? Coordinator in cloudberry inherits from greenplum. It doesn't process data scan, and tries to move the workload to the segments, so the database has the ability to scale to a large data processing system. Local table seems violates our design, we need more discussion about why it's necessary.

Cloudberry has many extensions that were backported from PostgreSQL. Generally, our principle is to adapt the extensions to our distributed database, not the other way around.

avamingli commented 1 month ago

Cloudberry has many extensions that were backported from PostgreSQL. Generally, our principle is to adapt the extensions to our distributed database, not the other way around.

Absolutely right.

This is wrong way, you need to make your extensions compatible with CBDB kernel codes. Greenplum/Cloudberry store catalog on master, other tables should be on segments in MPP mode. Could you explain more why need create table on master?

Why your extensions need to store data on master, if it must be , it's a catalog. If it needs to work only on master, you need to change your extensions with Gp_role to do that. A lot of extensions are modified to be compatible in MPP like that.

reshke commented 1 month ago

Could you explain more why need create table on master?

Because we need to store and access data on master, during planning phase. At the first glance it is about sr_plan extension. But turns out, it is not only about that.

Why your extensions need to store data on master, if it must be , it's a catalog. If it needs to work only on master, you need to change your extensions with Gp_role to do that. A lot of extensions are modified to be compatible in MPP like that.

While this is generally true, it is very unhandy to do that. Today I bumped into problems with another extension - pg_hint_plan. This extension also uses table to store its planner 'hints'. This table should persist on master exclusively.

So, the ability to force Greenplum to create relation without any distribution policy would help with at least two extensions now.

And yes, we can play with Gp_role inside extension installation script and all functions/sql/other that access this relation. Whilst it is achievable with sr_plan, it is too much work in other cases, especially for https://github.com/ossc-db/pg_hint_plan

We also can execute heap_create_with_catalog from extension install script, to create all needed relation as we want (with NULL gp_policy). But specifying one proper option in CREATE TABLE sql is just much more simple.

reshke commented 1 month ago

If one dislikes SQL approach, we can do it another way: using GUC to control transformDistributedBy behaviour when no distrib clause specified

avamingli commented 1 month ago

We also can execute heap_create_with_catalog from extension install script, to create all needed relation as we want (with NULL gp_policy

That is forbidden in GPDB, all tables should have a distribution policy , you are passing an invalid param with that function.

reshke commented 1 month ago

We also can execute heap_create_with_catalog from extension install script, to create all needed relation as we want (with NULL gp_policy

That is forbidden in GPDB, all tables should have a distribution policy , you are passing an invalid param with that function.

It is impossible to forbid something for extension. extension install script is executed under superuser

avamingli commented 1 month ago

We also can execute heap_create_with_catalog from extension install script, to create all needed relation as we want (with NULL gp_policy

That is forbidden in GPDB, all tables should have a distribution policy , you are passing an invalid param with that function.

It is impossible to forbid something for extension. extension install script is executed under superuser

I'm not talking about the permissions. Superuser doesn't mean volatile our MPP design is right. It's terrible that tables are created without distribution policy entry in MPP, and you intend to use that bug.