PostgREST / postgrest

REST API for any Postgres database
https://postgrest.org
MIT License

Schema-based, incremental versioning #2166

Open wolfgangwalther opened 2 years ago

wolfgangwalther commented 2 years ago

I was recently pointed at https://begriffs.com/posts/2014-12-30-intro-to-postgrest.html, where Joe presented the core ideas that PostgREST was built on initially. From 8:30 on, he talks about schema-based versioning with fall-through. I think this is a great idea. I guess we have since moved on from the Accept: ...; version= parameter - and that's fine.

However, I'd like to pick up the fall-through idea again, because I think we can considerably improve the schema-based versioning experience we currently have.


Right now, we support multiple schemas through Accept-Profile and Content-Profile headers. However, the strict separation between schemas in this case makes the profile headers more useful for multi-tenancy setups than for (incremental) versioning.

One other thing to consider is that there are several different ways in which you can introduce breaking changes to your API - for which you then need api versioning: the most obvious, of course, is changing the exposed schema in a breaking way. However, breaking changes can also be introduced by upgrading PostgREST itself to a new major version - or by just changing one of PostgREST's config options. To stay backwards compatible in those cases it might make sense to run multiple instances of PostgREST (older and newer versions) at the same time. In this case a versioning approach based on schema and authentication role is useful:

-- main authenticator role - grant all api roles to this role
-- this is not used for login directly anymore
CREATE ROLE authenticator;

-- this role is used for the PostgREST instance that serves api version 1
CREATE ROLE v1 IN ROLE authenticator LOGIN;
ALTER ROLE v1 SET pgrst.db_schemas TO 'v1';
ALTER ROLE v1 SET pgrst.raw_media_types TO 'text/plain';

CREATE SCHEMA v1;
-- ...

-- now we introduce some breaking changes in v2:
CREATE ROLE v2 IN ROLE authenticator LOGIN;
ALTER ROLE v2 SET pgrst.db_schemas TO 'v2';
-- this is a breaking change
ALTER ROLE v2 SET pgrst.raw_media_types TO '';

CREATE SCHEMA v2;
-- ...

We can now run two PostgREST instances at the same time, each of which will log in as v1 or v2 respectively. I guess the most common setup would then be to map the URI prefixes /api/v1/... and /api/v2/... to those instances, via nginx for example.

This should currently work just fine. I guess we could extend the documentation about this approach somewhere.


One thing that's annoying, though, is that the full api needs to be re-created in the v2 schema to be able to expose all paths.

If we could use incremental versioning instead, we could create only those objects that changed in the new schema and still rely on the old v1 objects for those that didn't change - so something like this:

CREATE VIEW v1.clients AS ...;
CREATE VIEW v1.projects AS ...;

-- we need to update `projects` in a breaking way, so we do:
CREATE VIEW v2.projects AS ...;

-- now, if we set the search_path properly, we can do:
SET search_path TO v2, v1;

SELECT * FROM clients; -- uses v1.clients
SELECT * FROM projects; -- uses v2.projects

This would be incremental versioning.

How can we achieve something like that?

We were discussing several ways of storing our own metadata about database objects inside the database (via comment, via some table + utility functions, ...). If we were to store something like extends: v1 for the v2 schema, we could still use the same syntax we currently use in db-schemas = v2 - just when querying for the schema cache, we could extend our search for exposable objects to that extended schema - recursively.
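
To make that concrete, here's a rough sketch of what such a recursive lookup could do at schema-cache load time - purely hypothetical, assuming the extends chain v2 -> v1 has already been read from the stored metadata: for every relation name, pick the definition from the highest-priority schema in the chain.

-- hypothetical sketch: resolve the effective relation set for v2,
-- assuming the chain v2 -> v1 was already resolved from the stored metadata
WITH chain(schema_name, priority) AS (
  VALUES ('v2', 1), ('v1', 2)
)
SELECT DISTINCT ON (c.relname)
       n.nspname AS schema_name,
       c.relname AS relation_name
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
JOIN chain ch ON ch.schema_name = n.nspname
WHERE c.relkind IN ('r', 'v', 'm', 'f', 'p')
ORDER BY c.relname, ch.priority;
-- in the example above: clients resolves to v1.clients, projects to v2.projects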

steve-chavez commented 2 years ago

Right now, we support multiple schemas through Accept-Profile and Content-Profile headers. However, the strict separation between schemas in this case makes the profile headers more useful for multi-tenancy setups than for (incremental) versioning.

Very true. We might as well map different Accept-Profile values to different databases and do database-based multi-tenancy.

One thing that's annoying, though, is that the full api needs to be re-created in the v2 schema to be able to expose all paths.

SET search_path TO v2, v1;

Yeah, duplicating db objects is really bad. I also explored the search_path idea on https://github.com/PostgREST/postgrest/issues/1106, but IIRC we couldn't just leave the schema resolution to pg because the search_path also implicitly includes pg_catalog, which meant that the API would expose things like pg_roles.
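
(For reference, a quick way to see that implicit inclusion - assuming the v1/v2 schemas from the example above exist:)

-- pg_catalog is part of the effective search path even when it is not listed
SET search_path TO v2, v1;
SHOW search_path;              -- v2, v1
SELECT current_schemas(true);  -- {pg_catalog,v2,v1}
-- so an unqualified name like this resolves against the system catalogs:
SELECT * FROM pg_roles;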

How can we achieve something like that? We were discussing several ways of storing our own metadata about database objects inside the database (via comment, via some table + utility functions, ...).

I remember that (on https://github.com/PostgREST/postgrest/issues/2028#issuecomment-973832177), so it would be like:

COMMENT ON VIEW v2.tasks IS $$
---
pgrst:
  extends: v1
  date: 2022-02-24  
$$;

With dates perhaps we can also do something similar to Stripe versioning.

wolfgangwalther commented 2 years ago

Very true. We might as well map different Accept-Profile values to different databases and do database-based multi-tenancy.

The Profile headers are still annoying to use - because of the different headers for GET and POST, etc.

I also think that, since the headers describe the format of the body, they are not really suited for multi-tenancy. Multi-tenancy is about selecting different resources - not the same resources in a different body format.

Looking at it from a http perspective, I think the thing that maps to different databases the best is the hostname in the request. We are not using that at all, yet.

Schemas are best mapped to a path prefix, imho.

Yeah, duplicating db objects is really bad. I also explored the search_path idea on #1106, but IIRC we couldn't just leave the schema resolution to pg because the search_path also implicitly includes pg_catalog, which meant that the API would expose things like pg_roles.

Hm. Maybe we can use the search_path only at schema cache creation time?

I remember a while back we discussed somewhere else exposing all kinds of routes only explicitly - not just RPCs, but also tables/views. I.e. if it's not in the schema cache, it will throw an error. If we did that, we could use the search path strategy to find the exposed endpoints for the schema cache. So the schema cache for v2 would contain everything from v2 and v1.
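
A rough sketch of that idea - hypothetical, just to illustrate letting Postgres do the fall-through once while the schema cache is built, assuming the chain v2 -> v1:

BEGIN;
-- set the fall-through order only for the schema cache query
SET LOCAL search_path TO v2, v1;
-- to_regclass() resolves unqualified names through the search path,
-- so names that are not overridden in v2 fall through to v1
SELECT to_regclass('clients');   -- v1.clients
SELECT to_regclass('projects');  -- v2.projects
COMMIT;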

I remember that (on #2028 (comment)), so it would be like:

COMMENT ON VIEW v2.tasks IS $$
---
pgrst:
  extends: v1
  date: 2022-02-24  
$$;

Uh, I wanted to put the metadata/comment on the SCHEMA, not each individual object. Adding a comment for all database objects would be quite cumbersome, I think.

With dates perhaps we can also do something similar to Stripe versioning.

In this case you would need a new schema for each change, to have a new date associated with it. If you do that, you don't need v1 as the schema name anymore; instead you'd do something like this:

-- (using a shorter syntax for comments for clarity)
-- ...
CREATE SCHEMA "2022-02-24";
COMMENT ON SCHEMA "2022-02-24" IS $$extends: 2022-02-23$$;
CREATE SCHEMA "2022-02-25";
COMMENT ON SCHEMA "2022-02-25" IS $$extends: 2022-02-24$$;
ALTER ROLE v2 SET pgrst.db_schemas TO '2022-02-25, 2022-02-24, 2022-02-23';
-- ...

So basically every change is a new schema, but only for (major) breaking changes you'd create a new role, which maps to a new path prefix via reverse proxy.

And then, with what we have right now, you could use the profile headers to select the sub-version / schema to use for that request.

steve-chavez commented 1 year ago

Schemas are best mapped to a path prefix, imho.

I've realized that path-based schemas do not conform to REST, because schemas are an implementation detail (namespaces) that can change; if that happens, it would break the URI.

wolfgangwalther commented 1 year ago

I've realized that path-based schemas do not conform to REST, because schemas are an implementation detail (namespaces) that can change; if that happens, it would break the URI.

It's a matter of definition what constitutes an implementation detail and what doesn't. Once you make schemas part of the URI, they are not an implementation detail anymore, of course. But that's not a problem. Schema names can change - but so can table names or column names. And those are not implementation details, either.

You can turn it around and look at it from the http perspective: Assume you'd like to create two separate endpoints with the same name. That's not possible. You need a namespace for that. So you're going to use.... a path prefix as a namespace. This is done everywhere.

So a path prefix is nothing else than a namespace, too. Schemas are a great match.

jingsam commented 1 month ago

Why not just simply use schema-namespaced table names, like this:

curl "http://localhost:3000/public.actors?select=v1.roles(character,v2.films(title,year))"

Schema-namespaced table names have the following benefits:

  1. It's intuitive. Postgres uses the same notation to reference tables.
  2. It's expressive. It makes it possible to embed resources that join across schemas but have the same table name, e.g. joining v1.role = v2.role.
  3. It's easy. No more Accept-Profile or Content-Profile - in some non-browser clients it is hard to set headers, and some network environments block infrequently used headers by default.
  4. It's versionable. We could simply use /v1.actors -> /v2.actors.
  5. It's compatible. This is the most important! It does not break the current schema negotiation mechanism. If the table name is not schema-namespaced, Accept-Profile or Content-Profile determines the schema. If it is schema-namespaced, the schema takes precedence over Accept-Profile or Content-Profile. (Of course, the schemas should be whitelisted with db-schemas for security reasons.)
cboecking commented 1 month ago

Hi Team,

I hope this comment is relevant to the thread. If not, please ignore...

I plan to use separate schemas (and therefore separate instances of postgrest connected to the same db instance) for each major version change. Here are the details.

  • I plan to use semantic versioning (major.minor.patch) where patch changes are unannounced, minor changes are announced with automatic in-place deployment, and major (breaking) changes are released as an entirely new schema.

  • I will need to support major versions simultaneously for some period of time. For me, a major version is associated with long-term support (LTS).

  • I will use nginx to map a port to a uri prefix (port 3000 maps to some.com/v1/some_table and port 3001 maps to some.com/v2/some_table)

  • I do not understand the comment: "Yeah, duplicating db objects is really bad." Does anyone use the same codebase to try to support multiple versions?

What I like about this approach:

Note:

I hope this helps! Chuck

wolfgangwalther commented 2 weeks ago

Why not just simply use schema-namespaced table names, like this:

curl "http://localhost:3000/public.actors?select=v1.roles(character,v2.films(title,year))"

Not sure whether using . would give us conflicts in some cases. But we could use / instead:

curl "http://localhost:3000/public/actors?select=v1/roles(character,v2/films(title,year))"

This should work equally well... if we actually want to do it, see below.

  1. It's intuitive. Postgres uses the same notation to reference tables.

And thus has problems with composite type's field accessors, so you need to do stuff like (table.composite_col).field etc.

We use . for different purposes, too, so might have similar problems.

  2. It's expressive. It makes it possible to embed resources that join across schemas but have the same table name, e.g. joining v1.role = v2.role.

Right - the question is whether we want to allow that at all or not. If we use incremental versioning, as proposed in the opening post, then everything that is not changed from v1 will also be available in v2, essentially. This allows you to still use schemas for something other than just versioning, without allowing cross-references in that case.

  3. It's easy. No more Accept-Profile or Content-Profile - in some non-browser clients it is hard to set headers, and some network environments block infrequently used headers by default.

Fully agree.

  4. It's versionable. We could simply use /v1.actors -> /v2.actors.

That's the same for /v1/actors?... -> /v2/actors?....

  5. It's compatible. This is the most important! It does not break the current schema negotiation mechanism. If the table name is not schema-namespaced, Accept-Profile or Content-Profile determines the schema. If it is schema-namespaced, the schema takes precedence over Accept-Profile or Content-Profile. (Of course, the schemas should be whitelisted with db-schemas for security reasons.)

Well, it breaks the current use-case of naming your table "a.b", i.e. with a dot inside. The approach with / doesn't break that, because we don't support slashes in table names to begin with, I think.
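
(For illustration - a table with a dot in its name is perfectly valid in Postgres and, as far as I can tell, is exposed under its literal name today; the api schema here is just a placeholder:)

-- hypothetical example of a relation name containing a dot,
-- which would clash with a schema.table syntax in the URL
CREATE TABLE api."a.b" (id int PRIMARY KEY);
-- currently reachable as /a.b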

wolfgangwalther commented 2 weeks ago

I plan to use separate schemas (and therefore separate instances of postgrest connected to the same db instance) for each major version change. Here are the details.

  • I plan to use semantic versioning (major.minor.patch) where patch changes are unannounced, minor changes are announced with automatic in-place deployment, and major (breaking) changes are released as an entirely new schema.

  • I will need to support major versions simultaneously for some period of time. For me, a major version is associated with long-term support (LTS).

  • I will use nginx to map a port to a uri prefix (port 3000 maps to some.com/v1/some_table and port 3001 maps to some.com/v2/some_table)

Sounds like a sensible approach.

  • I do not understand the comment: "Yeah, duplicating db objects is really bad." Does anyone use the same codebase to try to support multiple versions?

If you're not managing the multiple versions in the same codebase, how are you going to make changes to the underlying data layer, i.e. the tables, then?

If you make changes to the data tables, you will very often also need to make changes to the old api version's code to keep it working (even without introducing breaking changes). So you can't just deploy an older commit / checkout of the api in a separate schema - you'll need to keep the whole older api version in your current commit.
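
For example - with hypothetical table and column names, just to illustrate the point:

-- the underlying table changes: "name" is split into two columns
ALTER TABLE data.clients
  ADD COLUMN first_name text,
  ADD COLUMN last_name text;
-- (data migration elided)

-- the old v1 view must be re-created to keep its contract working
-- on top of the new layout ...
CREATE OR REPLACE VIEW v1.clients AS
  SELECT id, first_name || ' ' || last_name AS name FROM data.clients;

-- ... only then can the old column go away
ALTER TABLE data.clients DROP COLUMN name;

-- v2 exposes the new shape; objects that didn't change need no v2 copy
CREATE VIEW v2.clients AS
  SELECT id, first_name, last_name FROM data.clients;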

Once you do that, you'll want to avoid duplicating every api object, but only those that actually changed. Otherwise you have a lot of duplicate code to manage.