Custom media types + conversion for `Content-Type`

steve-chavez commented 1 year ago

Problem

Currently we allow handling raw input for application/json, text/xml, text/plain, application/octet-stream on functions(ref).

However it's inflexible, no other custom media types can be handled. It also doesn't work for tables/views.

Solution

Use a domain-based media type as a function unnamed parameter to do custom conversion. Unlike ref this can work for both functions and tables/views.

Functions interface

CREATE DOMAIN "application/json" as json;
CREATE DOMAIN "application/octet-stream" as bytea;

CREATE FUNCTION mult_them("application/json") RETURNS int AS $$
  SELECT ($1->>'x')::int * ($1->>'y')::int
$$ LANGUAGE SQL;

CREATE FUNCTION upload_binary("application/octet-stream") RETURNS void AS $$
  INSERT INTO files(blob) VALUES ($1);
$$ LANGUAGE SQL;

Tables/views interface

CREATE FUNCTION convert_projects("application/json") RETURNS projects AS $$
-- ...
$$ LANGUAGE SQL;

CREATE FUNCTION convert_projects("text/plain") RETURNS projects AS $$
-- ...
$$ LANGUAGE SQL;

CREATE FUNCTION convert_projects("text/xml") RETURNS projects AS $$
-- ...
$$ LANGUAGE SQL;

e.g. the first function will be used when doing:

POST /projects 
Content-Type: application/json

This means we'll use the unnamed parameter + return type to apply a conversion function for a particular table/view.

Inside the function a json_to_recordset can be used (this is what we do internally). With this interface the user can override our default behavior.

Thanks to overloaded functions the same function name can be used to handle different media types.

Notes

Consider dropping Functions with a single unnamed parameter after solving this. Advantages:
- This new interface is more intuitive and less magical.
- Simplify error messages.
- Simplify searching the rpc function.
- It will also be consistent with https://github.com/PostgREST/postgrest/pull/2825.
- The forward migration is relatively simple - changing a parameter type.
The domain-based media types idea comes from https://github.com/PostgREST/postgrest/pull/1582 and https://github.com/PostgREST/postgrest/pull/2523. (long threads) .

I did not have the time, yet, to follow along what exactly has been implemented for Accept already. Am I right that currently it's only possible to say RETURNS "media/type" for functions using a domain media/type? So we don't support any of the CAST idea, yet, but all the pieces are in place to be able to extend that with casts later on?

Assuming the above is correct, then:

The functions interface suggested in this issue seems like exactly the same thing, just for input via Content-Type. It will be extensible the same way with CASTs. :+1:
However, the proposed table/view interface looks different. In which schema are those functions defined? If they are defined in the exposed schema, then what's the difference between those functions and a regular RPC ("function interface style")? If they are not in the exposed schema, how do we detect them?

The answer to those questions in the second paragraph is: Casts. Those functions are effectively implementing a cast, so we should create proper casts for them and detect them via casts. Then they can be located anywhere, don't conflict with RPCs, etc.

I would suggest to do the function interface first, and then extend later on.

steve-chavez commented 10 months ago

Am I right that currently it's only possible to say RETURNS "media/type" for functions using a domain media/type? So we don't support any of the CAST idea, yet, but all the pieces are in place to be able to extend that with casts later on?

Yes (for aggregates too (https://github.com/PostgREST/postgrest/pull/2825#issue-1758144649)) and yes.

I would suggest to do the function interface first, and then extend later on.

Agree, that sounds good.

steve-chavez commented 8 months ago

The answer to those questions in the second paragraph is: Casts. Those functions are effectively implementing a cast, so we should create proper casts for them and detect them via casts. Then they can be located anywhere, don't conflict with RPCs, etc.

Oh, thinking more about the above. Wouldn't the interface for media type handlers be simpler too if it's defined via CASTs?

So the example on https://postgrest.org/en/stable/references/api/media_type_handlers.html#handlers-for-tables-views would be:

create or replace function twkb_handler_transition (state bytea, next lines)
returns bytea as $$
  select state || st_astwkb(next.geom);
$$ language sql;

create or replace aggregate twkb_agg (lines) (
  initcond = ''
, stype = bytea
, sfunc = twkb_handler_transition
);

CREATE OR REPLACE FUNCTION twkb(lines) RETURNS "application/vnd.twkb" AS $$
  select twkb_agg($1);
$$ LANGUAGE SQL IMMUTABLE;

CREATE CAST (lines AS "application/vnd.twkb") WITH FUNCTION twkb(lines) AS IMPLICIT;

That way the aggregate is an internal detail and the media type is not at all involved in its definition. I believe it's much more clear despite having to define one more function too.

@wolfgangwalther WDYT? This would be a breaking change on current media type handlers, but not for functions only for tables (which doesn't seem to be used that much yet).

steve-chavez commented 8 months ago

Damn, the above makes sense but the current query doesn't work:

WITH pgrst_source AS ( SELECT "test"."lines".* FROM "test"."lines")
SELECT
    null::bigint AS total_result_set,
    pg_catalog.count(_postgrest_t) AS page_total,
    twkb(_postgrest_t::"test"."lines") AS body,
    nullif(current_setting('response.headers', true), '') AS response_headers,
    nullif(current_setting('response.status', true), '') AS response_status,
    '' AS response_inserted
FROM ( SELECT * FROM pgrst_source ) _postgrest_t;

ERROR:  column "_postgrest_t.id" must appear in the GROUP BY clause or be used in an aggregate function
LINE 5:     twkb(_postgrest_t::"test"."lines") AS body,

And an aggregate cannot be used as cast function:

CREATE CAST (lines AS "application/vnd.twkb") WITH FUNCTION twkb_agg(lines) AS IMPLICIT;

ERROR:  cast function must be a normal function

steve-chavez commented 8 months ago

The only way I see to make the CAST abstraction work is to scan the body of the CREATE FUNCTION twkb and apply the internal function ourselves. Too bad that pg can't see that there's an aggregate inside the SQL function.

steve-chavez commented 8 months ago

and apply the internal function ourselves

Maybe that's not too bad, we can apply the function body directly like:

WITH pgrst_source AS ( SELECT "test"."lines".* FROM "test"."lines")
SELECT
    null::bigint AS total_result_set,
    pg_catalog.count(_postgrest_t) AS page_total,
    (select twkb_agg(_postgrest_t::"test"."lines")) AS body,
    nullif(current_setting('response.headers', true), '') AS response_headers,
    nullif(current_setting('response.status', true), '') AS response_status,
    '' AS response_inserted
FROM ( SELECT * FROM pgrst_source ) _postgrest_t;

That also lets apply functions on top of the aggregation, which is not possible with the current interface. This makes it possible to do select coalesce(json_agg($1),[])->0 to override application/vnd.pgrst.object.

The any handler interface would have to change though, since the aggregate final function would no longer be used for setting the response headers.

wolfgangwalther commented 8 months ago

The "correct" way to do this, would be to use array_agg(...) all the time, once you hit a CAST, so something like this:

-- note the argument type is now an array
CREATE OR REPLACE FUNCTION twkb(lines[]) RETURNS "application/vnd.twkb" AS $$
  -- whatever you need to do to the array
$$ LANGUAGE SQL IMMUTABLE;

CREATE CAST (lines[] AS "application/vnd.twkb") WITH FUNCTION twkb(lines[]) AS IMPLICIT;

WITH pgrst_source AS ( SELECT "test"."lines".* FROM "test"."lines")
SELECT
    null::bigint AS total_result_set,
    pg_catalog.count(_postgrest_t) AS page_total,
    twkb(array_agg(_postgrest_t::"test"."lines"))) AS body,
    nullif(current_setting('response.headers', true), '') AS response_headers,
    nullif(current_setting('response.status', true), '') AS response_status,
    '' AS response_inserted
FROM ( SELECT * FROM pgrst_source ) _postgrest_t;

That means the user can either provide custom aggregates, probably mostly useful when implemented in C for performance (i.e. pass through the full response once instead of twice with the cast). Or they can provide the cast, in which case we still do the aggregation for them beforehand, but in a type-preserving way via array_agg(). This way you can still create different casts for different endpoints.

steve-chavez commented 8 months ago

The "correct" way to do this, would be to use array_agg(...) all the time, once you hit a CAST, so something like this:

How would that work? I'm guessing like:

CREATE OR REPLACE FUNCTION twkb(lines[]) RETURNS "application/vnd.twkb" AS $$
  select twkb_agg(x) from unnest($1) x;
$$ LANGUAGE SQL IMMUTABLE;

Wouldn't it be wasteful to aggregate the array (twkb(array_agg(_postgrest_t::"test"."lines"))) and unnest it again?

AFAICT, operations over arrays on pg always require unnesting, there's no array.map() function (ref).

That means the user can either provide custom aggregates, probably mostly useful when implemented in C for performance. Or they can provide the cast

Hm, ideally handling both content-type and accept would use the same interface with CASTs. Otherwise this complicates the docs and the implementation.

Currently it doesn't look good how we mix the media type with the aggregate definition.

wolfgangwalther commented 8 months ago

Wouldn't it be wasteful to aggregate the array (twkb(array_agg(_postgrest_t::"test"."lines"))) and unnest it again?

AFAICT, operations over arrays on pg always require unnesting, there's no array.map() function (ref).

What makes you think that unnest/agg is less efficient than a map function if that was implemented? I have no idea whether it's inefficient, but unnest/agg just seems to be idiomatic SQL code. So certainly not bad. Also note that this is a special case for a cast function written in SQL. For example the function given in https://github.com/PostgREST/postgrest/issues/3160#issue-2082417765 would be much more efficient with array_agg. Currently it uses a poor-man's aggregation via json/jsonb, but that's just because that's the only way to do it right now. An aggregation via array_agg would be much better - and then the python code doesn't care how it receives the data.

Hm, ideally handling both content-type and accept would use the same interface with CASTs.

Agreed. But the example in https://github.com/PostgREST/postgrest/issues/2826#issue-1758188980 is insufficient:

CREATE FUNCTION convert_projects("application/json") RETURNS projects AS $$
-- ...
$$ LANGUAGE SQL;

This would only ever allow a single project to be returned. What if the json document contained multiple projects? I'm quite sure that a cast can't return a SET OF. So the only thing you could return would be projects[] - and then PostgREST would have to unnest that to insert etc.

This is exactly the opposite of array_agg.

So basically:

Casts should work on arrays. Or rather: If a cast takes or returns an array, we automagically add array_agg / unnest respectively. In the future, we could add chains of casts, too, as discussed earlier in connection with data reps. Those would not need to be on arrays.
The function interface can return SET OF. The corresponding output (i.e. "how to transform a set into a scalar?") are.. aggregates.

Very symmetric.

PostgREST / postgrest