PostgREST / postgrest

REST API for any Postgres database
https://postgrest.org
MIT License
22.65k stars 1k forks source link

Improvements to query language #2066

Open wolfgangwalther opened 2 years ago

wolfgangwalther commented 2 years ago

We have discussed a few shortcomings of our current query language (expressions in the query string). Taking those suggestions together - and adding a few more points, that seem to be inconsistent to me in the current implementation.

I wonder whether we should try to implement all those improvements as non-breaking separated changes - or just invent a query language v2, where we can chose via config option which language to use. Maybe #1804 can lead us to have a cleanly separated QueryLanguage module that we can replace with a new module.

Here we go:

This would leave only filters (identified by at least one . before the =), arguments (without .) and logic= query parameters, reducing the amount of "reserved keywords" by a fair bit.

Using parameters to all kinds of things etc. is also much more consistent with what the RFCs allow for those, instead of cramping everything into the query string.


Not sure whether everything in here can really be done. I haven't found a good solution for filters / limits / offsets on embedded resources, yet. It doesn't make much sense to move those to the accept header - but keeping them in the query string would create a strange interaction between query string and accept header once again.

One idea, that I haven't fully thought through, yet, could be to split the embeddings from the select part. If the embeddings were defined in the query string, but the select= in the accept header would reference those, that would be fine, I think. Having an interaction in this direction between header and query string would make sense, imho.

Embedding other resources really changes the entity we are requesting, too - so they should not be moved to the header. What about using the flexibility the path specification of a URI gives us for embedding?

GET /people,jobs

GET /people,jobs;jobs_fk?age.eq=15&jobs.title.eq=Boss
Accept: application/json;select=name,age,jobs.title

This would also play nicely with the ;single or ;one parameter for requesting a single entity, which is kind of similar to a hint of relationship type to embedd !m2m, !o2m, etc.:

GET /people;one,jobs;o2m?...

They both do kind of the same thing in returning an object instead of an array.

steve-chavez commented 2 years ago

I like many of the ideas from an HTTP standpoint, but they'd be such a big breaking change(consider all client libraries we have now) that we might as well change our name if we did them. Many users will wonder why they should not just try to migrate to a GraphQL/OData(which are behind standards) solution if we do such a change.

or just invent a query language v2

Yes, this is why it's important to document our REST syntax(ref), maybe we could do a RESTQL/RSQL/RESQL V2 if we'd ever adopt such a change. But even if we were to support both our v1 and v2 at the same time, I think that would still create doubts about our stability.

So I think before doing any major breaking changes to our REST syntax, we should offer extensibility(https://github.com/PostgREST/postgrest/issues/1909#issuecomment-952375549) to provide a way to be compatible with other standards(GraphQL/Odata).


Some comments:

GET /people?uid.eq=123 Accept: application/json;select=*

This looks better from a REST/HTTP standpoint, but there are many crippled http clients out there that can't send headers, we would leave those out or require even more use cases for proxies.

GET /people,jobs;jobs_fk?age.eq=15&jobs.title.eq=Boss

Ah, isn't ; required to be urlencoded? I'm not sure, but it seems too likely that we'd be creating impedance with http clients out there, causing similar problems to the ones our current or/and operators generated. Also, IMHO ; somehow looks bad in the URL path.

GET /people?exists=jobs!job_fk

This one looks good and is actionable, will take it to a new issue.

wolfgangwalther commented 2 years ago

Ah, isn't ; required to be urlencoded? I'm not sure, but it seems too likely that we'd be creating impedance with http clients out there, causing similar problems to the ones our current or/and operators generated. Also, IMHO ; somehow looks bad in the URL path.

No, they are specifically allowed for that purpose: https://datatracker.ietf.org/doc/html/rfc3986#section-3.3

Aside from dot-segments in hierarchical paths, a path segment is considered opaque by the generic syntax. URI producing applications often use the reserved characters allowed in a segment to delimit scheme-specific or dereference-handler-specific subcomponents. For example, the semicolon (";") and equals ("=") reserved characters are often used to delimit parameters and parameter values applicable to that segment. The comma (",") reserved character is often used for similar purposes. For example, one URI producer might use a segment such as "name;v=1.1" to indicate a reference to version 1.1 of "name", whereas another might use a segment such as "name,1.1" to indicate the same. Parameter types may be defined by scheme-specific semantics, but in most cases the syntax of a parameter is specific to the implementation of the URI's dereferencing algorithm.

wolfgangwalther commented 2 years ago

I like many of the ideas from an HTTP standpoint, but they'd be such a big breaking change(consider all client libraries we have now) that we might as well change our name if we did them. Many users will wonder why they should not just try to migrate to a GraphQL/OData(which are behind standards) solution if we do such a change.

I certainly don't want to propose a general breaking change in the API here. I think there is value in "providing the most HTTP-standard compliant interface" in itself, but also many of those proposals solve actual problems we have.

Yes, this is why it's important to document our REST syntax(ref), maybe we could do a RESTQL/RSQL/RESQL V2 if we'd ever adopt such a change. But even if we were to support both our v1 and v2 at the same time, I think that would still create doubts about our stability.

So I think before doing any major breaking changes to our REST syntax, we should offer extensibility(#1909 (comment)) to provide a way to be compatible with other standards(GraphQL/Odata).

If we offer an extensible interface for the query language and have both our V1 and V2 to make use of that interface / prove that it's working well - I don't see us creating doubts about stability. It's the other way around, I think. It increases confidence in a stable non-breaking interface, because new features, that would imply breaking changes can much easier be developed while keeping backwards compatibility in other versions of our language.

steve-chavez commented 2 years ago

Accept: application/json;select=lastname,firstname

Now that we're closer to having XML support for tables, the select above seems redundant for each mimetype.

I was wondering if there's a more standard way for embedded resources. I've been looking at HAL, but its verbosity makes me think it's not adequate for big result sets. Also users would need HAL-specific clients to really use the data.

Following the line of thought of changing the path syntax and that embedding resources have a different type, I thought of representing the result of an embed as a different resource - GET /usersEmbedClients - but that of course wouldn't work for many embeds and tree-structured data.

An embed query parameter looks more "standard", if we'd adopt it I guess we could do:

GET /projects?select=id,name,clients(name,tasks(name))

As

GET /projects?embed=clients&clients.embed=tasks&select=id,name&clients.select=name&clients.tasks.select=name

Which has the disavantage of being more verbose and more importantly it would lose the ability of having an order for the fields in the resulting body(since proxies don't guarantee the order of the query parameters IIRC).

With the http QUERY method, IMO the http working group likely has given up on representing embedded data cleanly within the URL syntax restrictions(there's even SQLish examples in the spec). With that, it feels futile trying to improve our embedded resource syntax.

I do agree that changing the operator syntax would be a win though.

steve-chavez commented 1 year ago

Once https://github.com/PostgREST/postgrest/issues/1970#issuecomment-1382613004 is implemented, I think we could support both the old and new filter syntax without a breaking change.

Change operator syntax from col=op.val to col.op=val

Looking at the parsers I think it'd be easier to do op.col=val than col.op=val.

col=val would be eq.col=val implicitly.

steve-chavez commented 1 year ago

Some ideas..

Embedding other resources really changes the entity we are requesting, too - so they should not be moved to the header. What about using the flexibility the path specification of a URI gives us for embedding? GET /people,jobs GET /people,jobs;jobs_fk?age.eq=15&jobs.title.eq=Boss Accept: application/json;select=name,age,jobs.title

Doesn't embedding just add attributes to the entity? You still do a FROM base_entity JOIN other on SQL after all. We could just say an embedding is a different representation of a resource(table).

Edit: Totally misread the RFC. It was related to media type parameters not query parameters. Still the new media type below might be worth considering.

Also RFC 2046 says:

Parameters are modifiers of the media subtype, and as such do not fundamentally affect the nature of the content. The set of meaningful parameters depends on the media type and subtype.

So the query parameters depend on the media type, we don't need to put the select on the header. Also, omitting columns from select doesn't fundamentally change the nature of the content - it's still json, we're just leaving some fields as 'undefined'.

Considering the above, it sounds possible to document the embedding possibilities for a resource. In OpenAPI we can list the Accept media types. Using films in resource embedding, we could say it accepts:

paths:
  /films:
    get:
      ...
          content:
            application/vnd.pgrst.related+json; rel=films>-directors
            application/vnd.pgrst.related+json; rel=films-<nominations
            application/vnd.pgrst.related+json; rel=films-<roles
            application/vnd.pgrst.related+json; rel=films>-<actors
            application/vnd.pgrst.related+json; rel=films>-<competitions

>- = many-to-one. -< = one-to-many. >-< many-to-many. one-to-one could be --.

So say we get the following request(with nesting):

GET /films?select=*,roles(*, actors(*)),directors(*)

We would respond with our usual json, but with the added header:

Content-Type: application/vnd.pgrst.related+json; rel=films-<roles,films>-directors,roles>-actors

The gain here is that we could document the relationships a resource has. Which was previously impossible in OpenAPI.

wolfgangwalther commented 1 year ago

Considering the above, it sounds possible to document the embedding possibilities for a resource. In OpenAPI we can list the Accept media types. Using films in resource embedding, we could say it accepts:

This would only make sense if the embedding were to take place via accept header / mediatype. I don't think that would be correct. Adding an embedding will return a new entity - it will not merely change the return format. That means it does not belong in the header.

The gain here is that we could document the relationships a resource has. Which was previously impossible in OpenAPI.

Really what I suggested above about embeddings and the path component would be the way forward. Just requesting people is different from just requesting jobs is different from requesting a combination of both. If we had something like GET /people,jobs, then it would also be possible to add those new paths to the OpenAPI output.

erickedji commented 1 year ago

Change operator syntax from col=op.val to col.op=val

As another datapoint, this also helps when user input is passed to a stored procedure via GET.

Currently, query=foobar is processed correctly for CREATE FUNCTION search(query TEXT). But with query=in.foobar, Postgrest no longer interprets query as a function param, but as a filter.

There doesn't seem to be a simple way to force the function param interpretation (quoting required preprocessing in the stored procedure to remove the quotes).

steve-chavez commented 11 months ago

Adding an embedding will return a new entity - it will not merely change the return format. That means it does not belong in the header.

Just noticed that even the Prefer: exact=count hints that the above is true. The count is different when embedding is done (see https://github.com/PostgREST/postgrest/issues/2009 and https://github.com/PostgREST/postgrest/issues/2846). So indeed it's a new entity and doesn't belong in the header, makes sense because people,jobs can be represented in json, csv or other media types.

GET /people,jobs;jobs_fk?age.eq=15&jobs.title.eq=Boss

This would also play nicely with the ;single or ;one parameter for requesting a single entity, which is kind of similar to a hint of relationship type to embedd !m2m, !o2m, etc.: GET /people;one,jobs;o2m?...

Now that we're moving on from doing disambiguation on the URL(https://github.com/PostgREST/postgrest/issues/2863), the syntax can be much simpler so we don't need the FK or the cardinality in the path. So we could just go with GET /people,jobs.


There's a problem in expressing nested embedding on the path, since it needs to shape the JSON output. For this I believe we should keep the select parameter.

GET /actors,roles,films?select=roles(character,films(title,year))

Or later as a media type parameter too:

GET /actors,roles,films

Accept: application/json;select=roles(character,films(title,year))

If we had something like GET /people,jobs, then it would also be possible to add those new paths to the OpenAPI output.

Yes, and we could do this now without a breaking change.


Another syntax:

GET /users(projects,tasks(subtasks))

That would allow us to express joins on resources.

steve-chavez commented 10 months ago

Adding an embedding will return a new entity - it will not merely change the return format.

If doing an embedding (a foreign key join) creates a new entity, what about an aggregate like SUM, AVG, etc? I mean it has the potential to reduce a REST collection to a single resource.

I'm mostly trying to see if we could do away without adding GROUP BY.

Been checking project m36 aggregates and they say:

SQL supports aggregate queries using aggregate functions (SUM(), COUNT(), AVERAGE(), etc.) and GROUP BY while the aggregate functions in the relational algebra simply accept relation values as arguments.

But I haven't had the chance to play it with yet (their nix-shell took ages), so not sure if they don't really need GROUP BY.

steve-chavez commented 10 months ago

PRQL has a good approach for this (ref): aggregates don't change the number of rows by default but only add a new column using window functions:

https://prql-lang.org/book/reference/stdlib/transforms/aggregate.html#aggregate-is-required

This will prevent failure from clients whenever they forget to add GROUP BY.


So I think the above answers the previous question:

If doing an embedding (a foreign key join) creates a new entity, what about an aggregate like SUM, AVG, etc? I mean it has the potential to reduce a REST collection to a single resource.

An aggregate doesn't have to change the amount of rows. We can say that GROUP BY is a filter in a way.

steve-chavez commented 10 months ago

Another improvement for the query language: use ~ for not. It's shorter/clearer (likely easier for the parsers) and it doesn't require urlencoding: https://perishablepress.com/stop-using-unsafe-characters-in-urls/

/a?col.~eq=3