electric-sql / electric

Sync little subsets of your Postgres data into local apps and services.
https://electric-sql.com
Apache License 2.0

Implement PG identifier parsing algorithm #1814

Open kevin-dp opened 1 week ago

kevin-dp commented 1 week ago

As proposed by @alco in https://github.com/electric-sql/electric/issues/1764#issuecomment-2391485961:

I would argue against going to Postgres for name lookup. Replicating Postgres' identifier parsing is much simpler because it can be implemented as a pure function. Postgres is open source, and it's not like its definition of what an identifier is will change any time soon: no DB queries or caching required.

I have to admit that Postgres' definition of the identifier token (as specified in its lexer code) is a bit cryptic: it explicitly mentions the ranges A-Z, a-z and byte values in the range \200-\377 (those are octal numbers; in decimal, 128 to 255), but I think the intent there is to allow any valid UTF-8-encoded code point beyond 7-bit ASCII, so not only letters. For example, this works in psql:

=# create table fooΒ‘(id text);
CREATE TABLE
=# create table Ξ”(id text);
CREATE TABLE
=# create table πŸͺ·(id text);
CREATE TABLE
=# \d
        List of relations
 Schema β”‚ Name β”‚ Type  β”‚  Owner
────────┼──────┼───────┼──────────
 public β”‚ fooΒ‘ β”‚ table β”‚ postgres
 public β”‚ Ξ”    β”‚ table β”‚ postgres
 public β”‚ πŸͺ·   β”‚ table β”‚ postgres

There is, apparently, a separate validation step to determine that the given byte sequence is valid UTF-8 (or valid in whatever other encoding the database server is configured to use), because not all byte sequences are valid identifiers.
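For what it's worth, under the assume-UTF-8 simplification that check is trivial in Elixir (a sketch, not necessarily what Electric will do):

String.valid?("πŸͺ·")            #=> true:  well-formed UTF-8
String.valid?(<<0xC3, 0x28>>)  #=> false: 0xC3 opens a two-byte sequence, but 0x28 is not a continuation byte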

Regardless, the algorithm for lower-casing identifiers is very small and straightforward. Here's its full implementation - https://github.com/postgres/postgres/blob/259a0a99fe3d45dcf624788c1724d9989f3382dc/src/backend/parser/scansup.c#L55-L73

So it is conceivable that we can implement our own validation of user-provided table identifiers by checking that they are valid UTF-8 strings and don't contain certain ASCII characters. Then we can normalize each identifier by implementing Postgres' downcasing algorithm, obtaining a string that we can use both to fetch table info from Postgres and as a cache key for storing table info in memory.
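To make that concrete, here is a minimal Elixir sketch of the downcase-and-truncate step for a multibyte encoding such as UTF-8. The module and function names are hypothetical; this is not the actual Electric.Postgres.Identifier implementation:

defmodule PGIdentifier do
  # Hypothetical sketch of Postgres' downcase_truncate_identifier for a
  # multibyte server encoding such as UTF-8: only ASCII A-Z are lowered,
  # and bytes with the high bit set (i.e. multibyte UTF-8 sequences) pass
  # through untouched, matching the scansup.c code linked above.

  @namedatalen 64  # Postgres' compile-time default; names are capped at 63 bytes

  def downcase_truncate(ident) when is_binary(ident) do
    if not String.valid?(ident), do: raise(ArgumentError, "identifier is not valid UTF-8")

    downcased =
      for <<byte <- ident>>, into: <<>> do
        if byte in ?A..?Z, do: <<byte + (?a - ?A)>>, else: <<byte>>
      end

    truncate(downcased)
  end

  # Truncate to NAMEDATALEN - 1 bytes without splitting a multibyte
  # character, which is what pg_mbcliplen takes care of in Postgres.
  defp truncate(ident) when byte_size(ident) < @namedatalen, do: ident

  defp truncate(ident) do
    ident
    |> String.codepoints()
    |> Enum.reduce_while(<<>>, fn cp, acc ->
      if byte_size(acc) + byte_size(cp) < @namedatalen,
        do: {:cont, acc <> cp},
        else: {:halt, acc}
    end)
  end
end

PGIdentifier.downcase_truncate("FooΞ”") then yields "fooΞ”": only the ASCII letters are lowered, which is exactly why the unquoted Ξ” and πŸͺ· tables in the psql session above keep their names verbatim.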

msfstef commented 5 days ago

@alco @kevin-dp the Postgres algorithm depends on the server encoding as well - should we grab the encoding in the metadata acquisition stage, or just default to UTF8 to start?

msfstef commented 5 days ago

I've implemented a basic Electric.Postgres.Identifier.parse in https://github.com/electric-sql/electric/pull/1829 to serve the immediate needs of that feature.

Postgres makes use of its server encoding and its max identifier length configuration, so we'd need to retrieve those as well as part of our "prep" stage and pass them along to the parsing routine. I will address that in a separate PR; it probably overlaps with @alco's work on putting all prep queries in the replication connection.
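Both values are exposed as read-only server settings, so the prep stage could fetch them with plain SHOW statements. A hypothetical sketch assuming a Postgrex connection (not the actual prep code):

# Hypothetical: fetch the settings the parser needs during connection prep.
{:ok, conn} = Postgrex.start_link(database: "postgres")

%Postgrex.Result{rows: [[encoding]]} = Postgrex.query!(conn, "SHOW server_encoding", [])
%Postgrex.Result{rows: [[max_len]]} = Postgrex.query!(conn, "SHOW max_identifier_length", [])

# e.g. encoding == "UTF8" and max_len == "63" (NAMEDATALEN - 1)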

alco commented 5 days ago

@alco @kevin-dp the Postgres algorithm depends on the server encoding as well - should we grab the encoding in the metadata acquisition stage, or just default to UTF8 to start?

In the spirit of "any complex system that works has evolved from a simpler system that worked", let's start by assuming UTF8.