brianc / node-postgres

PostgreSQL client for node.js.
https://node-postgres.com

Automatically parsing subqueries as composite type #1801

Open mitar opened 5 years ago

mitar commented 5 years ago

So in PostgreSQL one can do something like:

SELECT _id, body, (SELECT posts FROM posts WHERE posts._id=comments.post_id) AS post FROM comments

This returns post as a composite type. Currently, this package returns:

{ _id: 'TWjntjispSBvbZQtv',
    body: { title: 'Comment title 0' },
    post: '(XKP3PeHoDRW2a7hFi,"{""title"": ""Post title 0""}")' }

Ideally, it should return something like:

{ _id: '5B4gBiAnhGY64W4GR',
    body: { title: 'Comment title 0' },
    post:
     { _id: 'yDwGwJ6x5Qc8HhotD', body: { title: 'Post title 0' } } }

I know I can achieve this with a query like:

SELECT _id, body, (SELECT to_json(posts) FROM posts WHERE posts._id=comments.post_id) AS post FROM comments

But frankly, I am not sure why I would want to convert it to JSON and back just to get it over the wire in the correct structure.

I have seen some other issues about composite types, but from my short exploration this seems doable. PostgreSQL exposes the necessary information, though obtaining it might require additional queries, which I believe could be cached.

I tried the following script:

const util = require('util');

const {Pool} = require('pg');

const CONNECTION_CONFIG = {
  user: 'postgres',
  database: 'postgres',
  password: 'pass',
};

const pool = new Pool(CONNECTION_CONFIG);

(async () => {
  await pool.query(`
    CREATE OR REPLACE FUNCTION random_id() RETURNS TEXT LANGUAGE SQL AS $$
      SELECT array_to_string(
        array(
          SELECT SUBSTRING('23456789ABCDEFGHJKLMNPQRSTWXYZabcdefghijkmnopqrstuvwxyz' FROM floor(random()*55)::int+1 FOR 1) FROM generate_series(1, 17)
        ),
        ''
      );
    $$;
    DROP TABLE IF EXISTS comments;
    DROP TABLE IF EXISTS posts;
    CREATE TABLE posts (
      _id CHAR(17) PRIMARY KEY DEFAULT random_id(),
      body JSONB NOT NULL DEFAULT '{}'::JSONB
    );
    CREATE TABLE comments (
      _id CHAR(17) PRIMARY KEY DEFAULT random_id(),
      post_id CHAR(17) NOT NULL REFERENCES posts(_id),
      body JSONB NOT NULL DEFAULT '{}'::JSONB
    );
    DELETE FROM comments;
    DELETE FROM posts;
  `);

  let result;
  for (let i = 0; i < 5; i++) {
    result = await pool.query(`
      INSERT INTO posts (body) VALUES($1) RETURNING _id;
    `, [{'title': `Post title ${i}`}]);

    const postId = result.rows[0]._id;

    for (let j = 0; j < 10; j++) {
      await pool.query(`
        INSERT INTO comments (post_id, body) VALUES($1, $2);
      `, [postId, {'title': `Comment title ${j}`}]);
    }
  }

  result = await pool.query(`
    SELECT _id, body, (SELECT posts FROM posts WHERE posts._id=comments.post_id) AS post FROM comments
  `);

  console.log(util.inspect(result.rows, {depth: null}));

  result = await pool.query(`
    SELECT attname, atttypid FROM pg_catalog.pg_attribute LEFT JOIN pg_catalog.pg_type ON (pg_attribute.attrelid=pg_type.typrelid) WHERE pg_type.oid=$1 AND attisdropped=false AND attnum>=1 ORDER BY attnum
  `, [result.fields[2].dataTypeID]);

  console.log(util.inspect(result.rows, {depth: null}));

  await pool.end();
})();

Results:

[ { _id: 'BgCnoip8HiNt5Lant',
    body: { title: 'Comment title 0' },
    post: '(hjZPGf2ykyeYgFhCe,"{""title"": ""Post title 0""}")' },
  ...,
  { _id: 'Rh3cc3qGHN7v3Seq3',
    body: { title: 'Comment title 9' },
    post: '(zLLeSQFyt844ZX2a6,"{""title"": ""Post title 4""}")' } ]
[ { attname: '_id', atttypid: 1042 },
  { attname: 'body', atttypid: 3802 } ]

As you can see, it is possible to get both the names and the type IDs of the values inside post. With some recursion, this could be converted, no? Information on how to parse the tuple itself is here. And then we would have to parse each individual element.
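
To make the idea concrete, here is a rough, untested sketch of what I have in mind (none of this exists in node-postgres; splitComposite and parseComposite are made-up helpers). It splits the tuple literal by hand and feeds each raw field to the parser pg already uses for that field's type OID:

const { types } = require('pg');

// Split a composite literal like (a,"b ""c""",,d) into raw field strings (null for empty fields).
// Loosely follows the row-value text format described in the PostgreSQL docs; not battle-tested.
function splitComposite(text) {
  const fields = [];
  let current = '';
  let quoted = false;
  let sawAny = false;
  for (let i = 1; i < text.length - 1; i++) {  // skip the outer parentheses
    const ch = text[i];
    if (quoted) {
      if (ch === '"') {
        if (text[i + 1] === '"') { current += '"'; i++; }  // doubled quote inside quotes
        else quoted = false;
      } else if (ch === '\\') { current += text[++i]; }     // backslash escape
      else current += ch;
    } else if (ch === '"') { quoted = true; sawAny = true; }
    else if (ch === ',') {
      fields.push(sawAny || current.length ? current : null);  // empty unquoted field = NULL
      current = ''; sawAny = false;
    } else { current += ch; sawAny = true; }
  }
  fields.push(sawAny || current.length ? current : null);
  return fields;
}

// attrs is the result of the pg_attribute query above,
// e.g. [{attname: '_id', atttypid: 1042}, {attname: 'body', atttypid: 3802}].
function parseComposite(text, attrs) {
  const raw = splitComposite(text);
  const out = {};
  attrs.forEach((attr, i) => {
    out[attr.attname] = raw[i] === null
      ? null
      : types.getTypeParser(attr.atttypid, 'text')(raw[i]);
  });
  return out;
}

Recursing would then amount to calling parseComposite again whenever an atttypid itself turns out to refer to another composite type.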

noinkling commented 5 years ago

I'd love to see this happen, but it seems like a lot of work due to having to query the system catalog (at what point do you do this? do you do it lazily or eagerly? do you do it automatically or require manual registration?), cache that information, and ensure the cache doesn't become stale if things change. Related discussion for a .NET driver that implemented it: npgsql/npgsql#441
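
For what it's worth, the lazy variant could start as nothing more than a memoized catalog lookup keyed by type OID. A rough sketch (getCompositeAttrs is a hypothetical helper, not an existing API), which of course still leaves the invalidation problem open:

const attrCache = new Map();  // typeOid -> [{attname, atttypid}, ...]

// Only hits pg_attribute the first time a given composite OID is seen.
async function getCompositeAttrs(pool, typeOid) {
  if (!attrCache.has(typeOid)) {
    const { rows } = await pool.query(
      `SELECT attname, atttypid
         FROM pg_catalog.pg_attribute
         JOIN pg_catalog.pg_type ON pg_attribute.attrelid = pg_type.typrelid
        WHERE pg_type.oid = $1 AND NOT attisdropped AND attnum >= 1
        ORDER BY attnum`,
      [typeOid]
    );
    attrCache.set(typeOid, rows);  // becomes stale if the table is later ALTERed
  }
  return attrCache.get(typeOid);
}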

It's a shame the nested type information isn't provided by the protocol itself...

Edit: here's a link to the relevant queries/business logic in that npgsql library, if it helps:

Another edit: this is how it seems to be done in Python land, much easier to follow:

noinkling commented 5 years ago

Some previous discussions for posterity:

@vitaly-t's abandoned attempt at a minimal parser (only gets the individual values as strings):

Another library in another language (Ruby this time) that's able to handle them:

I'm hoping that if we pool together enough information we can figure out a decent solution, which at this point I assume would live in an external package.

vitaly-t commented 5 years ago

@noinkling You are looking at it the wrong way. The main reason I abandoned my attempts was that the tuple format is now considered legacy, sort of obsolete, whereas JSON+JSONB is the new recommended approach. This is where the PostgreSQL team has been focusing since v9 of the server, making JSON as fast as possible. Tuples are history now. You should avoid using them.

noinkling commented 5 years ago

@vitaly-t JSON only supports four primitive types: strings, a single JS-style "number" type, booleans and null. If you have data using a type that doesn't match one of those (the timestamp/date types, for example), you lose type information and the ability for the library to automatically parse those values (via pg-types). You're left with having to do your own parsing on a query-by-query basis (since you need to know the structure of the result), or using a hacky and complicated workaround like manually outputting types in the result.
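
A minimal illustration of the point (a sketch, run inside an async function with a pool as in the script above): the same timestamptz comes back as a Date when returned directly, but as a plain string once it has been through to_json:

const { rows } = await pool.query('SELECT now() AS plain, to_json(now()) AS wrapped');
console.log(rows[0].plain instanceof Date);  // true  – pg-types parses timestamptz (OID 1184)
console.log(typeof rows[0].wrapped);         // 'string' – inside JSON it is just text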

And even ignoring that, there's the fact that casting to JSON in Postgres can have less than ideal results in some edge cases. For example:

=# SELECT original, to_json(original) FROM (VALUES (timestamptz '20000-01-01'), (timestamptz '500-01-01 BC')) v (original);

         original          │            to_json
───────────────────────────┼────────────────────────────────
 20000-01-01 00:00:00+00   │ "20000-01-01T00:00:00+00:00"
 0500-01-01 00:00:00+00 BC │ "0500-01-01T00:00:00+00:00 BC"

Believe it or not, those aren't valid ISO 8601 date strings:

https://en.wikipedia.org/wiki/ISO_8601#Years

To represent years before 0000 or after 9999, the standard also permits the expansion of the year representation but only by prior agreement between the sender and the receiver. An expanded year representation [±YYYYY] must have an agreed-upon number of extra year digits beyond the four-digit minimum, and it must be prefixed with a + or − sign instead of the more common AD/BC (or CE/BCE) notation; by convention 1 BC is labelled +0000, 2 BC is labeled −0001, and so on.

In JS:

> new Date("20000-01-01T00:00:00+00:00")
Invalid Date
> new Date("0500-01-01T00:00:00+00:00 BC")
Invalid Date

In contrast, the date parser written for this library already supports the original formats fine (but not the JSON ones). In order to use that parser, you could keep the original format in JSON by casting to text first, but suffice it to say that it can make certain queries significantly more complicated/verbose/ugly, and you still have the first issue.
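
A sketch of that cast-to-text trick, reusing the timestamptz example above (nothing here is a node-postgres feature; it just hands the string to the parser pg-types already registers for timestamptz, OID 1184, inside an async function with the pool from earlier):

const { types } = require('pg');
// Casting to text first keeps Postgres's native timestamp format inside the JSON string.
const { rows } = await pool.query(
  `SELECT to_json(original::text) AS wrapped
     FROM (VALUES (timestamptz '20000-01-01')) v (original)`
);
// Per the comment above, this format is one the library's date parser already handles.
const date = types.getTypeParser(1184, 'text')(rows[0].wrapped);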

If composite types/tuples/row values/records (whatever they're called) were parsed properly I wouldn't have to worry about any of that.

mitar commented 5 years ago

the tuple format is now considered a legacy, sort-of obsolete, whereas JSON+JSONB is the new recommended approach

Can you provide any support for this claim? I have not seen that anywhere.

Moreover, JSON does not support many types of values. For example, Infinity and NaN cannot be transported in standard JSON (and PostgreSQL emits strict standard JSON).

And this is where PostgreSQL team has been focusing after v9 of the server, to make JSON as fast as possible.

That is true. JSON is really fast. What is more, parsing JSON on the node side is much faster than anything else, because it is so heavily optimized in node.js. I made some benchmarks to confirm that.

On the other hand, parsing JSON is recursive, and you do not know what you are getting until you get it, so you have to scan and then convert. With the tuple approach from PostgreSQL, you have all the information about the structure in advance, so at least in theory you should be able to parse the object directly, knowing what is where, and just map values without first scanning to determine what each token is. I have seen this done in some projects, but I cannot find a reference now.

Moreover, if messaging were done in something like Cap'n Proto, parsing would take zero time, instead of the current approach of copying from one in-memory representation to another.

mitar commented 5 years ago

I tried building something that would also parse embedded types automatically, but I got stuck on the case where a composite type is created implicitly inside the query, for example when you select just a subset of fields from a table and nest those results into a larger query. How do you obtain information about which fields were selected? Does anyone know of another implementation that does this? In other words, how do you obtain typing information for the whole structure? PostgreSQL should have that internally, shouldn't it?
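
A quick way to see the problem (a sketch, run inside an async function with the pool from the earlier script): an implicit ROW(...) is reported as the generic record pseudo-type, so there is nothing to look up in the catalog for it:

const res = await pool.query('SELECT ROW(1, now()) AS r');
console.log(res.fields[0].dataTypeID);
// 2249, the anonymous "record" pseudo-type – its pg_type.typrelid is 0,
// so the pg_attribute query from earlier returns no rows for this OID.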

See my question on the mailing list here: https://www.postgresql.org/message-id/flat/CAKLmikMrm778-eETLvVAd1W_u0R8TB%2BsuAFO6jhMTmXQg3yhGg%40mail.gmail.com

FbN commented 2 years ago

On the other hand, parsing JSON is recursive, and you do not know what you are getting until you get it, so you have to scan and then convert. With the tuple approach from PostgreSQL, you have all the information about the structure in advance, so at least in theory you should be able to parse the object directly, knowing what is where, and just map values without first scanning to determine what each token is. I have seen this done in some projects, but I cannot find a reference now.

I regularly use DBeaver with Postgres (a Java app), and that client resolves and displays tuples in the right way.

For example:

select customer from customer limit 2

[screenshot: DBeaver renders the customer tuple with its fields resolved into columns]

Running the same query on node-postgres instead gives:

[
  {
    customer: '(72ef86cf-2f38-4848-9f26-808b61963b0f,Rey,Schinner,"Regional Research Facilitator",Usability,)'
  },
  {
    customer: '(45a67e2f-fe57-459b-af7b-7284ddb9e47c,Evelyn,Harris,"Corporate Branding Director",Directives,)'
  }
]
mitar commented 2 years ago

Hm, this simple query might work. But what about something like SELECT _id, body, (SELECT array_agg(ROW(comments._id, comments.body)) FROM comments WHERE comments.post_id=posts._id) AS comments FROM posts, i.e. when you generate ad-hoc types in the results?

FbN commented 2 years ago

I think it is correct that an "ad-hoc" type stays unparsed, e.g.:

select
    c.id,
    c."date",
    (
    select
        array_agg(row(cr.id, cr.quantity))
    from
        cart_row cr
    where
        cr.cart_id = c.id) as r
from
    cart c

[screenshot: DBeaver leaves the ad-hoc row() value unparsed]

But if I use the table tuple, I get:

select
    c.id,
    c."date",
    (
    select
        array_agg(cr)
    from
        cart_row cr
    where
        cr.cart_id = c.id) as r
from
    cart c

[screenshot: DBeaver resolves the cart_row tuples into their fields]

If you need a type different from the table's, I think you have to declare it and then cast, e.g.:

CREATE TYPE partial_cart_row as (
    "id" uuid,
    "quantity" smallint
);
select
    c.id,
    c."date",
    (
    select
        array_agg(row(cr.id, cr.quantity)::partial_cart_row)
    from
        cart_row cr
    where
        cr.cart_id = c.id) as r
from
    cart c

[screenshot: with the cast to partial_cart_row, the tuples are resolved again]

boromisp commented 2 years ago

I've started working on a library to automatically parse values of custom types (domains, composites, enums, etc.) whenever possible based on the Postgres system catalog tables.

For now, if you want to parse a composite value, you could try this out: https://github.com/boromisp/postgres-composite. It can turn a composite literal into an array of literals.

FbN commented 2 years ago

https://github.com/boromisp/postgres-composite

How does it compare to https://github.com/vitaly-t/pg-tuple?

boromisp commented 2 years ago

They seem to be basically the same thing, with slightly different APIs. I hadn't found pg-tuple when I put together postgres-composite.

That said, I haven't yet tried postgres-composite with real database output, only run it through some synthetic tests.

FbN commented 2 years ago

That said, I haven't yet tried postgres-composite with real database output, only run it through some synthetic tests.

I drafted a simple parser based on postgres-composite:

https://github.com/FbN/pg-tuple-types

Then, based on pg-tuple-types, I implemented a simple Knex extension that helps with eager loading of related data in pg.