elixir-ecto / postgrex

PostgreSQL driver for Elixir
http://hexdocs.pm/postgrex/

Expose PG wire binary encoder for use in Postgrex.stream/4 encoding #390

Closed tsutsu closed 6 years ago

tsutsu commented 6 years ago

Let's say you're writing an ETL pipeline with an Elixir process as the "transform" step and Postgres as the target of the "load" step. This sounds like a job for Postgrex.stream(conn, "COPY foo FROM STDIN", [])! Ideally, you want to take rows of data that you're generating in Elixir, and stream them right into the Postgrex.Stream.t.

Right now, this isn't directly supported in any way. All the examples of bulk-loading with Postgrex.stream/4 as a destination stream from a file—presumably one generated by a previous batch operation, e.g. COPY foo TO STDOUT. In other words, it's always a case where "someone else" handled the actual encoding of rows into a format the Postgres COPY command can consume.
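For reference, the file-based pattern looks roughly like this (adapted from the Postgrex docs; pid is an existing connection and the file name is illustrative):

```elixir
# Existing pattern: the data on disk is already in COPY text format (e.g.
# produced earlier by COPY posts TO STDOUT), so Postgrex never encodes anything.
Postgrex.transaction(pid, fn conn ->
  stream = Postgrex.stream(conn, "COPY posts FROM STDIN", [])
  Enum.into(File.stream!("posts.copy"), stream)
end)
```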

There's no good way, AFAIK, to generate that format in Elixir. (Work is going on in Ecto to support Ecto.Adapters.SQL.Stage, I believe? This may end up allowing for feeding Ecto schema structs directly to the Ecto GenStage consumer. But if you don't have an Ecto schema—if you aren't even using Ecto—this won't help you.)

You can get partway there by parameterizing the call as COPY foo FROM STDIN WITH (FORMAT csv) and then using an Elixir CSV-dumping library to feed in your data. But even then, the resulting CSV isn't quite what Postgres expects—you end up needing a wrapper module like this one to fix all the eccentricities of the format, at which point you may as well be writing your own encoder.
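A sketch of that workaround, assuming NimbleCSV as the CSV-dumping library (table and rows are illustrative); note that it does nothing about NULLs, quoting, or the other eccentricities mentioned above:

```elixir
# Naive CSV workaround: dump rows with a CSV library and pipe them into the
# COPY stream. This runs, but NULL handling and quoting still don't match what
# COPY ... WITH (FORMAT csv) expects without further options and wrapping.
rows = [[1, "hello"], [2, "wor,ld"]]

Postgrex.transaction(pid, fn conn ->
  stream = Postgrex.stream(conn, "COPY foo FROM STDIN WITH (FORMAT csv)", [])

  rows
  |> NimbleCSV.RFC4180.dump_to_stream()
  |> Stream.map(&IO.iodata_to_binary/1)
  |> Enum.into(stream)
end)
```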

I feel like Postgrex itself should expose some way to encode data being fed to COPY foo FROM STDIN (and decode data produced by COPY foo TO STDOUT). Postgrex already does transparent encoding/decoding between Elixir types and Postgres types for Postgrex.execute/4 binding parameters and results, so it would make sense for it to handle this call at a similar "level of abstraction"—either transparently as well (maybe with an encoding option that can be passed to Postgrex.stream) or just by exposing a module that can perform this encoding in a simple, intuitive manner.
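For comparison, this is the level of abstraction parameter binding already gives you today, and what COPY streaming currently lacks (conn is an existing connection):

```elixir
# Parameters and results are already transparently encoded/decoded...
Postgrex.query!(conn, "SELECT $1::bigint + 1", [41])
#=> %Postgrex.Result{rows: [[42]], ...}

# ...whereas anything collected into a COPY FROM STDIN stream must already be
# in a COPY-compatible format; Postgrex passes it through untouched.
```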

For the encoding itself, there are several options. Postgres currently supports three COPY encodings: csv, text, and binary. Postgres's csv encoding is messy (as seen above), and supporting it would essentially require postgrex to either reimplement or depend on a CSV library. Postgres's text encoding is an entirely custom format that still carries the overhead of stringification.

On the other hand, Postgres's binary encoding, a.k.a. the PGCOPY format described here, is a binary container format wrapping the regular binary send/recv wire encoding of Postgres types—which Postgrex already knows how to encode/decode. This has many advantages: it has low on-the-wire size; it can be generated by building iodata from existing binaries with no processing or copying; and it doesn't require writing any new encoding logic for the primitives of the format, only for the (very simple) container format. (Assuming, that is, that you're putting this logic in Postgrex. Since Postgrex doesn't directly expose its wire encoding primitives for use by other libraries, a third-party library implementing the PGCOPY format would have to duplicate a lot of Postgrex's work!)
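To make the "very simple container format" point concrete, here is a minimal sketch of the PGCOPY framing (following the COPY BINARY description in the Postgres docs); each field is assumed to already be a binary in its type's send format, which is exactly the part Postgrex doesn't expose today:

```elixir
defmodule PGCopySketch do
  # Minimal sketch of the PGCOPY container framing only. Fields passed to
  # tuple/1 must already be binaries in the column type's binary send format
  # (or nil for NULL).
  @signature <<"PGCOPY\n", 0xFF, "\r\n", 0>>

  # Header: 11-byte signature, 32-bit flags field, 32-bit header-extension length.
  def header, do: [@signature, <<0::32, 0::32>>]

  # One tuple: 16-bit field count, then per field a 32-bit byte length
  # (-1 for NULL) followed by the field's bytes.
  def tuple(fields) do
    [<<length(fields)::16>> |
     Enum.map(fields, fn
       nil -> <<-1::signed-32>>
       bin -> [<<byte_size(bin)::32>>, bin]
     end)]
  end

  # Trailer: a 16-bit word holding -1.
  def trailer, do: <<-1::signed-16>>
end
```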

It would make sense to me to approach this problem like so:

  1. move the encoding and decoding of primitive Postgres types in Postgres's binary wire format out into public functions in a separate module (perhaps even its own library, e.g. postgrex_wire);
  2. create a module Postgrex.PGCOPY (which could be part of Postgrex, or its own library, with both postgrex and postgrex_pgcopy depending on postgrex_wire) that presents an interface of conventional functions such as decode(!)/1,2, encode_to_iodata/1, encode_to_stream/1, etc.;
  3. (optional) modify Postgrex.stream to take an encoder or decoder option (a necessary generalization of the current decode_mapper option, given that the PGCOPY format requires a header and trailer); see the sketch just after this list. Postgrex.PGCOPY would be a valid encoder or decoder. Other modules, e.g. PostgresCSV, could also be implemented for use here if users want them. (These other modules would be necessary when interfacing with only semi-PG-wire-compatible DBMSes like CockroachDB or Aurora Postgres.)
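As a sketch of what point 3 might look like in use (the :decoder option and Postgrex.PGCOPY are hypothetical; unlike decode_mapper, a decoder module also has to deal with the header and trailer framing):

```elixir
# Hypothetical usage -- the :decoder option and Postgrex.PGCOPY don't exist today.
Postgrex.transaction(pid, fn conn ->
  conn
  |> Postgrex.stream("COPY foo TO STDOUT WITH (FORMAT binary)", [],
       decoder: Postgrex.PGCOPY)
  |> Enum.to_list()
end)
```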
josevalim commented 6 years ago

It feels like we could simply support format: :binary as a stream option?
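Something along these lines (hypothetical, none of it exists today):

```elixir
# Hypothetical: plain Elixir rows in, Postgrex does the COPY binary encoding.
Postgrex.transaction(pid, fn conn ->
  stream =
    Postgrex.stream(conn, "COPY foo FROM STDIN WITH (FORMAT binary)", [],
      format: :binary)

  Enum.into([[1, "hello"], [2, "world"]], stream)
end)
```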

tsutsu commented 6 years ago

Yes, that would be the simplest interface.

Exposing a Postgrex.PGCOPY module directly would also allow the format to be used apart from its "online" usage. That is:

A distinct Postgrex.PGCOPY encoder module would also form the basis for a potential Postgrex.PGDump.Custom encoder module, given that (AFAIK) the pg_dump --format=custom format is a binary container that itself contains these PGCOPY binaries. This would allow for the implementation of many utilities around PG database maintenance/imports/exports. (For example, entire datasets could be generated "in-place" in pg_dump --format=custom format, ready to be passed off to a DB admin for importing with pg_restore.)

This is why I suggest moving the wire-encoding functionality out of postgrex—these use-cases aren't really relevant to the goals of postgrex itself, but they (ideally as separate libraries) necessarily share this common PG wire-format encoding logic with postgrex.

josevalim commented 6 years ago

So we should just expose the encoding/decoding from postgrex into something usable. A little bit tricky to do in practice, because of Postgres Extensions, but certainly doable.
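For context on the extensions part: type encoders and decoders are compiled into a per-application types module via Postgrex.Types.define/3, so any exposed encode/decode API would have to be parameterized by such a types module rather than being a flat set of functions. Roughly the mechanism as it is used today (module and extension names are illustrative, e.g. for PostGIS support):

```elixir
# Compile a types module from a list of extensions; the connection is then
# started against this compiled module, which owns the per-type encoders.
Postgrex.Types.define(MyApp.PostgrexTypes, [Geo.PostGIS.Extension], [])
```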

tsutsu commented 6 years ago

An additional wrinkle I forgot to mention the first time: for a row to be encodable into the PGCOPY binary format, the encoder needs to be aware of the exact PG types of the destination table (e.g. whether a column is smallint vs integer vs bigint; text vs bytea; json vs jsonb). Unlike in the wire protocol, it is not enough to encode values as "something losslessly castable to the column's type", because Postgres won't perform any casting during a COPY.
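To illustrate, the binary send representations differ even for values that would cast cleanly between these types:

```elixir
# The Elixir integer 1 has three different binary send representations,
# depending on the declared column type -- and COPY will not cast between them:
<<1::16>>   # smallint (int2): 2 bytes
<<1::32>>   # integer  (int4): 4 bytes
<<1::64>>   # bigint   (int8): 8 bytes
```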

Let me break down what that implies:

Encoding

The encoder would need to be passed the Postgres "row type" (the definition of the composite type created alongside a table, i.e. the Postgres-internal types of all the columns) of the table being encoded against, in order to generate valid PGCOPY data targeting said table.

This could be handled transparently if the encoder is passed an active Postgrex connection, and retrieves the row type for the table using the connection. But I think it would be quite limiting if it was only possible to generate PGCOPY files "online"—especially if the database the data is targeting hasn't even been stood up yet, and this data generation step is part of that stand-up procedure.
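The "online" lookup itself is simple enough; a sketch, assuming an active connection (row_type/2 is a hypothetical helper, and the query ignores schema qualification):

```elixir
defmodule RowTypeSketch do
  # Hypothetical helper: ask the server for the destination table's column
  # names and internal type names, in column order.
  def row_type(conn, table) do
    %Postgrex.Result{rows: rows} =
      Postgrex.query!(
        conn,
        """
        SELECT column_name, udt_name
        FROM information_schema.columns
        WHERE table_name = $1
        ORDER BY ordinal_position
        """,
        [table]
      )

    Enum.map(rows, fn [name, type] -> {name, type} end)
  end
end
```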

However, in an "offline" scenario, since you can't get the row-type definition from the DB, you need the user to pass it in explicitly. This exposes "internal" type distinctions that thus far haven't needed to be exposed to the user, which perhaps makes this a bad API to have. (See thoughts on possible Ecto-layer solutions for this below.)
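In that offline case, the call would have to look something like this (entirely hypothetical API, with the internal types spelled out by the caller):

```elixir
# Entirely hypothetical API: the caller supplies the internal column types
# because there is no connection to ask.
rows = [[1, [<<0, 1, 2, 3>>], "hi", nil, ""]]

iodata =
  Postgrex.PGCOPY.encode_to_iodata(rows,
    row_type: [col1: :int8, col2: {:array, :bytea}, col3: :text, col4: :text, col5: :text]
  )

File.write!("foo.pgcopy", iodata)
```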

Decoding

None of the raw dumped COPY TO STDOUT formats contain enough data to reconstruct the types from the data stream. (While the CSV format has a header, this merely specifies the names of the target columns, not their types.)

For online decoding, this is fine—the decoder can, like the encoder, require an active Postgrex connection.

If offline decoding is a desired use-case, though, we have more options than for offline encoding. We can still accept a row-type definition directly; but we can also accept file formats (and have offline encoding generate these formats) that are "SQL-wrapped" in a way that embeds the row type.

For example, the encoder could be passed something like format: {:sql, :csv}, which would result in a file looking like:

COPY TO public.foo (col1 bigint, col2 bytea[], col3 text, col4 text, col5 text) WITH (FORMAT csv, HEADER false, NULL '\N', DELIMITER E'\t', QUOTE '''');
1   \x00010203  hi  \N  ''
\.

The dump above embeds the row type in a way the decoder could parse out, obviating the need to pass row-type information to the decoder.

And this "SQL-wrapping" works just as well with the PGCOPY binary format as the payload, as it does with the CSV or text formats.

In all, such "SQL-wrapped" dumps are very useful formats to dump to. I'd suggest that format: {:sql, :binary} would even be a good default format.

(Side-note: notice how, in the above "SQL-wrapped" dump file, particular formatting parameters are being forced by Postgrex rather than specified by the user. This means that, specifically in the CSV use-case, a CSV encoder module can be written that is specialized for the particular formatting parameters that allow for a lossless, bijective encoding. Without the ability to force formatting parameters, a lossless encoding can't be guaranteed, because e.g. NULL fields and empty-string fields can't be reliably differentiated.)

Hiding the Complexity of Offline Use-cases with Ecto

I think Ecto could be used to paper over the dirtiness of requiring the user to be aware of DBMS-internal types in the offline use-case: users could be told that the idiomatic way to do this is to create an Ecto.Schema for their table and then use some new higher-level API like Ecto.Repo.dump_to_stream, which acts like Ecto.Repo.insert_all but streams out the result as iodata.
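Usage might then look something like this (nothing here exists in Ecto today; the Repo, schema, and option names are all made up):

```elixir
# Hypothetical Ecto-level API: the schema supplies the type information the
# encoder needs, and the result is streamed out as iodata.
post_maps = [%{title: "hello", body: "world"}]

MyApp.Repo.dump_to_stream(MyApp.Post, post_maps, format: {:sql, :binary})
|> Enum.into(File.stream!("posts.dump"))
```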

Supporting this would itself require some "interesting" changes to Ecto, though. Specifically, Ecto would need some level of offline awareness of the DBMS's current row types. I can imagine a few approaches for this, all of them probably more trouble than they're worth.

Of course, if one were already extracting these type hints, one could extract other table metadata at the same time, e.g. NOT NULL field annotations, PRIMARY KEY table annotations, UNIQUE constraint specifiers, REFERENCES annotations, etc. These could be used for some interesting code generation in Ecto projects, perhaps even to the point of having no explicit hand-rolled schema definition, and less changeset-related code, in Ecto.Schema source files at all.

That's its own whole project, though. 😛

josevalim commented 6 years ago

Just a heads up that if somebody wants to expose encoding/decoding as its own API, we would welcome the change, but it is not something we plan to work on ourselves.

josevalim commented 6 years ago

I will go ahead and close this. As I said, we will be glad to discuss those changes, but it is not something we plan to tackle ourselves. Thanks!