It feels like we could simply support `format: :binary` as a stream option?
Yes, that would be the simplest interface. Exposing a `Postgrex.PGCOPY` module directly would also allow the format to be used apart from its "online" usage. That is, a `.pgcopy` file could be:

- written out with `\copy foo TO 'foo.pgcopy' WITH (FORMAT binary)` in psql,
- loaded back with `\copy foo FROM 'foo.pgcopy' WITH (FORMAT binary)` in psql,
- or simply fed into `Postgrex.stream/4` without encoding in a separate stage, as desired (i.e. decoupling encode-time from load-time, which is helpful if both the transform and load stages are time-intensive and either might fail); see the sketch below.

A distinct `Postgrex.PGCOPY` encoder module would also form the basis for a potential `Postgrex.PGDump.Custom` encoder module, given that (AFAIK) the `pg_dump -F custom` format is a binary container that itself contains these `PGCOPY` binaries. This would allow for the implementation of many utilities around PG database maintenance/imports/exports. (For example, entire datasets could be generated "in-place" in `pg_dump -F custom` format, ready to be passed off to a DB admin for importing with `pg_restore`.)
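For instance, feeding an already-encoded file into a `COPY ... FROM STDIN` stream works today, since Postgrex just forwards the copy data it is given. A minimal sketch, assuming `pid` is an open Postgrex connection and `foo.pgcopy` was produced by the `\copy ... TO` command above:

```elixir
Postgrex.transaction(pid, fn conn ->
  copy = Postgrex.stream(conn, "COPY foo FROM STDIN WITH (FORMAT binary)", [])

  # The pre-encoded PGCOPY chunks are passed through to the server as-is;
  # no per-row encoding work happens in Elixir at load time.
  Enum.into(File.stream!("foo.pgcopy", [], 65_536), copy)
end)
```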
These are why I suggest moving the wire-encoding functionality out of `postgrex`—these aren't really relevant considerations to the goals of `postgrex` itself, but these use-cases (ideally separate libraries) necessarily share the need for this common PG wire-format encoding logic with `postgrex`.
So we should just expose the encoding/decoding from `postgrex` into something usable. It is a little bit tricky to do in practice, because of Postgres Extensions, but certainly doable.
An additional wrinkle I forgot to mention the first time: for a row to be encodable into the `PGCOPY` binary format, the encoder needs to be aware of the exact PG types of the destination table (e.g. whether a column is `smallint` vs. `integer` vs. `bigint`; or `text` vs. `bytea`; or `json` vs. `jsonb`). Unlike in the wire protocol, it is not enough to simply encode types as "something losslessly castable to the row's type", because Postgres won't perform any casting during a `COPY`.
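To make this concrete, here is a minimal sketch of one tuple in the `PGCOPY` binary format, following the tuple layout in the PostgreSQL `COPY` documentation: a 16-bit field count, then for each field a 32-bit byte length followed by that field's binary `send` representation. The same Elixir value `42` encodes differently depending on the destination column's declared type, which is why the encoder has to know it:

```elixir
# (PGCOPY integers are big-endian, which is Elixir's default for bitstrings.)

# One PGCOPY tuple holding the value 42, encoded for...

# ...a bigint (int8) column: 1 field, 8-byte payload
row_for_bigint = <<1::16, 8::32, 42::signed-64>>

# ...an integer (int4) column: 1 field, 4-byte payload
row_for_integer = <<1::16, 4::32, 42::signed-32>>

# The two encodings are not interchangeable: Postgres will not cast between
# them during COPY, so the encoder must know the destination column's type.
```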
Let me break down what that implies:
- The encoder would need to be passed the Postgres "row type" (the definition of the composite type created alongside a table, i.e. the Postgres-internal types of all the columns) of the table being encoded against, in order to generate valid `PGCOPY` data targeting said table.
- This could be handled transparently if the encoder is passed an active Postgrex connection, and retrieves the row type for the table using the connection. But I think it would be quite limiting if it were only possible to generate `PGCOPY` files "online"—especially if the database the data is targeting hasn't even been stood up yet, and this data generation step is part of that stand-up procedure.
- However, in an "offline" scenario, since you can't get the row-type definition from the DB, you need the user to pass it in explicitly. This exposes "internal" type distinctions to the user that thus far haven't needed to be exposed, which perhaps makes this a bad API to have. (See thoughts on possible Ecto-layer solutions for this below.)
- None of the raw dumped `COPY TO STDOUT` formats contain enough data to reconstruct the types from the data stream. (While the CSV format has a header, it merely specifies the names of the target columns, not their types.)
- For online decoding, this is fine—the decoder can, like the encoder, require an active Postgrex connection.
- If offline decoding is a desired use-case, though, we have more options than for offline encoding. We can still accept a row-type definition directly; but we can also accept file types (and have offline encoding generate these file types) that are "SQL-wrapped", in a way that embeds the row type.
For example, the encoder could be passed something like `format: {:sql, :csv}`, which would result in a file looking like:

```sql
COPY public.foo (col1 bigint, col2 bytea[], col3 text, col4 text, col5 text) FROM STDIN WITH (FORMAT csv, HEADER false, NULL '\N', DELIMITER E'\t', QUOTE '''');
1	\x00010203	hi	\N	''
\.
```
The dump above embeds the row type in a way the decoder could parse out (see the sketch below), obviating the need to pass row-type information to the decoder. It is also:

- a valid `.sql` file as far as `psql -f` is concerned; and
- something that can equally be fed to `Postgrex.stream!(conn, "", [])`.
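As a rough, purely illustrative sketch (not an existing Postgrex API) of how a decoder could parse the row type back out of such a wrapped header line:

```elixir
header =
  "COPY public.foo (col1 bigint, col2 bytea[], col3 text) FROM STDIN WITH (FORMAT csv);"

# Pull the parenthesised column list out of the COPY statement...
[_, cols] = Regex.run(~r/\(([^)]*)\)\s+FROM STDIN/, header)

# ...and split it into {column_name, postgres_type} pairs.
row_type =
  cols
  |> String.split(",")
  |> Enum.map(fn col ->
    [name, type] = String.split(String.trim(col), " ", parts: 2)
    {name, type}
  end)

# row_type == [{"col1", "bigint"}, {"col2", "bytea[]"}, {"col3", "text"}]
```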
And this "SQL-wrapping" works just as well with the `PGCOPY` binary format as the payload, as it does with the CSV or text formats.

In all, such "SQL-wrapped" dumps are very useful formats to dump to. I'd suggest that `format: {:sql, :binary}` would even be a good default format.
(Side-note: notice how, in the above "SQL-wrapped" dump file, particular formatting parameters are being forced by Postgrex, rather than having been specified by the user. This means that, specifically in the CSV use-case, a CSV encoder module can be written that is specialized for the particular formatting parameters that allow for a lossless, bijective encoding. Without the ability to force formatting parameters, a lossless encoding can't reliably be achieved, because e.g. NULL fields and empty string fields can't be reliably differentiated.)
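Putting the offline-encoding side together, a call with an explicitly supplied row type might look something like this (entirely hypothetical: `Postgrex.PGCOPY.encode_to_iodata/2`, its `:row_type` option, and the column names are all invented for illustration):

```elixir
rows = [
  [1, <<0, 1, 2, 3>>, "hi"],
  [2, nil, ""]
]

# Hypothetical: encode rows offline against an explicitly supplied row type,
# with no live connection involved.
iodata =
  Postgrex.PGCOPY.encode_to_iodata(rows,
    row_type: [id: :int8, payload: :bytea, note: :text]
  )

File.write!("rows.pgcopy", iodata)
```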
I think Ecto could be used to paper over the dirtiness of requiring the user to be aware of DBMS-internal types in an offline use-case, such that users could be told that the idiomatic way to do this would be to create an `Ecto.Schema` for their table and then use some new higher-level API like `Ecto.Repo.dump_to_stream` that acts like `Ecto.Repo.insert_all` but just streams out the result as iodata.
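A sketch of what calling such an API might look like (everything here is hypothetical: `MyApp.Repo`, `MyApp.Post`, the `dump_to_stream` function, and its options are all invented for illustration):

```elixir
posts = [
  %{title: "hello", views: 1},
  %{title: "world", views: 2}
]

# Hypothetical: encode the rows against MyApp.Post's schema (and the row type
# derived from it), producing a stream of PGCOPY iodata chunks written to disk.
MyApp.Repo.dump_to_stream(MyApp.Post, posts, format: :binary)
|> Stream.into(File.stream!("posts.pgcopy"))
|> Stream.run()
```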
Supporting this would, itself, require some "interesting" changes to Ecto, though. Specifically, Ecto would need to have some level of offline awareness of the DBMS's current row types. I can imagine a few approaches for this, all of them probably more trouble than they're worth:

- make `Ecto.Migration`s read back the created row type after creating or altering a table, and write the changes to a `priv/repo/current.schema` file (maybe using DETS?), much like Mnesia does;
- or, create an Elixir compiler unit for the `priv/repo/structure.sql` file created by `mix ecto.dump`, such that if it is available, a module is built from it which exposes the row types of each table; then modify `Ecto.Repo` to find and load such a "type hints" module; and then use the type hints directly in the logic of `Ecto.Repo.dump_to_stream`.
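As an illustration of the second approach, such a generated "type hints" module might look something like the following (entirely hypothetical; the module name, function, and table contents are invented):

```elixir
# Hypothetically generated at compile time from priv/repo/structure.sql.
defmodule MyApp.Repo.TypeHints do
  @moduledoc false

  # Ordered row type of a table, as {column_name, postgres_type} pairs.
  def row_type("public.foo") do
    [
      {"col1", :int8},
      {"col2", {:array, :bytea}},
      {"col3", :text},
      {"col4", :text},
      {"col5", :text}
    ]
  end

  def row_type(_other), do: nil
end
```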
Of course, if one was already extracting these type hints, you could extract other table metadata at the same time, e.g. `NOT NULL` field annotations, `PRIMARY KEY` table annotations, `UNIQUE` constraint specifiers, `REFERENCES` annotations, etc. These could be used to do some interesting code-generation for Ecto projects, perhaps even to the point of having no explicit hand-rolled schema definition part, and less changeset-related code, in `Ecto.Schema` source files at all.
That's its own whole project, though. 😛
Just a heads up that if somebody wants to expose encoding/decoding as its own API, we would welcome the change, but it is not something we plan to work on ourselves.
I will go ahead and close this. As I said, we will be glad to discuss those changes, but it is not something we plan to tackle ourselves. Thanks!
Let's say you're writing an ETL pipeline with an Elixir process as the "transform" step and Postgres as the target of the "load" step. This sounds like a job for `Postgrex.stream(conn, "COPY foo FROM STDIN", [])`! Ideally, you want to take rows of data that you're generating in Elixir, and stream them right into the `Postgrex.Stream.t`.

Right now, this isn't directly supported in any way. All the examples of bulk-loading with `Postgrex.stream/4` as a destination stream from a file—presumably one generated by a previous batch operation, e.g. `COPY foo TO STDOUT`. In other words, it's always a case where "someone else" handled the actual encoding of rows into a format the Postgres `COPY` command can consume.

There's no good way, AFAIK, to generate that format in Elixir. (Work is going on in Ecto to support `Ecto.Adapters.SQL.Stage`, I believe? This may end up allowing for feeding Ecto schema structs directly to the Ecto GenStage consumer. But if you don't have an Ecto schema—if you aren't even using Ecto—this won't help you.)

You can somewhat rely on parameterizing the call as `COPY foo FROM STDIN WITH (FORMAT csv)`, and then use an Elixir CSV-dumping library to feed in your data. But even then, the resulting CSV isn't quite correct according to Postgres's needs—you end up needing a wrapper module like this one to fix all the eccentricities of the format, at which point you may as well be writing your own encoder.
I feel like Postgrex itself should expose some way to encode data being fed to `COPY foo FROM STDIN` (and decode data produced by `COPY foo TO STDOUT`). Postgrex already does its own transparent encoding/decoding between Elixir types and Postgres types, in the form of `Postgrex.execute/4` binding parameters and results, so it would make sense to me for Postgrex to also handle encoding/decoding between Elixir types and Postgres types for this call at a similar "level of abstraction"—either transparently as well (maybe with an `encoding` option that can be passed to `Postgrex.stream`), or just by exposing a module that can be used to perform this encoding in a simple, intuitive manner.

For the encoding itself, there are several options. Postgres currently supports three `COPY` encodings: `csv`, `text`, and `binary`. Postgres's `csv` encoding is messy (as seen above), and supporting it would essentially require `postgrex` to either reimplement, or depend on, a CSV library. And Postgres's `text` encoding is an entire custom format that still requires the overhead of stringification.

On the other hand, Postgres's `binary` encoding, a.k.a. the `PGCOPY` format described here, is a binary container format wrapping the regular binary `send`/`recv` wire encoding of Postgres types—which Postgrex already knows how to encode/decode. This has many advantages: it has low on-the-wire size; it can be generated by building iodata from existing binaries with no processing or copying; and it doesn't require writing any new encoding logic for the primitives of the format, only for the (very simple) container format. (Assuming, that is, that you're putting this logic in Postgrex. Since Postgrex doesn't directly expose its wire-encoding primitives for use by other libraries, a third-party library implementing the `PGCOPY` format would have to duplicate a lot of Postgrex's work!)

It would make sense to me to approach this problem like so:
- factor the Postgres wire-format encoding out of `postgrex` into a library that other libraries can depend on (e.g. `postgrex_wire`);
- implement `Postgrex.PGCOPY` (which could be part of Postgrex, or its own library where both `postgrex` and `postgrex_pgcopy` depend on `postgrex_wire`), where `Postgrex.PGCOPY` presents an interface with conventional functions such as `decode(!)/1,2`, `encode_to_iodata/1`, `encode_to_stream/1`, etc.;
- extend `Postgrex.stream` to take an `encoder` or `decoder` option (a necessary generalization of the current `decode_mapper` option, given that the PGCOPY format requires a header and trailer);
- `Postgrex.PGCOPY` would be a valid `encoder` or `decoder`. Other modules, e.g. `PostgresCSV`, could also be implemented for use here if users want them. (These other modules would be necessary if attempting to interface with only semi-PG-wire-compatible DBMSes like CockroachDB or Aurora Postgres.)
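To illustrate, with the proposed `encoder` option (hypothetical; neither the option nor `Postgrex.PGCOPY` exists today, and `pid` is assumed to be an open Postgrex connection), the original ETL use-case might end up looking like:

```elixir
rows = [
  [1, "hello"],
  [2, "world"]
]

Postgrex.transaction(pid, fn conn ->
  # Hypothetical: the stream would prepend the PGCOPY header, encode each
  # collected row against foo's row type, and append the trailer on completion.
  copy =
    Postgrex.stream(conn, "COPY foo FROM STDIN WITH (FORMAT binary)", [],
      encoder: Postgrex.PGCOPY
    )

  Enum.into(rows, copy)
end)
```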