citusdata / pg_shard

ATTENTION: pg_shard is superseded by Citus, its more powerful replacement
https://github.com/citusdata/citus

Add support for \copy command via script/trigger #82

jasonmp85 closed this 9 years ago

jasonmp85 commented 9 years ago

The script provides an easy way for users to COPY to a distributed table. It accepts a filename and table name, prepares the table for COPY, performs the COPY, then outputs the number of rows copied. Flags are supported to enable various COPY OPTIONS in the underlying SQL statement.

This branch still needs Makefile changes, better documentation, and a unit test, but I wanted to get the code out there to kick off a review. I'll be pushing the remaining changes here as they come up.

Fixes #61

Code Review Tasks

jasonmp85 commented 9 years ago

High-level description of the approach: prepare_distributed_table_for_copy accepts a table and creates a function specifically for that table based on its columns. Ideally we'd use INSERT INTO foo VALUES (($1).*) USING NEW or even INSERT INTO foo VALUES ($1.bar, $1.baz) USING NEW, but these types of field accesses aren't permitted by pg_shard (at the moment). It rejects them as not being constant expressions.

So the only type of statement that could work is: INSERT INTO foo VALUES ($1, $2) USING NEW.bar, NEW.baz. The table-specific part of that statement is not in the INSERT string but in the USING clause, which belongs to the function body itself, so a single generic function cannot cover every table. We have to generate a complete function per table, and that's what we do.

Once the function has been generated for the given table, it is installed as a trigger on a temporary table with the same structure as the target table. Any COPY executed against the temporary table is then redirected to the distributed table in a form that pg_shard accepts.
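To make the generated-function idea concrete, here is a minimal sketch in Python of what the generated text might look like for a given table and column list. It is not the code from this branch: the helper names, the copy_to_<table> function name, and the choice to RETURN NULL (which assumes a BEFORE INSERT row trigger that should skip the proxy-table insert) are all assumptions made for illustration.

```python
from typing import Sequence


def quote_ident(name: str) -> str:
    """Quote a SQL identifier by doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def quote_literal(text: str) -> str:
    """Quote a SQL string literal (assumes standard_conforming_strings = on)."""
    return "'" + text.replace("'", "''") + "'"


def build_copy_trigger_function(table: str, columns: Sequence[str]) -> str:
    """Return CREATE FUNCTION text for a per-table trigger function.

    Hypothetical sketch: the INSERT string contains only $n placeholders,
    while the per-table field accesses (NEW.col) sit in the USING clause.
    """
    if not columns:
        raise ValueError("at least one column is required")
    placeholders = ", ".join(f"${i}" for i in range(1, len(columns) + 1))
    using_args = ", ".join(f"NEW.{quote_ident(c)}" for c in columns)
    insert_sql = f"INSERT INTO {quote_ident(table)} VALUES ({placeholders})"
    func_name = quote_ident(f"copy_to_{table}")  # hypothetical naming scheme
    return (
        f"CREATE OR REPLACE FUNCTION {func_name}() RETURNS trigger AS $fn$\n"
        "BEGIN\n"
        f"    EXECUTE {quote_literal(insert_sql)} USING {using_args};\n"
        "    RETURN NULL;  -- assumed: skip the insert into the proxy table\n"
        "END;\n"
        "$fn$ LANGUAGE plpgsql;"
    )


if __name__ == "__main__":
    print(build_copy_trigger_function("foo", ["bar", "baz"]))
```

The point of the sketch is only the split described above: the string handed to EXECUTE is the same shape for every table, and everything that depends on the table's columns ends up in the USING clause of the generated function.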

sumedhpathak commented 9 years ago

Ideally we'd use INSERT INTO foo VALUES (($1).*) USING NEW or even INSERT INTO foo VALUES ($1.bar, $1.baz) USING NEW,

Is USING an INSERT keyword? I don't see it in the Postgres INSERT documentation.

jasonmp85 commented 9 years ago

It's part of the EXECUTE statement in PL/pgSQL.

jasonmp85 commented 9 years ago

We rarely write this much PL/pgSQL, so here is a brief primer on the features used, to help with review:

sumedhpathak commented 9 years ago

I had three higher-level things:

jasonmp85 commented 9 years ago

On each of your high-level points:

jasonmp85 commented 9 years ago

Since you didn't mention the naming, I'm assuming you're fine with the names of:

I'll make a checklist at the top of this PR to keep track of the things I still need to do. Once they're done, it sounds like this is a :shipit:.

I had one concern: because the UDF returns the proxy table name, I should be able to store it in a variable and interpolate it, or at least I would were I using COPY (server-side). From the documentation (emphasis mine):

The syntax of this command is similar to that of the SQL COPY command. All options other than the data source/destination are as specified for COPY. Because of this, special parsing rules apply to the \copy command. In particular, psql's variable substitution rules and backslash escapes do not apply.

This is why I'm directly interpolating the table name myself in the shell script, which I don't like. Though \copy can be slower than COPY (data must pass through the client), it has two benefits:

So server-side COPY might be something we explore if the performance of \copy becomes an issue, but otherwise the usability of \copy wins out, so I'm sticking with it. I am going to update this to avoid so much direct interpolation, but wanted to call out the tradeoffs in either direction.
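One way to reduce the risk of interpolating the table name directly is to quote it as an identifier and hand psql a single argument instead of a shell string. The following is a hedged sketch in Python, not the shell script from this branch; the function names are hypothetical, and it refuses file names containing a single quote rather than guessing at \copy's quoting rules.

```python
from typing import List


def quote_ident(name: str) -> str:
    """Quote a SQL identifier by doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def build_copy_argv(proxy_table: str, filename: str) -> List[str]:
    """Build an argv list that runs one \\copy meta-command through psql.

    Hypothetical helper: proxy_table stands for whatever name the UDF
    returned; it is quoted here instead of being spliced in verbatim.
    """
    if "'" in filename:
        # Kept deliberately strict: this sketch does not try to escape quotes.
        raise ValueError("file names containing single quotes are not supported")
    meta_command = f"\\copy {quote_ident(proxy_table)} from '{filename}'"
    # psql's -c option accepts a single backslash command as its argument.
    return ["psql", "-c", meta_command]


if __name__ == "__main__":
    # Passing this list to subprocess.run(argv) avoids a second round of
    # shell word-splitting and expansion on the table name.
    print(build_copy_argv("proxy_table_name", "/tmp/data.csv"))
```

This does not remove interpolation entirely, since \copy still needs the name in its command text, but it confines the quoting to one place.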

sumedhpathak commented 9 years ago

@jasonmp85 I am OK with the names of both the script and the UDF.