jasonmp85 closed this PR 9 years ago
High-level description of the approach: `prepare_distributed_table_for_copy` accepts a table and creates a function specifically for that table based on its columns. Ideally we'd use `INSERT INTO foo VALUES (($1).*) USING NEW` or even `INSERT INTO foo VALUES ($1.bar, $1.baz) USING NEW`, but these types of field accesses aren't permitted by pg_shard (at the moment): it rejects them as not being constant expressions. So the only type of statement that could work is `INSERT INTO foo VALUES ($1, $2) USING NEW.bar, NEW.baz`. Since the dynamic text of that statement isn't in the `INSERT` string but in the function itself (in the `USING` clause), this necessitates generating a complete function. So that's what we do.

Once the function has been generated for the given table, it is installed as a trigger on a temporary table shaped exactly like the target table. Then any `COPY` executed against the temporary table will be redirected to the distributed table in a manner accepted by pg_shard.
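As a concrete sketch of the mechanism (assuming a hypothetical target table `foo(bar int, baz text)`; none of these names come from the patch), the generated function and proxy table might look roughly like:

```sql
-- Hypothetical shape of the generated trigger function for a table
-- foo(bar int, baz text). pg_shard accepts the routed INSERT because
-- its VALUES list contains only parameters, not field accesses.
CREATE FUNCTION foo_copy_trigger() RETURNS trigger AS $body$
BEGIN
    EXECUTE 'INSERT INTO foo VALUES ($1, $2)' USING NEW.bar, NEW.baz;
    RETURN NULL;  -- suppress the INSERT into the proxy table itself
END;
$body$ LANGUAGE plpgsql;

-- The temporary proxy table mirrors the target's columns, and the
-- trigger redirects every copied row:
CREATE TEMPORARY TABLE foo_proxy (LIKE foo);
CREATE TRIGGER redirect_to_foo
    BEFORE INSERT ON foo_proxy
    FOR EACH ROW EXECUTE PROCEDURE foo_copy_trigger();

-- Any COPY against the proxy now flows into the distributed table:
-- COPY foo_proxy FROM '/path/to/data.csv' WITH (FORMAT csv);
```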
> Ideally we'd use `INSERT INTO foo VALUES (($1).*) USING NEW` or even `INSERT INTO foo VALUES ($1.bar, $1.baz) USING NEW`,
Is `USING` an `INSERT` keyword? I don't see it in the Postgres `INSERT` documentation.

It's part of the `EXECUTE` statement in PL/pgSQL.
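For review context, a minimal standalone sketch (table name and values are invented, and `foo` is assumed to exist) of `EXECUTE ... USING` in PL/pgSQL, combined with `format`'s `%I` specifier:

```sql
-- Hedged illustration, not code from this PR: EXECUTE runs a
-- dynamically built string; USING binds values to $1, $2 at runtime.
DO $$
DECLARE
    target text := 'foo';  -- hypothetical table name
BEGIN
    -- %I quotes the value as an identifier; the bound values never
    -- appear in the statement text, so there is nothing to escape.
    EXECUTE format('INSERT INTO %I VALUES ($1, $2)', target)
        USING 42, 'hello';
END;
$$;
```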
We don't frequently write this heavily in PL/pgSQL, so here is a brief primer on the features used, to help in review:

- `DECLARE` blocks appear at the top of functions (though you can also have nested blocks within the function body for limited-scope variables)
- `format` has `%I` and `%L` specifiers for "identifier" and "literal", respectively. They help format things that need to appear in an SQL statement as an identifier (`INSERT INTO identifier`) or a literal (`VALUES ('literal')`)
- `SELECT INTO variable` allows the result of a `SELECT` to be assigned to a variable. `STRICT` means we expect exactly one row
- `||` is string concatenation
- `EXECUTE` is basically `eval`: it executes a string which may contain positional parameters (`$1`, `$2`, etc.) bound at runtime to values supplied in a `USING` clause

I had three higher-level things:
1. I think I prefer the previous solution of passing a sequence to the function, which it increments. Is there a reason not to do this? That is, is there a use case for the write-through mode in the function apart from being able to install another trigger on the temp table to count the rows? If not, then I think just passing a sequence makes sense. Perhaps if no sequence is specified, the function doesn't increment one?
2. I like that the function returns the table name, since it makes it easier to decouple the calls and run them manually.
On each of your high-level points:

1. I agree it is more straightforward to explain, but I think this is more modular. It just routes the `INSERT` and knows how to do nothing else. Why pass a sequence? What's special about that, as opposed to, say, passing a function name and having it call that after each row? Ultimately I'm going to change it, but I do think an "`INSERT`-only trigger with an option to chain more triggers" is a more composable building block than an "`INSERT` trigger that can update a sequence, if supplied".
2. This one is a little trickier. You might have overlooked that I'm not actually using a distributed table in my test; I'm just making a proxy to a normal table. This was because I wanted the test to check the `INSERT` logic alone and not actually set up a pg_shard table. I find it best to have the behavior that if a feature breaks, only the unit tests related to that feature break: it helps diagnosis. By using a real distributed table, if table distribution breaks, a bunch of tests break, which doesn't aid in finding the problem. Of course, I'm still playing devil's advocate: I think the `INSERT` trigger is enough of a hack that we definitely want a test that runs it against an actual distributed table. So I'll update that too, even though it'll complicate the test.
Since you didn't mention the naming, I'm assuming you're fine with the names of:
I'll make a checklist at the top of this PR to keep track of the things I need to do, and once they're done it sounds like this is a :shipit:.
I had one concern: because the UDF returns the proxy table name, I should be able to store it in a variable and interpolate it, or at least I would were I using `COPY` (server-side). From the documentation (emphasis mine):

> The syntax of this command is similar to that of the SQL `COPY` command. All options other than the data source/destination are as specified for `COPY`. Because of this, special parsing rules apply to the `\copy` command. In particular, **psql's variable substitution rules and backslash escapes do not apply**.
This is why I'm directly interpolating the table name myself in the shell script, which I don't like. Though `\copy` can be slower than `COPY` (the data must pass through the client), it has two benefits:

So `COPY` might be something we explore if the performance of `\copy` is an issue, but otherwise the usability of the latter wins out, so I'm sticking with it. I am going to update this to avoid so much direct interpolation, but I wanted to call out the tradeoffs in either direction.
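To make the tradeoff concrete, here's a hedged shell sketch (table and file names are invented) of the direct interpolation that `\copy` forces on us, since psql variables can't be substituted into it:

```shell
# Hypothetical sketch: because psql won't substitute variables inside
# \copy, the proxy table name must be spliced in by the shell before
# psql ever sees the command.
proxy_table="foo_proxy"   # imagine this was captured from the UDF
copy_cmd="\\copy ${proxy_table} from data.csv csv"
echo "$copy_cmd"
# The expanded command would then be fed to psql, e.g.:
#   echo "$copy_cmd" | psql mydatabase
```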
@jasonmp85 I am OK with the names of both the script and the UDF.
The script provides an easy way for users to `COPY` to a distributed table. It accepts a filename and a table name, prepares the table for `COPY`, performs the `COPY`, then outputs the number of rows copied. Flags are supported to enable various `COPY` options in the underlying SQL statement.
This branch still needs Makefile changes, better documentation, and a unit test, but I wanted to get the code out there to kick off a review. I'll be pushing the remaining changes here as they come up.
Fixes #61
Code Review Tasks

- `COPY` test with partial failure
- (`psql` variables)
- `usage` shell function