Closed jasonmp85 closed 9 years ago
I just wanted to note that there is a concurrency bug in the script version that I have been using.
The problem is that the function catalog, pg_proc, isn't MVCC-safe. When we run CREATE OR REPLACE FUNCTION, we invalidate the cached copies of all of the old versions of the function, and in-flight copies fail.
This problem only occurs when we run CREATE OR REPLACE FUNCTION, so if you create the function once, you'll be fine. I just wanted to make sure you're aware of the concurrency issue before you productize the trigger.
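One way to "create the function once" is to check pg_proc first and skip the definition when the function already exists. The following is only a sketch of that idea: the function name copy_to_insert, the public schema, and the placeholder body are assumptions for illustration, not taken from the script under discussion.

```sql
-- Sketch only: "public.copy_to_insert" is a hypothetical name.
DO $do$
BEGIN
    -- Look the function up in the catalog before defining it.
    IF NOT EXISTS (
        SELECT 1
        FROM pg_proc p
        JOIN pg_namespace n ON n.oid = p.pronamespace
        WHERE p.proname = 'copy_to_insert'
          AND n.nspname = 'public'
    ) THEN
        -- Plain CREATE (not CREATE OR REPLACE): an existing definition is
        -- never touched, so cached copies in other sessions stay valid.
        CREATE FUNCTION public.copy_to_insert() RETURNS trigger AS $fn$
        BEGIN
            -- Placeholder body standing in for the real trigger logic.
            RETURN NEW;
        END;
        $fn$ LANGUAGE plpgsql;
    END IF;
END
$do$;
```

Two sessions can still race between the existence check and the CREATE, in which case one of them fails with a duplicate-function error, so this check probably still wants some form of serialization around it.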
@rsolari, does your copy of this script have a call to pg_advisory_lock? It was added to guard the CREATE OR REPLACE call because concurrent modifications caused problems…
Or is this a separate issue? It sounds as though you're saying the REPLACE call trips up in-flight executions of the trigger that were otherwise happy…
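For reference, a guard of the kind mentioned here might look roughly like the sketch below. The lock key 8675309, the function name copy_to_insert, and the placeholder body are assumptions chosen for illustration; the actual script may differ.

```sql
-- Sketch only: the key and the function name are hypothetical.
-- Session-level advisory lock: blocks until no other session holds the key.
SELECT pg_advisory_lock(8675309);

CREATE OR REPLACE FUNCTION copy_to_insert() RETURNS trigger AS $$
BEGIN
    -- Placeholder body standing in for the real trigger logic.
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

-- Session-level locks are not released at transaction end,
-- so release explicitly once the definition is in place.
SELECT pg_advisory_unlock(8675309);
```

A lock like this only serializes sessions that are replacing the function at the same time. It does nothing for sessions that are merely executing the function, since they never take the lock.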
> @rsolari, does your copy of this script have a call to pg_advisory_lock? It was added to guard the CREATE OR REPLACE call because concurrent modifications caused problems…
Yes, there is a lock around CREATE OR REPLACE.
> Or is this a separate issue? It sounds as though you're saying the REPLACE call trips up in-flight executions of the trigger that were otherwise happy…
Yep, that's what's happening. Each local process' cached version of the function gets invalidated, and in-flight copies fail.
@rsolari So the current (short-term) approach is to make parallel use safe, but still have the same failure mode (failure meaning the COPY failed because of something beyond our control, not because of a bug in parallel access).
I was imagining you could do something like:
COPY
) process
COPY to ingest its file

Assuming we provide a multiprocess-safe COPY-compatible function that returns the number of rows successfully copied, what are you missing? Is your desired workflow significantly different from the above?
That workflow sounds like exactly what we want.
Hey @rsolari — I know you guys had some issues with the existing script apart from what you've said here, namely:
OPTIONS provided to COPY
The pull request (#82) I opened has a script that allows relative paths and supports most OPTIONS for COPY, but I was wondering if you also need the ability to explicitly specify what columns are in the input (if, for instance, you want to omit certain columns in your input file). This feature shows up in the COPY syntax as the ( column_name [, ...] ) clause. Do you need support for this right now?
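To illustrate the clause in question, here is what a plain PostgreSQL COPY with an explicit column list looks like. The table, column names, and file path are made up for the example.

```sql
-- Hypothetical table, columns, and path, shown only to illustrate the clause.
-- Only the listed columns are read from the file; any column of my_table
-- that is not listed receives its default value.
COPY my_table (id, created_at, payload)
FROM '/path/to/input.csv'
WITH (FORMAT csv);
```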
Thanks for checking in. We don't need support for specifying columns right now.
We only need support for specifying FORMAT as text, the DELIMITER, and the NULL character. Here's our COPY:
COPY my_table FROM :'filename' WITH(FORMAT text, DELIMITER ',', NULL '\N');
I looked over #82, and it looks like all of the things we'd want supported are supported, which is awesome.
I've written a COPY trigger used by some of our customers which turns a COPY command into many INSERT commands using a temporary table with the same schema as a sharded table. Unfortunately it's hardcoded to a specific schema. The first step towards users being able to COPY to a sharded table is to generalize this script for any table.
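As a rough illustration of the approach described above, and not the actual script, the sketch below stages rows in a temporary table and uses a row-level trigger to re-issue each one as an INSERT. Passing the target table as a trigger argument is one possible way to avoid hardcoding a schema. Every name here (my_table, my_table_staging, copy_to_insert, the file path) is hypothetical, and whether the generated INSERT form is accepted by a given sharded table is an open assumption.

```sql
-- Sketch only: all names below are hypothetical.
-- Staging table with the same columns as the target table.
CREATE TEMPORARY TABLE my_table_staging (LIKE my_table INCLUDING DEFAULTS);

CREATE OR REPLACE FUNCTION copy_to_insert() RETURNS trigger AS $$
BEGIN
    -- The target table arrives as the first trigger argument, so the
    -- function itself is not tied to any particular table.
    EXECUTE format('INSERT INTO %s SELECT ($1).*', TG_ARGV[0]::regclass)
    USING NEW;
    -- Returning NULL from a BEFORE ROW trigger skips the insert into the
    -- staging table, which therefore stays empty.
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER copy_to_insert_trigger
    BEFORE INSERT ON my_table_staging
    FOR EACH ROW EXECUTE PROCEDURE copy_to_insert('my_table');

-- COPY fires the row-level trigger once per input row.
-- Server-side COPY FROM a file needs suitable privileges; psql's \copy
-- is the client-side alternative.
COPY my_table_staging FROM '/path/to/input.csv' WITH (FORMAT csv);
```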