citusdata / pg_shard

ATTENTION: pg_shard is superseded by Citus, its more powerful replacement
https://github.com/citusdata/citus

copy_to_distributed_table in contrib package fails to mark inactive shards #95

Open metdos opened 9 years ago

metdos commented 9 years ago

If I shut down (Ctrl+C) copy_to_distributed_table while it is running, it fails to mark the shards on the closed node as inactive. If I wait for copy_to_distributed_table to complete, then it properly marks the shards on the closed node as inactive.

Here are the steps to replicate the problem:

  1. Start a 2+1 pg_shard cluster.
  2. Create the customer_reviews table as in the pg_shard documentation and create the shard placements.
  3. Shut down one of the worker nodes.
  4. Start copy_to_distributed_table: /usr/local/pgsql/bin/copy_to_distributed_table -CH -n NULL customer_reviews_1998.csv customer_reviews
  5. Shut down (Ctrl+C) the copy before it completes.
  6. Check pgs_distribution_metadata.shard_placement and see that none of the shard placements are marked as inactive (see the example query after this list).
  7. Run the following queries:
    1. Run select count(*) from customer_reviews;
    2. Start the closed worker node.
    3. Run select count(*) from customer_reviews; again and see that the results differ.
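
For step 6, a query like the following can be used to inspect the placement states (column names as in pg_shard's pgs_distribution_metadata schema; the numeric meanings, 1 = finalized and 3 = inactive, are my assumption):

    SELECT shard_id, node_name, node_port, shard_state
    FROM pgs_distribution_metadata.shard_placement
    ORDER BY shard_id, node_name;

After the interrupted copy, every placement still reports the finalized state, including those on the node that was shut down.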
onderkalaci commented 9 years ago

Hey @jasonmp85,

I examined the bug thoroughly. It seems this problem (or a closely related one) may lead to other bugs too.

First, let me explain the problem. Basically, our approach does the following (a sketch follows the list):

  1. Create a temporary table.
  2. Add a trigger to the temporary table.
  3. On each INSERT into the temporary table, the trigger INSERTs into the distributed table.
  4. Execute COPY into the temporary table; every copied row fires the trigger and thereby the INSERT into the distributed table.
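
A minimal sketch of this mechanism, assuming hypothetical table and function names (the real ones are created internally by the copy_to_distributed_table script):

    CREATE TEMPORARY TABLE customer_reviews_staging
        (LIKE customer_reviews INCLUDING DEFAULTS);

    -- Hypothetical trigger function: forwards each staged row to the
    -- distributed table, where pg_shard routes it to shard placements.
    CREATE FUNCTION forward_to_distributed_table() RETURNS trigger AS $$
    BEGIN
        INSERT INTO customer_reviews SELECT NEW.*;
        RETURN NULL;  -- discard the staged row itself
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER forward_row
        BEFORE INSERT ON customer_reviews_staging
        FOR EACH ROW EXECUTE PROCEDURE forward_to_distributed_table();

    -- A single COPY statement, hence a single master-side transaction;
    -- every row fires the trigger above.
    COPY customer_reviews_staging FROM '/path/to/customer_reviews_1998.csv' WITH CSV HEADER;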

Now, here is how the bug happens (a sketch of the metadata side follows the list):

  1. Stop one of the worker nodes.
  2. During the COPY command, the shard placements on the stopped worker node are marked STATE_INACTIVE. Thus, COPY does not try to INSERT into the placements on the stopped node.
  3. Concurrent queries read the previous version of the metadata, in which all shard placement states are STATE_FINALIZED. (I guess this is due to MVCC.)
  4. If COPY fails, the transaction that COPY is executed inside is rolled back.
  5. The shard placements that were marked STATE_INACTIVE become STATE_FINALIZED again.
  6. Thus, we end up with shard placements that have diverged, but all of which are in the finalized state.
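
To make steps 2-5 concrete, the metadata side of the failed COPY behaves roughly like the following single transaction (the numeric shard_state values are my assumption: 1 = STATE_FINALIZED, 3 = STATE_INACTIVE):

    BEGIN;  -- the master-side transaction the COPY runs in
    -- step 2: placements on the unreachable node are marked inactive
    UPDATE pgs_distribution_metadata.shard_placement
       SET shard_state = 3          -- STATE_INACTIVE (assumed value)
     WHERE node_name = 'worker-2';
    -- ... COPY keeps inserting into the remaining placements ...
    ROLLBACK;  -- step 4: Ctrl+C aborts COPY
    -- steps 5-6: the UPDATE is undone, so the placements on worker-2
    -- read as finalized again, yet they are missing every copied row.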

This problem can also be observed in other forms, such as the following (see the two-session timeline after the list):

  1. Stop one of the worker nodes.
  2. During the COPY command, the shard placements on the stopped worker node are marked STATE_INACTIVE.
  3. Concurrent queries read the previous version of the metadata, in which all shard placement states are STATE_FINALIZED (again, I guess due to MVCC). Thus, concurrent queries can read from STATE_INACTIVE shard placements.
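
The read-side effect can be seen with two sessions; a hypothetical timeline (same assumed state values as above):

    -- Session A (master, while COPY is running):
    BEGIN;
    UPDATE pgs_distribution_metadata.shard_placement
       SET shard_state = 3 WHERE node_name = 'worker-2';  -- uncommitted

    -- Session B (a concurrent query), at the same moment:
    SELECT shard_state FROM pgs_distribution_metadata.shard_placement
     WHERE node_name = 'worker-2';
    -- returns 1: session B's snapshot predates the uncommitted UPDATE,
    -- so the planner still treats worker-2's placements as readable.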

How can we approach a solution?

I tried to find some ways to handle this bug. What do you think about them? Which one should I follow? Or can you suggest any other way?

  1. Using COPY makes things difficult to handle, mainly because we are in a single transaction on the master node but execute many transactions on the worker nodes. Rolling back all the INSERTs is not simple. So, one approach could be to change COPY into consecutive INSERTs (sketched after this list), which in turn may decrease overall performance. This may also require a lot of code change (handling the COPY parameters, etc.).
  2. Change copy_to_distributed_table to use some kind of DO block or PL/pgSQL function instead of this part. Then write code that iterates over the file: on each iteration, copy a single line from the file, save it to a temporary file, and finally COPY from that temporary file. This method incurs high overhead, but at least we do not lose the gains of using COPY and we don't need to change too many lines of code. We would need to handle the newline character when parsing the file, which can be done easily.
  3. I don't know whether it is doable, but if we could disable rollback for the transaction started by the COPY command, we could bypass this problem.
  4. Disable interrupts during execution of the script. We would still be prone to the bug if the server is shut down, and from a usability standpoint it is not a good approach either: if loading the data takes a long time (which is realistic) and the user wants to stop it, they wouldn't be able to. This is the easiest to implement, though. If we are going to implement first-class COPY soon, we could consider this and inform users about the current bug.
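
For the first option, the idea in miniature (column list and values are illustrative only): under autocommit each INSERT is its own transaction, so an interrupt between rows cannot roll back metadata updates made while handling a failed placement:

    -- Each statement commits on its own under autocommit.
    INSERT INTO customer_reviews (customer_id, review_date, review_rating)
        VALUES ('CUST-1', '1998-01-05', 5);
    INSERT INTO customer_reviews (customer_id, review_date, review_rating)
        VALUES ('CUST-2', '1998-01-06', 3);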
onderkalaci commented 9 years ago

Hey @jasonmp85, changing the value of ON_ERROR_ROLLBACK has no effect in our case. I read this post, which explains the effects of the parameter in detail. Despite what is written in the post, I tested changing its value, and as I expected the results didn't change.

To summarize what I observed:

  1. First, it is not a Postgres feature but a psql one, and there is no way to instruct Postgres itself to ignore errors inside a transaction or to disable ROLLBACK.
  2. ON_ERROR_ROLLBACK is designed for error handling while executing consecutive commands inside a transaction (see the psql example after this list). However, we only have a single transaction associated with the \COPY command.
  3. I thought of changing copy_to_insert so that it starts a transaction. That way, the metadata would be updated inside that transaction, and quitting the script would not lead to this bug. However, as explained here, it is not possible to start a transaction inside a PL/pgSQL function.
  4. Also, I thought of executing the triggers as separate transactions, with the same motivation as above. However, again, it is not possible, as explained here.
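
To illustrate point 2: ON_ERROR_ROLLBACK makes psql issue an implicit SAVEPOINT before each statement inside an explicit transaction block, so a failing statement no longer aborts the rest of the block; with a single \COPY there is no statement sequence to protect:

    \set ON_ERROR_ROLLBACK on
    BEGIN;
    INSERT INTO customer_reviews (customer_id) VALUES ('CUST-1');
    SELECT 1/0;   -- fails, but only this statement is rolled back
    COMMIT;       -- the INSERT above still commits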

So, it seems that we cannot solve this problem by playing with ON_ERROR_ROLLBACK.