citusdata / pg_shard

ATTENTION: pg_shard is superseded by Citus, its more powerful replacement
https://github.com/citusdata/citus
GNU Lesser General Public License v3.0

pg_shard may fail to mark shard placement as invalid under some circumstances #101

Open onderkalaci opened 9 years ago

onderkalaci commented 9 years ago

The bug occurs when pg_shard fails to INSERT into a shard placement and either postgres is shut down or the psql connection is closed before the shard placement status is updated.

This bug is not easy to reproduce. However, if a sleep() call is added to this line, reproducing it becomes easy.

Assuming that sleep() has been added, the bug can be reproduced with the following steps:

  1. Create a cluster with 1 master, 2 workers
  2. Distribute table and create worker shards with replication factor 2
  3. Stop one of the worker nodes
  4. Connect with psql and get its pid: select pg_backend_pid();
  5. Issue an INSERT in that psql session. While the INSERT is running (with the added sleep, it takes at least the sleep duration), execute the shell command "kill -9 pid_of_psql"
  6. Restart both master and the stopped worker node.
  7. Connect to worker nodes and observe that one of the shards is divergent
  8. However, all shard placements in the metadata still have STATE_FINALIZED status

The main problem here is that the remote commands and the placement status changes are not executed atomically.
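
To make the failure window concrete, here is a minimal toy model in C. It is not pg_shard code: `Placement`, `run_insert`, `has_undetected_divergence` and `STATE_INVALID_SKETCH` are hypothetical names invented for this sketch, and only `STATE_FINALIZED` comes from the report. It assumes the modification happens in two separate steps (send the INSERT, then record failures) and that the session can disappear between them.

```c
#include <stdbool.h>

#define MAX_PLACEMENTS 8

/* STATE_FINALIZED is named in the report; STATE_INVALID_SKETCH is a
 * hypothetical stand-in for whatever state marks a bad placement. */
typedef enum { STATE_FINALIZED, STATE_INVALID_SKETCH } PlacementState;

typedef struct
{
    bool worker_up;        /* can the master reach this worker? */
    int row_count;         /* rows actually stored in the shard */
    PlacementState state;  /* state recorded in the master's metadata */
} Placement;

/* Toy version of a replicated INSERT done in two non-atomic steps. */
void
run_insert(Placement placements[], int count, bool die_before_update)
{
    bool failed[MAX_PLACEMENTS] = { false };
    int i;

    if (count > MAX_PLACEMENTS)
        count = MAX_PLACEMENTS;

    /* Step 1: send the INSERT to every placement and remember failures. */
    for (i = 0; i < count; i++)
    {
        if (placements[i].worker_up)
            placements[i].row_count++;
        else
            failed[i] = true;
    }

    /* The session dies here: the failures above are never recorded. */
    if (die_before_update)
        return;

    /* Step 2: mark the failed placements in the metadata. */
    for (i = 0; i < count; i++)
    {
        if (failed[i])
            placements[i].state = STATE_INVALID_SKETCH;
    }
}

/* True if a placement is behind its replicas but still looks healthy. */
bool
has_undetected_divergence(const Placement placements[], int count)
{
    int i;
    int max_rows = 0;

    for (i = 0; i < count; i++)
    {
        if (placements[i].row_count > max_rows)
            max_rows = placements[i].row_count;
    }

    for (i = 0; i < count; i++)
    {
        if (placements[i].state == STATE_FINALIZED &&
            placements[i].row_count < max_rows)
            return true;
    }

    return false;
}
```

With one reachable and one stopped worker, `run_insert(..., false)` leaves the stopped placement marked as invalid, while `run_insert(..., true)` leaves both placements in STATE_FINALIZED even though their row counts differ, which matches steps 7 and 8 above.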

A possible solution that we can try is to check whether HOLD_INTERRUPTS()/RESUME_INTERRUPTS() works. We should also check whether this pair of calls has any drawbacks.
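
As a rough illustration of the idea, the sketch below models an interrupt hold-off counter in standalone C. It does not use the PostgreSQL headers: every `sketch_*` function and `modify_with_holdoff` is a hypothetical stand-in, written under the assumption that a pending interrupt is only acted on at an explicit check and only while the hold-off count is zero. Whether the real macros close the window in pg_shard is exactly what still needs to be verified.

```c
#include <stdbool.h>

static int holdoff_count = 0;           /* how many hold-offs are active */
static bool interrupt_pending = false;  /* an interrupt arrived, not yet serviced */

static void
sketch_hold_interrupts(void)
{
    holdoff_count++;
}

static void
sketch_resume_interrupts(void)
{
    if (holdoff_count > 0)
        holdoff_count--;
}

/* Service a pending interrupt only when no hold-off is active.
 * Returns true if the interrupt was serviced (the operation aborts). */
static bool
sketch_check_for_interrupts(void)
{
    if (interrupt_pending && holdoff_count == 0)
    {
        interrupt_pending = false;
        return true;
    }
    return false;
}

/* Returns true if the placement state update ran before the interrupt
 * was serviced, false if the interrupt cut the operation short. */
bool
modify_with_holdoff(bool use_holdoff)
{
    bool state_updated = false;

    holdoff_count = 0;
    interrupt_pending = false;

    if (use_holdoff)
        sketch_hold_interrupts();

    /* The remote command runs here; an interrupt arrives meanwhile. */
    interrupt_pending = true;

    /* An interrupt check between the two steps. */
    if (sketch_check_for_interrupts())
        return state_updated;

    /* The placement state update. */
    state_updated = true;

    if (use_holdoff)
        sketch_resume_interrupts();

    /* The deferred interrupt is serviced only now. */
    sketch_check_for_interrupts();

    return state_updated;
}
```

In this model `modify_with_holdoff(true)` completes the state update and `modify_with_holdoff(false)` does not. One drawback is visible even in the toy version: a hold-off only defers interrupts that the process itself checks for, so it cannot help when the backend is killed with SIGKILL or the server goes down between the two steps.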