basho / riak_ee-issues

Issue tracking for Riak Enterprise

Multiple riak_repl2_fscoordinator_sup workers per cluster when adding nodes #30

Open nerophon opened 8 years ago

nerophon commented 8 years ago

A customer finds multiple fullsync coordinator workers running simultaneously on each of two clusters. This causes multiple fullsync schedules to run concurrently; the actual fullsync operations may or may not overlap, but each coordinator is active and has its own timer.

This state is reproducible as follows:

  1. Set up two clusters, A & B.
  2. Set up REPL and connect them (cluster manager 0.0.0.0:9080).
  3. Set fullsync_on_connect to true (unclear whether this step is required).
  4. Push continuous load onto cluster A.
  5. Start fullsync with A as source and B as sink.
  6. While fullsync is running, join one or more new nodes to A.
  7. On every node, run riak attach and evaluate: supervisor:count_children(whereis(riak_repl2_fscoordinator_sup)).
  8. Observe that the worker count is greater than 0 on more than one node. In my test, this was the case on the original coordinator node and on the newly joined node.
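
The check in steps 7 and 8 can also be run from a single attached console instead of attaching to every node. The following is only a sketch: it assumes that every cluster member is connected and therefore listed by nodes(), and the variable names are illustrative.

```erlang
%% Sketch only: paste into a `riak attach` console on any one node.
%% Assumption: all cluster members are connected and visible via nodes().
Nodes = [node() | nodes()].

%% Look up the supervisor on node N and report its child counts.
CountWorkers = fun(N) ->
    case rpc:call(N, erlang, whereis, [riak_repl2_fscoordinator_sup]) of
        Pid when is_pid(Pid) ->
            Counts = supervisor:count_children(Pid),
            [{workers, proplists:get_value(workers, Counts)},
             {active, proplists:get_value(active, Counts)}];
        Other ->
            %% undefined (supervisor not registered) or {badrpc, Reason}
            {not_running, Other}
    end
end.

%% One {Node, Result} tuple per node.
[{N, CountWorkers(N)} || N <- Nodes].
```

In a healthy cluster only one node should report a non-zero worker count; more than one such node matches the state described above.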

The workaround for this issue is to manually kill all riak_repl2_fscoordinator_sup processes as follows:

  1. Stop and disable fullsync.
  2. Wait a few minutes.
  3. On each node, attach and run Pid = whereis(riak_repl2_fscoordinator_sup). followed by erlang:exit(Pid, kill).
  4. Wait a few minutes.
  5. Enable and start fullsync.
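
Step 3 can likewise be sketched as one snippet run from a single attached console. It is not a tested procedure: it assumes that all cluster members are visible via nodes(), and it adds a guard so that nodes where the supervisor is not registered are skipped rather than raising badarg.

```erlang
%% Sketch only: run from `riak attach` after fullsync has been stopped and disabled.
%% Assumption: all cluster members are connected and visible via nodes().
KillSup = fun(N) ->
    case rpc:call(N, erlang, whereis, [riak_repl2_fscoordinator_sup]) of
        Pid when is_pid(Pid) ->
            %% exit/2 also works on remote pids and returns true
            {killed, erlang:exit(Pid, kill)};
        Other ->
            %% undefined (not registered) or {badrpc, Reason}
            {skipped, Other}
    end
end.

%% One {Node, Result} tuple per node.
[{N, KillSup(N)} || N <- [node() | nodes()]].
```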

The symptoms of this issue are extremely slow fullsync operations, cluster overload / slowness, and fullsync activity in the logs when no fullsync ought to be running.

nerophon commented 8 years ago

Internal duplicate: https://github.com/basho/riak_repl/issues/748