basho / riak_pipe

Riak Pipelines
Apache License 2.0

NodeCheckService should be configurable for qcover_fsm [JIRA: RIAK-2409] #85

Closed kellymclaughlin closed 8 years ago

kellymclaughlin commented 10 years ago

The NodeCheckService used by riak_pipe_qcover_fsm is hard-coded to riak_pipe. This service is passed to node_watcher to determine whether a node is up and eligible to service a cover request. Since riak_pipe is a foundational component for other services (e.g. Riak's mapreduce), the service used to decide whether a node is up or down should be configurable, with riak_pipe as the default value rather than a hard-coded one.
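To illustrate the proposal, here is a minimal sketch in Python rather than Erlang. It is a model, not riak_pipe code: the `SERVICES_UP` table stands in for what node_watcher reports, and the `node_check_service` option name and the `eligible_nodes` function are hypothetical names chosen for this example.

```python
# Hypothetical stand-in for node_watcher: the services each node reports as up.
SERVICES_UP = {
    "node1": {"riak_pipe", "riak_kv"},
    "node2": {"riak_pipe"},  # riak_kv is not up on this node
    "node3": set(),          # nothing is up on this node
}

# Default proposed in this issue: keep checking riak_pipe unless told otherwise.
DEFAULT_NODE_CHECK_SERVICE = "riak_pipe"


def eligible_nodes(services_up, options=None):
    """Return the nodes on which the configured check service is up.

    `options` is a hypothetical per-request option map; the
    "node_check_service" key is an assumption made for this sketch.
    """
    options = options or {}
    service = options.get("node_check_service", DEFAULT_NODE_CHECK_SERVICE)
    return sorted(node for node, up in services_up.items() if service in up)


# Default behaviour: any node running riak_pipe is eligible.
print(eligible_nodes(SERVICES_UP))
# A mapreduce caller could ask for riak_kv instead.
print(eligible_nodes(SERVICES_UP, {"node_check_service": "riak_kv"}))
```

With the default, node2 is still considered eligible even though it cannot serve riak_kv reads. Passing riak_kv as the check service excludes it.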

An example should make the reasoning clearer. In the case of Riak's mapreduce, riak_pipe handles the infrastructure of the request (vnode planning, results accumulation, etc.), but the specialized work for mapreduce consists of riak_kv operations such as reading a value. When initiating a mapreduce job using the qcover_fsm, it would be preferable to check the status of the riak_kv service on each node rather than the riak_pipe service. The riak_kv service is registered after the riak_pipe service during Riak startup, and the riak_pipe service is stopped only after the riak_kv service has stopped during Riak shutdown.

The time between riak_kv stopping and riak_pipe stopping during shutdown can be significant, because riak_kv works to close files and shut down safely. This lag was the source of a long-running problem with the loaded_upgrade riak_test module. The test consistently failed because of the results from the mapreduce workload while a node was being shut down. The queries executed successfully, but the results were incomplete: some requests were sent to nodes whose riak_pipe service was still up but whose riak_kv service had stopped. The read requests that made up part of the mapreduce job could not complete on those nodes, so the final results were incomplete and did not meet the expectations of the loaded_upgrade test. The mapreduce work was removed from the test as part of this PR, and the test has passed as expected since then.
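The shutdown window described above can be modelled in a few lines. This is a sketch in Python, not Riak code. It assumes only the ordering stated in this issue (riak_kv registers after riak_pipe and stops before it), and all function names are hypothetical.

```python
# Startup order as described in this issue: riak_pipe first, then riak_kv.
STARTUP_ORDER = ["riak_pipe", "riak_kv"]


def shutdown_states(startup_order):
    """List the sets of services still up at each step of a shutdown.

    Assumes services stop in reverse startup order, which matches the
    riak_kv-then-riak_pipe shutdown sequence described above.
    """
    up = list(startup_order)
    states = [set(up)]
    while up:
        up.pop()  # the most recently started service stops first
        states.append(set(up))
    return states


def can_serve_mapreduce(up):
    """A mapreduce job needs riak_pipe for plumbing and riak_kv for reads."""
    return {"riak_pipe", "riak_kv"} <= up


def misrouted_states(check_service):
    """States in which the node looks up to the planner but cannot do the work."""
    return [
        up
        for up in shutdown_states(STARTUP_ORDER)
        if check_service in up and not can_serve_mapreduce(up)
    ]


# Checking riak_pipe leaves a window where requests reach a node without riak_kv.
print(misrouted_states("riak_pipe"))
# Checking riak_kv closes that window in this model.
print(misrouted_states("riak_kv"))
```

In this model the only problematic state is the one where riak_pipe is up and riak_kv is not, which is the state the loaded_upgrade test kept hitting.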

jonmeredith commented 10 years ago

My hero.

bashopatricia commented 8 years ago

create jira issue