gchq / stroom

Stroom is a highly scalable data storage, processing and analysis platform.
https://gchq.github.io/stroom-docs/
Apache License 2.0

Killing a master node results in dangling tasks #2330

Open p-kimberley opened 3 years ago

p-kimberley commented 3 years ago

Problem

Killing a master node leaves any unprocessed stream tasks in an unprocessed state. If these tasks had not been assigned to a node before the master was killed, they stay unprocessed, with no evident way to process them.

Configuration

Procedure

  1. Kick off a stream processor job with approx. 60 streams
  2. Wait for all nodes to begin processing one or more tasks
  3. Kill the elected master node
  4. Observe that all tasks currently being processed reach "Complete" status
  5. Observe that the remaining tasks stay at "Unprocessed" status
  6. Wait for the original master node to re-join and be re-elected
  7. Observe how the unassigned tasks still do not get processed

At this point, the remaining tasks are usually not assigned to any node. If this is the case, there doesn't appear to be a way of getting them to process.

Sometimes the tasks are assigned to a node, but they do not get processed straight away. If a node is then rebooted, the remaining tasks get re-assigned and processing commences.

Expected behaviour

If a master node is killed and a master election occurs, the new master node should assign tasks and have them start processing.
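
To make the expected behaviour concrete, here is a minimal sketch of what a newly elected master could do with orphaned tasks. It is not taken from the Stroom code base: `TaskReallocationSketch`, `StreamTask`, `Status` and `onMasterElected` are all hypothetical names, and the round-robin assignment is only an assumption for illustration.

```java
import java.util.List;

// Hypothetical sketch: none of these names come from the Stroom code base.
class TaskReallocationSketch {

    enum Status { UNPROCESSED, ASSIGNED, COMPLETE }

    static final class StreamTask {
        final long id;
        Status status = Status.UNPROCESSED;
        String node; // null while the task is not assigned to any node

        StreamTask(long id) {
            this.id = id;
        }
    }

    /**
     * Intended to be called by the newly elected master. Every incomplete task
     * that has no owner, or whose owner is not in the list of live nodes, is
     * handed to a live node in round-robin order.
     *
     * @return the number of tasks that were (re)assigned
     */
    static int onMasterElected(List<StreamTask> tasks, List<String> liveNodes) {
        if (liveNodes.isEmpty()) {
            return 0; // no node can take work; leave the tasks untouched
        }
        int assigned = 0;
        for (StreamTask task : tasks) {
            if (task.status == Status.COMPLETE) {
                continue; // finished work is never reassigned
            }
            boolean orphaned = task.node == null || !liveNodes.contains(task.node);
            if (orphaned) {
                task.node = liveNodes.get(assigned % liveNodes.size());
                task.status = Status.ASSIGNED;
                assigned++;
            }
        }
        return assigned;
    }
}
```

In this sketch, tasks already owned by a live node are left alone, so a task that is assigned but slow to start (the second case described above) would need separate handling.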

Affected versions

master and possibly v5

p-kimberley commented 3 years ago

Subsequent tests have succeeded, so further investigation is needed to identify the conditions that cause this behaviour. The number of tasks in the sample was fairly small, so further testing should use a greater number of streams in an attempt to reproduce the issue.

at055612 commented 1 year ago

@stroomdev66 I seem to remember a change going in to re-allocate unassigned tasks? If so, can this be closed?