Factual / skuld

Distributed task tracking system.
300 stars 13 forks source link

Make tasks instantly claimable on vnode leaders #96

Closed eric closed 10 years ago

eric commented 10 years ago

Right now a newly elected vnode will not have any available tasks until the scanner runs to add them to the queue.

If leaders are unstable, it will cause no tasks to ever be claimable. I believe that the stress tests are demonstrating that currently.

I attempted (wrongly) to fix it in #80 but that would cause a full database rescan for a vnode every time it is elected (which seems inefficient).

I added retries to all claims in tests in #92, but it has not resolved the problem.

We need to come up with some sort of queue data structure that will allow us to present a single ordered queue that encompasses all of the vnodes on a node that are currently the leader.

We've discussed some of this here:

Skew binomial heaps have O(1) merge and log(n) dequeue/update, but it might be easiest just to have N prioqueues and do an O(vnodes) scan over them.

eric commented 10 years ago

I realized that right now we have some degree of randomization as to what task is claimed first: The node that was asked for a claim tries to first service the claim from the local vnodes for which it the leader. This could cause higher priority tasks to not get completed first if a client continues to ask the same node for work.

With that realization that it already isn't exact, I would like to propose a simple solution to this problem as a first step forward:

Have a queue per vnode and when a claim comes in, iterate over a shuffled list of vnodes and attempt to claim from each of them.

aphyr commented 10 years ago

WFM! I suggest using rand-nth, then a linear probe around the set of vnodes.

eric commented 10 years ago

I was planning on using loop and shuffle like claim! does so I could try all of the vnodes on a node.

aphyr commented 10 years ago

That's fine.

aphyr commented 10 years ago

(claim does that because it's going across the network--you're talking to local services so the latency costs will be less variable)

eric commented 10 years ago

This is in master now.