Queueing RolloutRequests uses 22% - 25% of GameSearcher thread time

arr28 commented 10 years ago

Queueing RolloutRequests from the GameSearcher thread to the RolloutProcessor threads takes 22% - 25% of that thread's time. This is true when using either LinkedBlockingQueue or ArrayBlockingQueue (which is widely held to be the fastest standard implementation).

LMAX Disruptor makes grand claims about its performance at this task. I'm going to put it through its paces in our environment and see what happens.

arr28 commented 10 years ago

See reference material for implementation.

(Wishful thinking: Perhaps it will reduce the wide variability in timed performance - thus allowing better sample size determination.)

SteveDraper commented 10 years ago

I'm amazed the overhead is so high - do you know why (or have any pointers to good information sources you've found)?

Steve

On Tue, May 6, 2014 at 4:03 PM, Andrew Rose notifications@github.comwrote:

See reference material for implementationhttps://github.com/LMAX-Exchange/disruptor/wiki/Getting-Started .

(Wishful thinking: Perhaps it will reduce the wide variability in timed performance - thus allowing better sample size determination.)

— Reply to this email directly or view it on GitHubhttps://github.com/SteveDraper/ggp-base/issues/50#issuecomment-42358571 .

SteveDraper commented 10 years ago

Ok, I've scanned the paper and it makes sense. I'm still amazed the overheads are as high as you found them to be, but its a good thing to find! Although I doubt our environment will benefit to the degree they suggest it should be worth something. Let me know how you get on.

I'll be pushing a bunch more fast animator changes later, which will be the final batch for now. I found a nice further optimization that has given us another 25% or so, but I'll summarize in email later after I'm done testing and have closed the feature.

Steve

On Tue, May 6, 2014 at 4:01 PM, Andrew Rose notifications@github.comwrote:

Queueing RolloutRequests from the GameSearcher thread to the RolloutProcessor threads takes 22% - 25% of that thread's time. This is true when using either LinkedBlockingQueue or ArrayBlockingQueue (which is widely held to be the fastest standard implementation).

LMAX Disruptor makes grand claims about its performance at this task. I'm going to put it through its paces in our environment and see what happens.

— Reply to this email directly or view it on GitHubhttps://github.com/SteveDraper/ggp-base/issues/50 .

arr28 commented 10 years ago

This is probably a false alarm. I suspect that queueing was taking so long because the queue was full and applying back pressure. I'm going to try again with sample size forced to 1 so that the rollout threads should be mostly idle and the queue will be always empty.

arr28 commented 10 years ago

A very quick test of my underpowered system (which isn't really suitable for performance measurements) shows that, when doing just a single sample per rollout request...

ABQ.put() takes ~16% of the search thread's CPU time
Disruptor takes ~6%

SteveDraper commented 10 years ago

Not huge, but enough to be worthwhile I think, especially as the gain probably goes up with more threads and/or more stalling due to mismatched rollouts/search rates.

Steve

On Wed, May 7, 2014 at 4:18 AM, Andrew Rose notifications@github.comwrote:

A very quick test of my underpowered system (which isn't really suitable for performance measurements) shows that, when doing just a single sample per rollout request...

ABQ.put() takes ~16% of the search thread's CPU time

Disruptor takes ~6%

— Reply to this email directly or view it on GitHubhttps://github.com/SteveDraper/ggp-base/issues/50#issuecomment-42405618 .

arr28 commented 10 years ago

On my regular system, after 20 moves of Alquerque (15s/move), just doing a single rollout per request...

ABQ takes ~13% of the search thread's CPU time
Disruptor takes ~3%

Looks like Disruptor is definitely worthwhile. A few things to sort out.

Allow proper disabling of greedy rollouts (rather than always disabling them).
Allow the rollout processors to be stopped.
Re-instate proper thread naming, daemonizing, etc. for the rollout threads

And some longer term possibles to mention.

Potentially remove the RolloutRequestRef hack and do proper pre-allocation and re-use of RolloutRequest objects (which should be a further perf. win). However, if we do that, we'll also need to use the Disruptor for passing the object back to the main thread for updating the stats - and there isn't really a convenient way of doing that with the Disruptor codebase.
Use the raw disruptor primitives (or just their techniques with our own code) to avoid multi-casting the request to all threads and having most threads ignore most of the requests.

SteveDraper commented 10 years ago

Sounds good. This cycle should see a decent overall performance improvement all round, and we've managed to focus on that pretty well.

On Wed, May 7, 2014 at 3:25 PM, Andrew Rose notifications@github.comwrote:

On my regular system, after 20 moves of Alquerque (15s/move), just doing a single rollout per request...

ABQ takes ~13% of the search thread's CPU time

Disruptor takes ~3%

Looks like Disruptor is definitely worthwhile. A few things to sort out.

Allow proper disabling of greedy rollouts (rather than always disabling them).

Allow the rollout processors to be stopped.

And some longer term possibles to mention.

Potentially remove the RolloutRequestRef hack and do proper pre-allocation and re-use of RolloutRequest objects (which should be a further perf. win). However, if we do that, we'll also need to use the Disruptor for passing the object back to the main thread for updating the stats - and there isn't really a convenient way of doing that with the Disruptor codebase.

Use the raw disruptor primitives (or just their techniques with our own code) to avoid multi-casting the request to all threads and having most threads ignore most of the requests.

— Reply to this email directly or view it on GitHubhttps://github.com/SteveDraper/ggp-base/issues/50#issuecomment-42478651 .

arr28 commented 10 years ago

Bad news - having tidied this all up, I've found a problem. The method I was using works well when the rollout processors are mostly idle, but works poorly when we keep them busy. I suspect that's all down to the choice of wait strategy in the Disruptor. There are 4 well-documented alternatives.

In my early testing, I found that BlockingWaitStrategy performed worse than ABQ.

I then switched to SleepingWaitStragegy. This works really well when the rollout threads are basically idle. It doesn't matter so much if it takes them a while to notice that they've got work to do, because they can still just about complete it without becoming a bottleneck. However, when using a sensible sample weight, the time taken for the rollout thread to notice the work contributes to the total work rate of the system. I didn't have time to fully investigate last night, but I saw that the 3 rollout threads were 50:50 doing useful work vs sleeping and the search thread was 20:80 useful:blocking.

The documentation quotes LockSupport.parkNanos(1) (at the heart of the SleepingWaitStrategy) as taking 60us on a typical Linux system but a quick test on my (underpowered) Windows laptop this morning shows it takes ~1200us == 1.2ms. If this was the only processing, it would cap the MCTS rate to 800 iterations per second (somewhat alleviated by the Disruptor's batch processor).

I think I previously tried with YieldingWaitStrategy and saw that it didn't work well, but I should probably have another go. Thread.yield() takes 70ns on my laptop.

arr28 commented 10 years ago

So, turns out that YieldingWaitStrategy is just fine. Doesn't use the Disruptor directly because it isn't quite flexible enough for our scenario (in particular, passing back to the original thread for the final phase of processing) but I've used their technique to good effect.

SanchoGGP / ggp-base

Queueing RolloutRequests uses 22% - 25% of GameSearcher thread time #50