LeelaChessZero / lc0

The rewritten engine, originally for TensorFlow; now all other backends have been ported here.
GNU General Public License v3.0

Skip proactively wake up TaskWorkers for cpu backend #1931

Closed pdeng6 closed 9 months ago

pdeng6 commented 9 months ago

This PR suggests skipping the proactive wake-up of TaskWorkers, since it causes a performance drop when CPU resources are constrained.

Environment:

Hardware: Intel Icelake 2 sockets/80 physical cores/160 logical cores platform

OS: CentOS 8 Stream with kernel version 6.2

Benchmark: ./lc0 benchmark -b eigen --threads=$num_threads -w b30e742bcfd905815e0e7dbd4e1bafb41ade748f85d006b8e28758f1a3107ae3 --num-positions=1

Problems:

With the latest lc0 code, we sometimes observed the benchmark hitting a bad case when CPU resources are constrained, and it is due to the proactive wake-up of TaskWorkers.

When CPU resources are constrained, the proactive wake-up can push the SearchWorker off the CPU while the TaskWorkers stay on it. Unfortunately there is nothing in picking_tasks at that point, so the TaskWorkers busy-wait while effectively starving. Even if they yield, the OS scheduler is more likely to pick the TaskWorkers than the SearchWorker, since they have slept for a long time.
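
To illustrate the failure mode, here is a minimal sketch (not lc0's actual code; all names are made up) of a TaskWorker loop that has been woken while the queue is still empty:

```cpp
// Minimal illustration only (not lc0's actual code; names are made up):
// a TaskWorker proactively woken while picking_tasks is still empty just
// spins between lock/check/yield, consuming a core that the SearchWorker
// needs in order to produce work.
#include <atomic>
#include <deque>
#include <mutex>
#include <thread>

std::mutex picking_tasks_mutex;   // guards the shared task queue
std::deque<int> picking_tasks;    // stand-in for lc0's picking_tasks_
std::atomic<bool> exiting{false};

void TaskWorkerLoop() {
  while (!exiting.load(std::memory_order_acquire)) {
    bool did_work = false;
    {
      std::lock_guard<std::mutex> lock(picking_tasks_mutex);
      if (!picking_tasks.empty()) {
        picking_tasks.pop_front();  // "process" one task
        did_work = true;
      }
    }
    // Woken with nothing to do: yield() is only a hint, so under
    // oversubscription the scheduler may keep re-running this loop
    // instead of the descheduled SearchWorker.
    if (!did_work) std::this_thread::yield();
  }
}
```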

Here, to simulate the CPU-constrained moment, we use taskset to pin the process to a set of core ids whose count equals num_threads, e.g. taskset -c 0-7,80-87 ./lc0 benchmark -b eigen --threads=16 -w b30e742bcfd905815e0e7dbd4e1bafb41ade748f85d006b8e28758f1a3107ae3 --num-positions=1

The results are in Table 1: the bad case drops performance by 60% and 78% compared with the normal case.

Table 1: Performance in CPU constrained moment

|  | --threads=16 | --threads=60 |
| -- | -- | -- |
| Normal case | 1038 | 2129 |
| Bad case | 417 | 476 |
| Bad/Normal - 1 | -60% | -78% |

Bad case: many proactive wake-ups cause TaskWorkers to busy-wait while starving.
Normal case: few proactive wake-ups cause TaskWorkers to busy-wait while starving.

Solution in this Change:

Here we suggest skipping the proactive wake-up: since a thread notification is already sent when a Node is placed in picking_tasks, the TaskWorkers should be woken soon after that anyway.
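
The notification path we rely on is the standard condition-variable pattern; a simplified sketch (not lc0's exact code) looks like:

```cpp
// Simplified sketch of the notify-on-enqueue path (not lc0's exact code):
// workers sleep on a condition variable and are woken only when a task is
// actually placed in the queue, which makes the proactive wake-up redundant.
#include <condition_variable>
#include <deque>
#include <mutex>

std::mutex picking_tasks_mutex;
std::condition_variable task_added;
std::deque<int> picking_tasks;

void EnqueueTask(int task) {
  {
    std::lock_guard<std::mutex> lock(picking_tasks_mutex);
    picking_tasks.push_back(task);
  }
  task_added.notify_one();  // wake a worker only when real work exists
}

int DequeueTaskBlocking() {
  std::unique_lock<std::mutex> lock(picking_tasks_mutex);
  // No busy wait: the worker stays off the CPU until notified.
  task_added.wait(lock, [] { return !picking_tasks.empty(); });
  int task = picking_tasks.front();
  picking_tasks.pop_front();
  return task;
}
```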

With this solution, in the CPU-constrained moment the bad case is avoided and the normal case even improves; see Table 2.

Table 2: Performance in CPU constrained moment when skipping the proactive wake-up

|  | --threads=16, Base | --threads=16, Skip | --threads=16, Skip/Base - 1 | --threads=60, Base | --threads=60, Skip | --threads=60, Skip/Base - 1 |
| -- | -- | -- | -- | -- | -- | -- |
| Normal case | 1038 | 1131 | 9% | 2129 | 2699 | 27% |
| Bad case | 417 | 1131 | 171% | 476 | 2699 | 467% |

When CPU resources are not constrained, this change introduces no regression, according to evaluation with 16 and 60 threads.

Table 3: Performance in CPU not constrained moment when skipping the proactive wake-up

|  | --threads=16, Base | --threads=16, Skip | --threads=16, Skip/Base - 1 | --threads=60, Base | --threads=60, Skip | --threads=60, Skip/Base - 1 |
| -- | -- | -- | -- | -- | -- | -- |
| Score | 2397 | 2413 | 1% | 5485 | 5487 | 0% |

pdeng6 commented 9 months ago

Hi @borg323,

This PR addresses a bad performance case we found; could you please take a look?

Thanks Pan

borg323 commented 9 months ago

For cpu, we recommend setting task workers to 0, and in fact this is what #1930 does. Can you add --task-workers=0 to your tests?

pdeng6 commented 9 months ago

Thanks, I just tried --task-workers=0 with several configurations.
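
For reference, the constrained-case run then becomes the same command as above plus the flag, e.g. taskset -c 0-7,80-87 ./lc0 benchmark -b eigen --threads=16 --task-workers=0 -w b30e742bcfd905815e0e7dbd4e1bafb41ade748f85d006b8e28758f1a3107ae3 --num-positions=1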

Right, it fixes the bad case for cpu under constrained resources, as can be seen in columns C/D/G/H of Table 4.

Table 4: Performance in CPU constrained moment


|  | A: --threads=16, Base+4TaskWorker | B: --threads=16, Skip+4TaskWorker | C: --threads=16, Base+0TaskWorker | D: --threads=16, Skip+0TaskWorker | E: --threads=60, Base+4TaskWorker | F: --threads=60, Skip+4TaskWorker | G: --threads=60, Base+0TaskWorker | H: --threads=60, Skip+0TaskWorker |
| -- | -- | -- | -- | -- | -- | -- | -- | -- |
| Normal case | 1038 | 1131 | 1135 | 1135 | 2129 | 2699 | 2688 | 2690 |
| Improvement to Base+4TaskWorker | - | 9.0% | 9.3% | 9.3% | - | 26.8% | 26.3% | 26.4% |
| Bad case | 417 | 1131 | 1135 | 1135 | 476 | 2699 | 2688 | 2690 |
| Improvement to Base+4TaskWorker | - | 171.2% | 172.2% | 172.2% | - | 467.0% | 464.7% | 465.1% |

For cpu when resources are not constrained, skipping the proactive wake-up plus a task worker count of 0 is slightly better according to my evaluation in Table 5, as can be seen in columns D and H.

Table 5: Performance in CPU not constrained moment

|  | A: --threads=16, Base+4TaskWorker | B: --threads=16, Skip+4TaskWorker | C: --threads=16, Base+0TaskWorker | D: --threads=16, Skip+0TaskWorker | E: --threads=60, Base+4TaskWorker | F: --threads=60, Skip+4TaskWorker | G: --threads=60, Base+0TaskWorker | H: --threads=60, Skip+0TaskWorker |
| -- | -- | -- | -- | -- | -- | -- | -- | -- |
| Score | 2397 | 2413 | 2416 | 2428 | 5485 | 5487 | 5456 | 5703 |
| Improvement to Base+4TaskWorker | - | 0.7% | 0.8% | 1.3% | - | 0.0% | -0.5% | 4.0% |

So I'm wondering, based on https://github.com/LeelaChessZero/lc0/pull/1930, shall we add a condition to the proactive wake-up like this:

```cpp
void SearchWorker::ProactivelyNotifyTaskWorkers() {
  // Skip the wake-up entirely when there are no task workers or the
  // backend runs on cpu.
  if (task_workers_ <= 0 || search_->network_->IsCpu())
    return;

  // While nothing is ready yet - wake the task runners so they are ready to
  // receive quickly.
  Mutex::Lock lock(picking_tasks_mutex_);
  task_added_.notify_all();
}
```

so that we achieve columns G/H of Table 5, and also avoid the bad case where --task-workers is set to a nonzero value. What do you think? I'd be glad to send a patch if it sounds reasonable.

Thanks Pan

borg323 commented 9 months ago

Thanks for testing. The suggested change makes sense, but we should wait until #1930 is accepted, as there may be changes to the interface.

pdeng6 commented 9 months ago

Thanks, looking forward to #1930 landing :)

borg323 commented 9 months ago

BTW, since you have tested this extensively, what do you think about the task workers default in #1930 for the gpu case? For a multi gpu setup there may be one more thread than the number of gpus, so larger values are not out of the question, although many of them will be waiting for the gpu. https://github.com/LeelaChessZero/lc0/blob/f928fc6a91e4a04b22cf73deb0e185ec49a54fe5/src/mcts/search.h#L224-L227

pdeng6 commented 9 months ago

I think your setup is reasonable. As I understand it, task workers aim to speed up the critical region "gather minibatch", so their count should depend on:

1. how much of the work can be parallelized;
2. the CPU resources available.

For the gpu case, the task worker count:

lowest value: -1, for the case where the search worker count already exceeds the CPU core count. Since search workers are busy-waiting, either for the critical region or for the gpu, they are using CPU (please feel free to correct me if it is not busy-waiting for the gpu); in this situation there is no spare CPU, so more task workers won't help.

higher value: allocate the search workers first, then give the remaining cores to task workers (up to 4, which was the default before the change). The only exception is multi-tenancy: if there are other cpu-intensive apps, more task workers may not be as helpful as expected. A rough sketch of this heuristic is below.
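
A hypothetical helper expressing the heuristic above (the name, signature, and structure are my own, not lc0 code):

```cpp
// Hypothetical helper expressing the heuristic above; the name, signature,
// and the cap of 4 (the default before this change) are assumptions, not
// lc0 code, and the special -1 ("lowest value") setting is left out.
#include <algorithm>

int SuggestTaskWorkers(int search_workers, int cpu_cores) {
  // Search workers busy-wait (for the critical region or the gpu), so each
  // effectively occupies a core; if they already cover all cores, extra
  // task workers only add contention.
  const int spare_cores = cpu_cores - search_workers;
  if (spare_cores <= 0) return 0;
  // Otherwise hand the remaining cores to task workers, capped at 4.
  return std::min(spare_cores, 4);
}
```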

pdeng6 commented 9 months ago

@borg323, any plan to land #1930? :)

pdeng6 commented 9 months ago

@borg323 ping~, looks like #1930 is almost good to go?

pdeng6 commented 9 months ago

@borg323

Change updated, could you please take a look?

Thanks Pan

pdeng6 commented 9 months ago

Thanks @borg323. Since I don't have write access, could you please help merge it?

Thanks Pan

pdeng6 commented 9 months ago

@borg323, ready to merge it? thanks! :)

borg323 commented 9 months ago

Was a bit busy the last few days...

pdeng6 commented 9 months ago

Understand, thanks a lot.