alreadydone / lz

Go engine with no human-provided knowledge, modeled after the AlphaGo Zero paper.
GNU General Public License v3.0

tensor-accum-0.17/dev+/uniform questions/discussions #87

Open Umsturz opened 5 years ago

Umsturz commented 5 years ago

Hi, I tried to build your fastexit-tensor-accum+ branch on Ubuntu 16.04, following the steps in the readme (copied below the error message). The build fails with the following errors. Any idea how to fix this?

```
cmake --build .
[  3%] Built target gtest
[  7%] Built target gtest_main
[  9%] Building CXX object CMakeFiles/objs.dir/src/UCTSearch.cpp.o
lz/src/UCTSearch.cpp:268:45: warning: unused parameter ‘thread_num’ [-Wunused-parameter]
     int thread_num) {
                  ^
lz/src/UCTSearch.cpp: In member function ‘int UCTSearch::think(int, UCTSearch::passflag_t)’:
lz/src/UCTSearch.cpp:860:18: error: converting to ‘std::queue<std::unique_ptr >’ from initializer list would use explicit constructor ‘std::queue<_Tp, _Sequence>::queue(_Sequence&&) [with _Tp = std::unique_ptr; _Sequence = std::deque<std::unique_ptr, std::allocator<std::unique_ptr > >]’
     backup_queue = {};
                  ^
lz/src/UCTSearch.cpp: In member function ‘void UCTSearch::ponder()’:
lz/src/UCTSearch.cpp:944:18: error: converting to ‘std::queue<std::unique_ptr >’ from initializer list would use explicit constructor ‘std::queue<_Tp, _Sequence>::queue(_Sequence&&) [with _Tp = std::unique_ptr; _Sequence = std::deque<std::unique_ptr, std::allocator<std::unique_ptr > >]’
     backup_queue = {};
                  ^
At global scope:
cc1plus: warning: unrecognized command line option ‘-Wno-mismatched-tags’
cc1plus: warning: unrecognized command line option ‘-Wno-ignored-attributes’
CMakeFiles/objs.dir/build.make:254: recipe for target 'CMakeFiles/objs.dir/src/UCTSearch.cpp.o' failed
make[2]: *** [CMakeFiles/objs.dir/src/UCTSearch.cpp.o] Error 1
CMakeFiles/Makefile2:143: recipe for target 'CMakeFiles/objs.dir/all' failed
make[1]: *** [CMakeFiles/objs.dir/all] Error 2
Makefile:149: recipe for target 'all' failed
make: *** [all] Error 2
```

Build instructions from the readme:

```
sudo apt install clinfo && clinfo

git clone https://github.com/gcp/leela-zero
cd leela-zero
git submodule update --init --recursive

sudo apt install libboost-dev libboost-program-options-dev libboost-filesystem-dev opencl-headers ocl-icd-libopencl1 ocl-icd-opencl-dev zlib1g-dev

mkdir build && cd build
cmake ..
cmake --build .
./tests
curl -O https://zero.sjeng.org/best-network
./leelaz --weights best-network
```

alreadydone commented 5 years ago

Yeah, some compilers can't deal with this. I suggest changing `backup_queue = {};` to `while (!backup_queue.empty()) { backup_queue.pop(); }` in both think() and ponder(). Not sure whether this is less efficient.
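For illustration, a minimal sketch of the workaround; the element type `BackupData` here is a hypothetical stand-in, not the actual type used in UCTSearch.cpp:

```cpp
#include <memory>
#include <queue>

// Hypothetical stand-in for the real element type of backup_queue.
struct BackupData {};

std::queue<std::unique_ptr<BackupData>> backup_queue;

void clear_backup_queue() {
    // "backup_queue = {};" is rejected by older GCC/Clang because the
    // std::queue(_Sequence&&) constructor is explicit, so copy-list-
    // initialization from {} fails. Popping in a loop sidesteps that.
    while (!backup_queue.empty()) {
        backup_queue.pop();
    }
}
```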

Umsturz commented 5 years ago

May I ask what compiler you are using? I have now tried with gcc 5.4 and clang 3.8. Even after changing `backup_queue = {};` there is a new error with both compilers.

```
In file included from lz/src/OpenCL.cpp:36:
lz/src/OpenCL.h:77:22: error: implicit instantiation of undefined template 'std::atomic'
        std::atomic m_occupied{0};
                    ^
/usr/bin/../lib/gcc/x86_64-linux-gnu/5.4.0/../../../../include/c++/5.4.0/bits/atomic_base.h:126:12: note: template is declared here
    struct atomic;
           ^
In file included from lz/src/OpenCL.cpp:36:
lz/src/OpenCL.h:78:22: error: implicit instantiation of undefined template 'std::atomic'
        std::atomic idle_count{0};
                    ^
/usr/bin/../lib/gcc/x86_64-linux-gnu/5.4.0/../../../../include/c++/5.4.0/bits/atomic_base.h:126:12: note: template is declared here
    struct atomic;
           ^
```

alreadydone commented 5 years ago

I think the error indicates you need `#include <atomic>` in OpenCL.h. People have compiled successfully on Ubuntu before; gcc 8.1.0 seems to be working.
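As a minimal sketch of the fix (the enclosing class name and the `int` template argument are assumptions here, taken only from the member names and `{0}` initializers in the error output):

```cpp
// OpenCL.h (abbreviated): including <atomic> provides the definition of
// std::atomic, which this toolchain does not pull in transitively.
#include <atomic>

class ThreadData {  // hypothetical enclosing class for illustration
public:
    std::atomic<int> m_occupied{0};
    std::atomic<int> idle_count{0};
};
```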

Umsturz commented 5 years ago

Thank you, it works with `#include <atomic>`. Unfortunately, with multiple GPUs I don't see an improvement in n/s.

alreadydone commented 5 years ago

Thank you for testing! There definitely remains work to be done. Can you tell me what GPUs you have, which other branches (gcp/next, ihavnoid/batch-full, ihavnoid/tensorcore, or others?) you are comparing my branch with, and what parameters (--batchsize, -t) you are using in each case?

alreadydone commented 5 years ago

You may now try https://github.com/alreadydone/lz/tree/tensor-accum-dev+. Tested on Google Cloud: 15270 pos/s with 4xV100, 256x19 net, and command ./leelaz -w ../../990.gz --batchsize 12 --gpu 0 --gpu 1 --gpu 2 --gpu 3 --benchmark -v 200000 --worker 4

38865 n/s, 27054 pos/s with 8xV100, 256x19 net, and command ./leelaz --gpu 0 --gpu 1 --gpu 2 --gpu 3 --gpu 4 --gpu 5 --gpu 6 --gpu 7 --worker 3 --batchsize 32 --benchmark -v 200000 -w ../../990.gz

(both with 24vCPUs)

You can specify --batchsize and --worker separately for each GPU, e.g. for two GPUs (--gpu 0 --gpu 1) you can add --batchsize 12 --batchsize 16 --worker 3 --worker 2, etc. The -t parameter has no effect with this branch; the number of threads is simply the sum of worker threads over all GPUs.
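For illustration only, here is a hypothetical sketch (not the branch's actual option-handling code) of how such repeatable per-GPU options can be collected with Boost.Program_options, which the project already depends on:

```cpp
#include <boost/program_options.hpp>
#include <iostream>
#include <vector>

int main(int argc, char* argv[]) {
    namespace po = boost::program_options;
    po::options_description desc("Options");
    // A vector-valued option may be given multiple times; each occurrence
    // appends one entry, so the i-th value pairs with the i-th --gpu.
    desc.add_options()
        ("gpu", po::value<std::vector<int>>(), "OpenCL device to use (repeatable)")
        ("batchsize", po::value<std::vector<int>>(), "batch size per GPU (repeatable)")
        ("worker", po::value<std::vector<int>>(), "worker threads per GPU (repeatable)");
    po::variables_map vm;
    po::store(po::parse_command_line(argc, argv, desc), vm);
    po::notify(vm);

    if (vm.count("worker")) {
        // e.g. "--gpu 0 --gpu 1 --worker 3 --worker 2" yields workers = {3, 2};
        // the total thread count is then the sum over all GPUs.
        const auto workers = vm["worker"].as<std::vector<int>>();
        int total = 0;
        for (const auto w : workers) total += w;
        std::cout << "total worker threads: " << total << "\n";
    }
    return 0;
}
```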

Umsturz commented 5 years ago

Looks very promising! I will look into it during the weekend.

By the way, with so many readouts, is there a way to increase exploration?


alreadydone commented 5 years ago

A bug has been fixed in the tensor-accum-dev+ branch.

An experimental branch that gradually pushes the policy towards uniform as visits increase, to widen the search and help find blind spots, is https://github.com/alreadydone/lz/tree/tensor-accum-uniform (based on tensor-accum-dev+). Two parameters are added: when a position's visit count reaches the value of --uniform-visits (default 1,000,000), all moves are considered equal in terms of policy. Below that value, the policy gradually drifts towards uniform as visits accrue. The parameter --exponent (default 1) controls how fast the policy drifts; with --exponent 0 the policy does not drift gradually but is uniform from the start. To recover the original behavior, set --uniform-visits to a very large number and leave --exponent untouched. This is inspired by some recent discussions, e.g. at https://github.com/LeelaChessZero/lc0/issues/743
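A minimal sketch of how such a blend could look, assuming the interpolation weight is (visits / uniform_visits)^exponent capped at 1; the actual formula and code in the branch may differ:

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Hypothetical helper: blend the network policy towards a uniform
// distribution as the position's visit count grows.
std::vector<float> blend_towards_uniform(const std::vector<float>& policy,
                                         double visits,
                                         double uniform_visits,  // --uniform-visits
                                         double exponent) {      // --exponent
    // w = 0 keeps the original policy, w = 1 is fully uniform.
    // exponent 0 gives w = 1 everywhere (always uniform); a huge
    // uniform_visits keeps w near 0 (original behavior).
    const double w = std::min(1.0, std::pow(visits / uniform_visits, exponent));
    const float uniform = 1.0f / static_cast<float>(policy.size());
    std::vector<float> blended(policy.size());
    for (size_t i = 0; i < policy.size(); ++i) {
        blended[i] = static_cast<float>((1.0 - w) * policy[i] + w * uniform);
    }
    return blended;
}
```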

Ishinoshita commented 5 years ago

@alreadydone That's really nice! Progressive squashing is even better than any of my fix formulas... Just this morning I pushed 100k playouts on an empty board with the LZ200 net, on my old PC (CPU only), just to find, after a long while, that only 4-4 and 3-4 had gotten visits. Your fix will definitely help. Thank you. I will learn how to compile so that I can play with it.

Umsturz commented 5 years ago

So I tried tensor-accum-uniform. There is no need for `#include <atomic>` anymore, right? I compiled it with `#include <atomic>` first, and even though it compiled, leelaz threw some error on startup. But without the include it worked. Do I need anything else?

For benchmarking I start leelaz and send "genmove B". I tried two different sets of parameters, without using --uniform-visits and --exponent:

A) ./tau_leelaz -w best-network.gz -t 64 --gpu 0 --gpu 1 --gpu 2 --gpu 3 --gpu 4 --gpu 5 --gpu 6 --gpu 7 --worker 8 --batchsize 64

B) ./tau_leelaz -w best-network.gz -t 64 --gpu 0 --gpu 1 --gpu 2 --gpu 3 --gpu 4 --gpu 5 --gpu 6 --gpu 7 --worker 3 --batchsize 32

The first game with A) started with B playing Tengen (K10). Quite interesting, to say the least. The second game with A) also started with B playing Tengen (K10), and White liked to play 5-4 first and then enclose the corner with 3-4.

The first and second games with B) looked normal, with the same opening the current nets like to play: all 4-4 points and a 6-3 approach, later a double approach.

With leela #207 (40x256) and A), I get ca. 25000-27000 n/s for the first genmove B. With B), I get ca. 21000-24000 n/s for the first genmove B.

What confuses me a little bit is the GPU utilization. During the first genmove B, "nvidia-smi -l" shows the following utilization: 0%/14%/43%/0%/28%/13%/0%/44% (just an example, but I tested this a couple of times and only some GPUs are utilized while others stay at 0%; maybe because of bad timing at the beginning by nvidia-smi). After issuing the following commands and waiting until they finished: genmove B, genmove W, genmove B, genmove W, genmove B, the utilization of all GPUs jumps to 99% and stays there, even without issuing any further commands. Could it be because of pondering?

Sometimes when exiting leelaz with "exit" it throws a segmentation fault (core dumped).

All in all it looks very promising (1.4x improvement), but Tengen makes me a bit skeptical ; )

alreadydone commented 5 years ago

> After issuing the following commands ... the utilization for all GPUs jumps to 99% and stays there, even without issuing any further commands. Could it be because of pondering?

Umsturz commented 5 years ago

I experimented a little more. It seems that the uniform branch really finds some moves normal Leela (0.16) does not find. But it still takes quite some time before the optimal move is really considered and further investigated. I do not know the specifics, but the recent discussion about LCB makes me wonder if LCB + uniform would improve performance even more? Could LCB easily be combined with uniform? Or maybe you already did...?

alreadydone commented 5 years ago

Just pushed https://github.com/alreadydone/lz/tree/tensor-accum-uniform-0.17; https://github.com/alreadydone/lz/tree/tensor-accum-0.17 was pushed a few days ago. Both have the official 0.17 release merged in, including LCB.

Umsturz commented 5 years ago

Thank you for the update. The new version with 0.17 seems to have some problem, because GPU utilization is now only around 30-40%. Before, GPU utilization was 80-99%. I used --worker 3 --batchsize 32 and also tried lower batch sizes, but the GPUs never go above ~30%. Do I have to adjust the parameters for 0.17?

alreadydone commented 5 years ago

@Umsturz Thanks for the report. The problem is now fixed. In the earlier version, the engine didn't read the batch size from the command line and always set it to 1, due to a glitch in merging.