LeelaChessZero / lc0

The rewritten engine, originally for tensorflow. Now all other backends have been ported here.
GNU General Public License v3.0

Alpha test of OpenBench for lc0 #284

Closed killerducky closed 5 years ago

killerducky commented 6 years ago

@DanielUranga aka fersberry has let me set up an alpha test of OpenBench for lc0. This will allow us to formalize the testing of changes to code and/or parameters. @mooskagh is working on a longer-term solution, but until then we can use Andy Grant's OpenBench with some small changes.

Working:

Limitations for now:

We need a few Alpha testers with Linux and Nvidia GPU to test. To help, create an account on the server:

Next, run the following (note the temporary bug workaround, mkdir Engines):

pip3 install bs4 lxml
git clone -b lc0 https://github.com/killerducky/OpenBench
cd OpenBench/Client
mkdir Engines
python3 OpenBench.py -U username -P password -S http://ec2-34-217-73-2.us-west-2.compute.amazonaws.com:8000 -T 2

The client uploads results every 10 games. To stop the client, press Ctrl-C.

The first test is of @DanielUranga's minimaxtake2 branch: https://github.com/LeelaChessZero/lc0/pull/243

killerducky commented 6 years ago

Setting up a test:

[Screenshot from 2018-08-20 19-50-53]

DanielUranga commented 6 years ago

Colab OpenBench client script: https://colab.research.google.com/gist/DanielUranga/8396193da473bfa2f1366fe89344335b/lc0-openbench.ipynb

Videodr0me commented 6 years ago

Fantastic. This is much needed.

killerducky commented 6 years ago

Estimating the cost of SPRT tests: https://gist.github.com/AndyGrant/fdcb97d8deeb2540d6108741aac3a016

usage: estimate.py [-h] [--alpha ALPHA] [--beta BETA] [--lower LOWER]
                   [--upper UPPER]
                   elo_diff draw_rate

python estimate.py 10 .6 --lower=0 --upper=5 --alpha=0.95 --beta=0.95

The first column is the actual Elo difference at which the test runs. Each Games column is the expected number of games until SPRT finishes at that Elo difference, for the bounds shown in the column header.

Elo diff   [0,5] Games   [0,10] Games
     -15          2676           1221
     -10          3830           1683
      -5          6731           2709
       0         27768           6942
     2.5         49385          31717
       5         13064          12334
      10          5286           3264
      15          3313           1880
      20          2411           1320
      25          1895           1016
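For readers who want to see what an SPRT stopping rule looks like, here is a minimal sketch. It is not the code from the gist linked above. It uses a commonly used normal approximation to the log-likelihood ratio, and the function names, the logistic Elo model, and the default error rates (alpha = beta = 0.05) are all assumptions made for illustration.

```python
import math


def expected_score(elo):
    """Expected score under the logistic Elo model (assumed model)."""
    return 1.0 / (1.0 + 10.0 ** (-elo / 400.0))


def sprt_llr(wins, draws, losses, elo0, elo1):
    """Approximate log-likelihood ratio of H1 (elo1) against H0 (elo0)."""
    n = wins + draws + losses
    if n == 0:
        return 0.0
    w, d = wins / n, draws / n
    score = w + d / 2.0
    # Per-game variance of the score: E[X^2] - E[X]^2 with X in {1, 0.5, 0}.
    variance = (w + d / 4.0) - score ** 2
    if variance <= 0.0:
        return 0.0
    s0, s1 = expected_score(elo0), expected_score(elo1)
    return (s1 - s0) * (2.0 * score - s0 - s1) / (2.0 * variance / n)


def sprt_bounds(alpha=0.05, beta=0.05):
    """Wald's stopping thresholds for the given error rates (assumed values)."""
    lower = math.log(beta / (1.0 - alpha))
    upper = math.log((1.0 - beta) / alpha)
    return lower, upper


def sprt_status(wins, draws, losses, elo0=0.0, elo1=5.0, alpha=0.05, beta=0.05):
    """Return 'accept H1', 'accept H0', or 'continue' for the current results."""
    llr = sprt_llr(wins, draws, losses, elo0, elo1)
    lower, upper = sprt_bounds(alpha, beta)
    if llr >= upper:
        return "accept H1"
    if llr <= lower:
        return "accept H0"
    return "continue"


if __name__ == "__main__":
    # Hypothetical results, not taken from any test in this thread.
    print(sprt_status(600, 3000, 400))
```

The test keeps running while the ratio stays between the two thresholds, which is why a true Elo difference that falls between the bounds (such as 2.5 in the table above) takes the most games to resolve.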

roy7 commented 6 years ago

Some extra instructions.

pip install bs4 lxml
mkdir Networks

gonzalezjo commented 6 years ago

Instructions should say Client, not client. should exist

Auto-download of networks file is broken if the Networks folder does not exist in Client.

roy7 commented 6 years ago

./cutechess: error while loading shared libraries: libQtCore.so.4: cannot open shared object file: No such file or directory

Probably because I'm using ssh. Should cutechess-cli be used when there's no terminal? Or does it not matter if no display is available?

DanielUranga commented 6 years ago

@roy7 run apt install libqtcore4; that will work on Debian-based systems.

killerducky commented 6 years ago

mps19 reports:

That's a 90 Elo difference between the results against A/B engines and the results from self-play. I'd like to replicate this in OpenBench.

   # PLAYER            :  RATING  ERROR  POINTS  PLAYED   (%)  CFS(%)
   1 Stockfish 9       :    3446     26   135.0     214  63.1      95
   2 lc0 10583         :    3411     37   114.5     181  63.3      82
   3 Stockfish 8       :    3393     25   123.5     224  55.1      61
   4 lc0_tcec 10520    :    3385     45    57.5      94  61.2      55
   5 lc0 10520         :    3382     35    74.5     124  60.1      55
   6 lc0 10751         :    3379     41    78.0     130  60.0      51
   7 lc0 10663         :    3378     54    55.5     102  54.4      62
   8 lc0 10852         :    3365     54    41.0      70  58.6      66
   9 lc0 10780         :    3353     33   122.5     216  56.7      54
  10 lc0 10965         :    3351     41    79.5     140  56.8      86
  11 lc0 10925         :    3322     40    54.5     104  52.4      83
  12 lc0 594           :    3296     46    50.5     104  48.6      77
  13 Stockfish 5       :    3278     25    76.0     185  41.1      66
  14 Ethereal 10.55    :    3271     30    79.0     208  38.0      55
  15 Ethereal 10.81    :    3268     41    60.0     160  37.5      57
  16 Andscacs 0.94     :    3264     38    73.5     194  37.9      78
  17 lc0 10161         :    3244     47    46.0     104  44.2      98
  18 Laser 1.6         :    3188     35    48.0     184  26.1     ---

   # PLAYER            :  RATING  ERROR  POINTS  PLAYED   (%)  CFS(%)
   1 lc0 10965         :     163     74    46.0      80  57.5      78
   2 lc0 10751         :     139     70    60.0     112  53.6      59
   3 lc0 10583         :     133     58    97.5     177  55.1      73
   4 lc0 10520         :     118     54    88.0     166  53.0      55
   5 lc0 10663         :     114     65    52.0     102  51.0      59
   6 lc0 10780         :     108     71    49.5      98  50.5      57
   7 lc0 10925         :     101     93    19.5      40  48.8      54
   8 lc0 10852         :      97     81    24.5      50  49.0      62
   9 lc0_tcec 10520    :      82     87    15.0      30  50.0      80
  10 lc0 594           :      45     55    62.5     151  41.4      95
  11 lc0 10161         :       0   ----    29.5      82  36.0     ---
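One plausible reading of the 90 Elo figure, sketched from the two rating lists above: compare the gap between lc0 10583 and lc0 10965 in each list. This interpretation is an assumption, since the report does not say which pair of networks the figure refers to. The ratings are copied from the tables, and the helper name is hypothetical.

```python
# Ratings copied from the two tables above.
vs_ab_engines = {"lc0 10583": 3411, "lc0 10965": 3351}
self_play = {"lc0 10583": 133, "lc0 10965": 163}


def rating_gap(ratings, a, b):
    """Rating of player a minus rating of player b within one rating list."""
    return ratings[a] - ratings[b]


gap_vs_ab = rating_gap(vs_ab_engines, "lc0 10583", "lc0 10965")  # 60
gap_self_play = rating_gap(self_play, "lc0 10583", "lc0 10965")  # -30
swing = gap_vs_ab - gap_self_play                                # 90

if __name__ == "__main__":
    print(gap_vs_ab, gap_self_play, swing)
```

Under this reading, lc0 10583 is 60 Elo ahead of lc0 10965 against the A/B engines but 30 Elo behind it in self-play, which gives the 90 Elo swing.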

Test conditions:
TC: 30+1 seconds
Machine: 8 core Xeon 2.2 GHz with P100 GPU
Leela ratio: 0.78
Book: Perfect2017, 8 ply, No TB
Resign rule: 5 moves above 700 cp for both engines
Draw rule: 5 moves within 5 cp after move 40 for both engines
Result computed with ordo, where AB programs are given their CCRL 40/40 ratings with error bars as loose anchors

Head to head stats:

 2) lc0 10583      3411 :    181 (+65,=99,-17),  63.3 %

    vs.                  :  games (  +,  =,  -),   (%) :   Diff,   SD, CFS (%)
    Stockfish 9          :     26 (  5, 18,  3),  53.8 :    -35,   21,    5.0
    Stockfish 8          :     38 (  7, 23,  8),  48.7 :    +18,   19,   82.3
    Stockfish 5          :     23 (  8, 14,  1),  65.2 :   +133,   20,  100.0
    Ethereal 10.55       :     26 ( 10, 13,  3),  63.5 :   +140,   22,  100.0
    Ethereal 10.81       :     20 (  5, 14,  1),  60.0 :   +143,   24,  100.0
    Andscacs 0.94        :     26 ( 13, 12,  1),  73.1 :   +147,   22,  100.0
    Laser 1.6            :     22 ( 17,  5,  0),  88.6 :   +223,   22,  100.0

10) lc0 10965      3351 :    140 (+44,=71,-25),  56.8 %

    vs.                  :  games (  +,  =,  -),   (%) :   Diff,   SD, CFS (%)
    Stockfish 9          :     20 (  3,  8,  9),  35.0 :    -95,   20,    0.0
    Stockfish 8          :     20 (  3, 12,  5),  45.0 :    -42,   22,    2.9
    Stockfish 5          :     20 (  4, 13,  3),  52.5 :    +72,   21,  100.0
    Ethereal 10.55       :     20 ( 11,  6,  3),  70.0 :    +79,   23,  100.0
    Ethereal 10.81       :     20 (  6, 13,  1),  62.5 :    +82,   25,  100.0
    Andscacs 0.94        :     20 (  6, 11,  3),  57.5 :    +87,   22,  100.0
    Laser 1.6            :     20 ( 11,  8,  1),  75.0 :   +163,   23,  100.0

 1) lc0 10965      163 :     80 (+17,=58,-5),  57.5 %

    vs.                  :  games (  +,  =, -),   (%) :   Diff,   SD, CFS (%)
    lc0 10751            :     20 (  1, 17, 2),  47.5 :    +24,   32,   77.7
    lc0 10583            :     20 (  6, 14, 0),  65.0 :    +30,   30,   83.9
    lc0 10520            :     20 (  3, 15, 2),  52.5 :    +45,   26,   95.5
    lc0 594              :     20 (  7, 12, 1),  65.0 :   +118,   33,  100.0

 3) lc0 10583      133 :    177 (+40,=115,-22),  55.1 %

    vs.                  :  games (  +,   =,  -),   (%) :   Diff,   SD, CFS (%)
    lc0 10965            :     20 (  0,  14,  6),  35.0 :    -30,   30,   16.1
    lc0 10751            :     21 (  4,  16,  1),  57.1 :     -6,   27,   41.4
    lc0 10520            :     26 (  2,  18,  6),  42.3 :    +15,   25,   72.8
    lc0 10663            :     16 (  2,  13,  1),  53.1 :    +19,   29,   74.3
    lc0 10780            :     26 (  2,  22,  2),  50.0 :    +25,   27,   82.2
    lc0 10925            :     10 (  4,   3,  3),  55.0 :    +33,   44,   77.0
    lc0 10852            :     10 (  3,   6,  1),  60.0 :    +37,   37,   83.9
    lc0_tcec 10520       :      6 (  1,   5,  0),  58.3 :    +51,   42,   88.8
    lc0 594              :     16 (  8,   6,  2),  68.8 :    +88,   28,   99.9
    lc0 10161            :     26 ( 14,  12,  0),  76.9 :   +133,   30,  100.0
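The POINTS and (%) columns in the listings above can be recomputed from the (+, =, -) counts. The sketch below does that and also converts a score fraction to an Elo difference with the standard logistic formula. The function names are hypothetical, and the conversion is only a per-pairing approximation: the Diff column comes from ordo's fitted ratings, so the two will not match in general.

```python
import math


def points(wins, draws, losses):
    """Points scored: one per win, half per draw."""
    return wins + 0.5 * draws


def score_fraction(wins, draws, losses):
    """Fraction of available points scored."""
    return points(wins, draws, losses) / (wins + draws + losses)


def elo_from_score(score):
    """Elo difference implied by a score fraction under the logistic model."""
    if not 0.0 < score < 1.0:
        raise ValueError("score must be strictly between 0 and 1")
    return -400.0 * math.log10(1.0 / score - 1.0)


if __name__ == "__main__":
    # lc0 10583 against the A/B engines: (+65, =99, -17) over 181 games.
    s = score_fraction(65, 99, 17)
    print(points(65, 99, 17), round(100 * s, 1), round(elo_from_score(s)))
```

For lc0 10583 this reproduces the 114.5 points and 63.3 % shown in its summary line.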

killerducky commented 6 years ago

Some tests are getting timeouts. It looks like most of them are from Google Colab machines (Raincloud and xyzzy). I set up a master vs master test with no changes, and Raincloud got some timeouts: http://ec2-34-217-73-2.us-west-2.compute.amazonaws.com:8000/viewTest/10/

My machine is also getting some timeouts:
http://ec2-34-217-73-2.us-west-2.compute.amazonaws.com:8000/viewTest/8/ - 10/100
http://ec2-34-217-73-2.us-west-2.compute.amazonaws.com:8000/viewTest/7/ - 48/920
http://ec2-34-217-73-2.us-west-2.compute.amazonaws.com:8000/viewTest/4/ - 1/140
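To put those counts on a common scale, here is a small sketch that turns them into rates. It assumes each pair is timeouts out of games played, which the comment does not state explicitly, and the dictionary keys are just shorthand for the test URLs.

```python
# (timeouts, games) per test, read from the counts above.
# Assumption: the first number is timeouts and the second is games played.
timeout_counts = {
    "viewTest/8": (10, 100),
    "viewTest/7": (48, 920),
    "viewTest/4": (1, 140),
}


def timeout_rate(timeouts, games):
    """Fraction of games that ended in a timeout."""
    return timeouts / games


def pooled_rate(counts):
    """Timeout rate over all tests combined."""
    total_timeouts = sum(t for t, _ in counts.values())
    total_games = sum(g for _, g in counts.values())
    return total_timeouts / total_games


if __name__ == "__main__":
    for name, (t, g) in timeout_counts.items():
        print(f"{name}: {100 * timeout_rate(t, g):.1f}%")
    print(f"pooled: {100 * pooled_rate(timeout_counts):.1f}%")
```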

These are some related PRs:
https://github.com/LeelaChessZero/lc0/pull/277
https://github.com/LeelaChessZero/lc0/pull/243
https://github.com/LeelaChessZero/lc0/pull/276

killerducky commented 6 years ago

See also https://github.com/LeelaChessZero/lc0/issues/105

I couldn't reproduce timeouts on my machine, but I did see it using up to 68 ms of the 100 ms move-overhead buffer, so there's a good chance the buffer was exceeded in a few games. Google Colab machines may be more vulnerable.

Edit: OK, I have now reproduced timeouts by setting "Aversion to search if change unlikely"=0 (aka --futile-search-aversion, aka FMA). This makes every move use its full search time without exiting early. Since that can happen on basically any move where 2 or more moves are likely, it's just luck that the default FMA and default --move-overhead haven't timed out on my machine.
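The reasoning above can be illustrated with a deliberately simplified toy model. It is not lc0's time manager: it assumes the engine may think for at most the remaining clock minus the move-overhead buffer, and that a move loses on time when thinking plus the real latency exceeds the clock. All function names and the sample numbers other than 68 ms and 100 ms are hypothetical.

```python
def overhead_headroom_ms(move_overhead_ms, observed_latency_ms):
    """Buffer left after the observed latency; negative means the buffer was exceeded."""
    return move_overhead_ms - observed_latency_ms


def max_think_ms(clock_ms, move_overhead_ms):
    """Longest search the toy model allows: the clock minus the overhead buffer."""
    return max(0, clock_ms - move_overhead_ms)


def loses_on_time(clock_ms, think_ms, latency_ms):
    """True if search time plus real latency runs past the remaining clock."""
    return think_ms + latency_ms > clock_ms


if __name__ == "__main__":
    clock_ms, overhead_ms = 1000, 100  # hypothetical clock, 100 ms buffer

    # Observed worst case from the comment above: 68 ms used of 100 ms.
    print(overhead_headroom_ms(overhead_ms, 68))

    full_search = max_think_ms(clock_ms, overhead_ms)
    # Full search with a hypothetical 120 ms latency spike: exceeds the clock.
    print(loses_on_time(clock_ms, full_search, 120))
    # The same spike after an early exit (hypothetical 400 ms search): safe.
    print(loses_on_time(clock_ms, 400, 120))
```

In this model an early exit leaves slack that absorbs a latency spike, while a full-length search has only the buffer to rely on, which is consistent with timeouts appearing once early exit is disabled.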

killerducky commented 5 years ago

No near term plans to work on this. Closing for now.