Alpha test of OpenBench for lc0

killerducky commented 6 years ago

@DanielUranga aka fersberry has let me set up an Alpha test of OpenBench for lc0. This will allow us to formalize the testing of changes to code and/or parameters. @mooskagh is working on a longer-term solution, but until then we can use Andy Grant's OpenBench with some small changes.

Working:

Auto git clone and build of arbitrary git SHAs from github
Auto download of network file
lc0 vs lc0 SPRT tests

Limitations for now:

Only supports linux, and maybe only Nvidia GPUs
Only supports lc0 vs lc0 SPRT tests (lc0 vs A/B engine TBD).
Does not support scaling yet, so all machines, fast or slow, use the same time controls.
Names of tests are the branch or SHA being tested. For a branch this isn't too bad, but for a SHA it's not good. We need an extra field to allow users to give a short title for the test, and maybe a longer description field also.

We need a few Alpha testers with Linux and Nvidia GPU to test. To help, create an account on the server:

http://ec2-34-217-73-2.us-west-2.compute.amazonaws.com:8000

Next, run the following (note the temporary bug workaround, mkdir Engines):

pip3 install bs4 lxml
git clone -b lc0 https://github.com/killerducky/OpenBench
cd OpenBench/Client
mkdir Engines
python3 OpenBench.py -U username -P password -S http://ec2-34-217-73-2.us-west-2.compute.amazonaws.com:8000 -T 2

The client will upload results every 10 games. To stop the client just ctrl-c it.

The first test is of @DanielUranga's minimaxtake2 branch: https://github.com/LeelaChessZero/lc0/pull/243

killerducky commented 6 years ago

Setting up a test:

Dev/Base Branch can be a git branch name, or an arbitrary git SHA.
Dev/Base Bench is not used yet, just put a fake number 1 there.
Dev/Base Options
- threads must be first
- UCI options, not command line.
- Use "_" instead of " "
- Network is the network number; that network is auto-downloaded from http://testserver.lczero.org
Admin must grant permissions to users to create and approve tests. We could have e.g. ~10 creators and ~3 approvers.
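
As a rough illustration of the Dev/Base Options rules above, here is a small Python sketch that builds an options string. The helper name `format_options`, the `Name=value` layout separated by spaces, and the exact spelling of `Threads` and `Network` are assumptions for illustration, not something the OpenBench form is confirmed to require. The extra option name is the UCI name quoted later in this thread.

```python
def format_options(threads, network, extra=None):
    """Build an options string following the rules listed above.

    Assumption: options are written as Name=value pairs separated by
    spaces. Only the ordering and underscore rules come from the thread.
    """
    # Threads must come first; Network is just the network number.
    parts = [f"Threads={threads}", f"Network={network}"]
    for name, value in (extra or {}).items():
        # UCI option names can contain spaces; use "_" instead of " ".
        parts.append(f"{name.replace(' ', '_')}={value}")
    return " ".join(parts)


print(format_options(2, 10965, {"Aversion to search if change unlikely": 0}))
# Threads=2 Network=10965 Aversion_to_search_if_change_unlikely=0
```

Check the string against what the server actually accepts before relying on it.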

screenshot from 2018-08-20 19-50-53

DanielUranga commented 6 years ago

Colab OpenBench client script: https://colab.research.google.com/gist/DanielUranga/8396193da473bfa2f1366fe89344335b/lc0-openbench.ipynb

Videodr0me commented 6 years ago

Fantastic. This is much needed.

killerducky commented 6 years ago

Estimating the cost of SPRT tests: https://gist.github.com/AndyGrant/fdcb97d8deeb2540d6108741aac3a016

usage: estimate.py [-h] [--alpha ALPHA] [--beta BETA] [--lower LOWER]
                   [--upper UPPER]
                   elo_diff draw_rate

python estimate.py 10 .6 --lower=0 --upper=5 --alpha=0.95 --beta=0.95

The first column is the actual Elo diff between the two engines. Games is how many games it takes until SPRT finishes, assuming that Elo diff.

Elo diff	[0,5] Games	[0,10] Games
-15	2676	1221
-10	3830	1683
-5	6731	2709
0	27768	6942
2.5	49385	31717
5	13064	12334
10	5286	3264
15	3313	1880
20	2411	1320
25	1895	1016
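
For readers who want to see what an SPRT stopping check looks like, here is a minimal Python sketch. It is not Andy Grant's estimate.py. It uses a common normal approximation to the log-likelihood ratio computed from win/draw/loss counts, and the error rates `alpha = beta = 0.05` and all function names are assumptions chosen for illustration.

```python
import math


def expected_score(elo):
    """Expected score for a given Elo advantage (logistic model)."""
    return 1.0 / (1.0 + 10.0 ** (-elo / 400.0))


def sprt_bounds(alpha=0.05, beta=0.05):
    """Wald's lower and upper stopping bounds for the LLR."""
    lower = math.log(beta / (1.0 - alpha))
    upper = math.log((1.0 - beta) / alpha)
    return lower, upper


def sprt_llr(wins, draws, losses, elo0, elo1):
    """Approximate LLR of H1 (Elo = elo1) against H0 (Elo = elo0)."""
    games = wins + draws + losses
    if games == 0:
        return 0.0
    w = wins / games
    d = draws / games
    score = w + d / 2.0
    # Per-game variance of the result (win = 1, draw = 0.5, loss = 0).
    variance = (w + d / 4.0) - score ** 2
    if variance <= 0.0:
        return 0.0
    s0, s1 = expected_score(elo0), expected_score(elo1)
    return (s1 - s0) * (2.0 * score - s0 - s1) * games / (2.0 * variance)


lower, upper = sprt_bounds()
llr = sprt_llr(wins=120, draws=300, losses=100, elo0=0, elo1=5)
if llr >= upper:
    print("accept H1")
elif llr <= lower:
    print("accept H0")
else:
    print("keep playing")
```

With narrow bounds such as [0,5], each game moves the LLR only slightly, which is why the table shows so many games when the true Elo diff lies between the bounds.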

roy7 commented 6 years ago

Some extra instructions.

pip install bs4 lxml
mkdir Networks

gonzalezjo commented 6 years ago

Instructions should say Client, not client. should exist

Auto-download of networks file is broken if the Networks folder does not exist in Client.
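
Until the client creates these folders itself, a small Python sketch like the following could create them before the client starts. The helper name `ensure_client_dirs` is hypothetical, and the sketch assumes both `Engines` and `Networks` live directly inside the `Client` directory, as the comments in this thread suggest.

```python
import os


def ensure_client_dirs(client_dir="."):
    """Create the Engines and Networks folders if they are missing.

    Returns the list of folders that had to be created.
    """
    created = []
    for name in ("Engines", "Networks"):
        path = os.path.join(client_dir, name)
        if not os.path.isdir(path):
            os.makedirs(path)
            created.append(path)
    return created


# Example (path is an assumption about where the repo was cloned):
# ensure_client_dirs("OpenBench/Client")
```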

roy7 commented 6 years ago

./cutechess: error while loading shared libraries: libQtCore.so.4: cannot open shared object file: No such file or directory

Prob since I'm using ssh. Should cutechess-cli be used when there's no terminal? Or doesn't it matter if there's no display available?

DanielUranga commented 6 years ago

@roy7 run apt install libqtcore4; that will work on Debian-based systems.

killerducky commented 6 years ago

mps19 reports:

10583 > 10965 by 60 Elo vs A/B Engines
10965 > 10583 by 30 Elo in self play

That's a 90 Elo difference in results of vs A/B Engines compared to self-play. I'd like to replicate this in OpenBench.

   # PLAYER            :  RATING  ERROR  POINTS  PLAYED   (%)  CFS(%)
   1 Stockfish 9       :    3446     26   135.0     214  63.1      95
   2 lc0 10583         :    3411     37   114.5     181  63.3      82
   3 Stockfish 8       :    3393     25   123.5     224  55.1      61
   4 lc0_tcec 10520    :    3385     45    57.5      94  61.2      55
   5 lc0 10520         :    3382     35    74.5     124  60.1      55
   6 lc0 10751         :    3379     41    78.0     130  60.0      51
   7 lc0 10663         :    3378     54    55.5     102  54.4      62
   8 lc0 10852         :    3365     54    41.0      70  58.6      66
   9 lc0 10780         :    3353     33   122.5     216  56.7      54
  10 lc0 10965         :    3351     41    79.5     140  56.8      86
  11 lc0 10925         :    3322     40    54.5     104  52.4      83
  12 lc0 594           :    3296     46    50.5     104  48.6      77
  13 Stockfish 5       :    3278     25    76.0     185  41.1      66
  14 Ethereal 10.55    :    3271     30    79.0     208  38.0      55
  15 Ethereal 10.81    :    3268     41    60.0     160  37.5      57
  16 Andscacs 0.94     :    3264     38    73.5     194  37.9      78
  17 lc0 10161         :    3244     47    46.0     104  44.2      98
  18 Laser 1.6         :    3188     35    48.0     184  26.1     ---

   # PLAYER            :  RATING  ERROR  POINTS  PLAYED   (%)  CFS(%)
   1 lc0 10965         :     163     74    46.0      80  57.5      78
   2 lc0 10751         :     139     70    60.0     112  53.6      59
   3 lc0 10583         :     133     58    97.5     177  55.1      73
   4 lc0 10520         :     118     54    88.0     166  53.0      55
   5 lc0 10663         :     114     65    52.0     102  51.0      59
   6 lc0 10780         :     108     71    49.5      98  50.5      57
   7 lc0 10925         :     101     93    19.5      40  48.8      54
   8 lc0 10852         :      97     81    24.5      50  49.0      62
   9 lc0_tcec 10520    :      82     87    15.0      30  50.0      80
  10 lc0 594           :      45     55    62.5     151  41.4      95
  11 lc0 10161         :       0   ----    29.5      82  36.0     ---

Test conditions:
TC: 30+1 seconds
Machine: 8 core Xeon 2.2 GHz with P100 GPU
Leela ratio: 0.78
Book: Perfect2017, 8 ply, No TB
Resign rule: 5 moves above 700 cp for both engines
Draw rule: 5 moves within 5 cp after move 40 for both engines
Result computed with ordo, where AB programs are given its CCRL 40/40 ratings with error bars as loose anchors

Head to head stats:

 2) lc0 10583      3411 :    181 (+65,=99,-17),  63.3 %

    vs.                  :  games (  +,  =,  -),   (%) :   Diff,   SD, CFS (%)
    Stockfish 9          :     26 (  5, 18,  3),  53.8 :    -35,   21,    5.0
    Stockfish 8          :     38 (  7, 23,  8),  48.7 :    +18,   19,   82.3
    Stockfish 5          :     23 (  8, 14,  1),  65.2 :   +133,   20,  100.0
    Ethereal 10.55       :     26 ( 10, 13,  3),  63.5 :   +140,   22,  100.0
    Ethereal 10.81       :     20 (  5, 14,  1),  60.0 :   +143,   24,  100.0
    Andscacs 0.94        :     26 ( 13, 12,  1),  73.1 :   +147,   22,  100.0
    Laser 1.6            :     22 ( 17,  5,  0),  88.6 :   +223,   22,  100.0

10) lc0 10965      3351 :    140 (+44,=71,-25),  56.8 %

    vs.                  :  games (  +,  =,  -),   (%) :   Diff,   SD, CFS (%)
    Stockfish 9          :     20 (  3,  8,  9),  35.0 :    -95,   20,    0.0
    Stockfish 8          :     20 (  3, 12,  5),  45.0 :    -42,   22,    2.9
    Stockfish 5          :     20 (  4, 13,  3),  52.5 :    +72,   21,  100.0
    Ethereal 10.55       :     20 ( 11,  6,  3),  70.0 :    +79,   23,  100.0
    Ethereal 10.81       :     20 (  6, 13,  1),  62.5 :    +82,   25,  100.0
    Andscacs 0.94        :     20 (  6, 11,  3),  57.5 :    +87,   22,  100.0
    Laser 1.6            :     20 ( 11,  8,  1),  75.0 :   +163,   23,  100.0

 1) lc0 10965      163 :     80 (+17,=58,-5),  57.5 %

    vs.                  :  games (  +,  =, -),   (%) :   Diff,   SD, CFS (%)
    lc0 10751            :     20 (  1, 17, 2),  47.5 :    +24,   32,   77.7
    lc0 10583            :     20 (  6, 14, 0),  65.0 :    +30,   30,   83.9
    lc0 10520            :     20 (  3, 15, 2),  52.5 :    +45,   26,   95.5
    lc0 594              :     20 (  7, 12, 1),  65.0 :   +118,   33,  100.0

 3) lc0 10583      133 :    177 (+40,=115,-22),  55.1 %

    vs.                  :  games (  +,   =,  -),   (%) :   Diff,   SD, CFS (%)
    lc0 10965            :     20 (  0,  14,  6),  35.0 :    -30,   30,   16.1
    lc0 10751            :     21 (  4,  16,  1),  57.1 :     -6,   27,   41.4
    lc0 10520            :     26 (  2,  18,  6),  42.3 :    +15,   25,   72.8
    lc0 10663            :     16 (  2,  13,  1),  53.1 :    +19,   29,   74.3
    lc0 10780            :     26 (  2,  22,  2),  50.0 :    +25,   27,   82.2
    lc0 10925            :     10 (  4,   3,  3),  55.0 :    +33,   44,   77.0
    lc0 10852            :     10 (  3,   6,  1),  60.0 :    +37,   37,   83.9
    lc0_tcec 10520       :      6 (  1,   5,  0),  58.3 :    +51,   42,   88.8
    lc0 594              :     16 (  8,   6,  2),  68.8 :    +88,   28,   99.9
    lc0 10161            :     26 ( 14,  12,  0),  76.9 :   +133,   30,  100.0
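
To make the (%) column in the head-to-head tables easier to check, here is a short Python sketch that recomputes a score from win/draw/loss counts and converts it to a naive Elo difference with the logistic model. The function names are made up for illustration. The naive Elo will not match the Diff column, because ordo fits all games together (with the anchors described in the test conditions) rather than pair by pair.

```python
import math


def score_fraction(wins, draws, losses):
    """Points divided by games, counting a draw as half a point."""
    games = wins + draws + losses
    return (wins + 0.5 * draws) / games


def naive_elo(score):
    """Elo difference implied by a score (only valid for 0 < score < 1)."""
    return 400.0 * math.log10(score / (1.0 - score))


# lc0 10583 against the A/B field: 181 games (+65, =99, -17).
s = score_fraction(65, 99, 17)
print(f"{100 * s:.1f}%")   # 63.3%, matching the table
print(round(naive_elo(s)))
```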

killerducky commented 6 years ago

Some tests are getting timeouts. It looks like most of them are from Google Colab machines (Raincloud and xyzzy). I set up a master vs master test with no changes, and Raincloud got some timeouts: http://ec2-34-217-73-2.us-west-2.compute.amazonaws.com:8000/viewTest/10/

My machine is also getting some timeouts:
http://ec2-34-217-73-2.us-west-2.compute.amazonaws.com:8000/viewTest/8/ - 10/100
http://ec2-34-217-73-2.us-west-2.compute.amazonaws.com:8000/viewTest/7/ - 48/920
http://ec2-34-217-73-2.us-west-2.compute.amazonaws.com:8000/viewTest/4/ - 1/140

These are some related PRs:
https://github.com/LeelaChessZero/lc0/pull/277
https://github.com/LeelaChessZero/lc0/pull/243
https://github.com/LeelaChessZero/lc0/pull/276

killerducky commented 6 years ago

I couldn't reproduce timeouts on my machine. But I did see it using up to 68ms of the 100ms move-overhead buffer. So there's a good chance it was exceeded in a few games. And maybe Google Colab machines are more vulnerable.

Edit: Ok I have reproduced timeouts now by setting "Aversion to search if change unlikely"=0 aka --futile-search-aversion aka FMA. This makes it so every move does the full search time without exiting early. Since that could happen for basically any move where 2 or more moves are likely, it's just random luck that default FMA and default --move-overhead haven't timed out on my machine.

killerducky commented 5 years ago

No near term plans to work on this. Closing for now.

LeelaChessZero / lc0

Alpha test of OpenBench for lc0 #284