Closed killerducky closed 5 years ago
Setting up a test:
Colab OpenBench client script: https://colab.research.google.com/gist/DanielUranga/8396193da473bfa2f1366fe89344335b/lc0-openbench.ipynb
Fantastic. This is much needed.
Estimating the cost of SPRT tests: https://gist.github.com/AndyGrant/fdcb97d8deeb2540d6108741aac3a016
usage: estimate.py [-h] [--alpha ALPHA] [--beta BETA] [--lower LOWER]
[--upper UPPER]
elo_diff draw_rate
python estimate.py 10 .6 --lower=0 --upper=5 --alpha=0.95 --beta=0.95
The first column is the actual Elo difference the test runs at. Each Games column is the estimated number of games until SPRT finishes at that Elo difference, for the bounds given in the column header.
Elo diff | [0,5] Games | [0,10] Games |
---|---|---|
-15 | 2676 | 1221 |
-10 | 3830 | 1683 |
-5 | 6731 | 2709 |
0 | 27768 | 6942 |
2.5 | 49385 | 31717 |
5 | 13064 | 12334 |
10 | 5286 | 3264 |
15 | 3313 | 1880 |
20 | 2411 | 1320 |
25 | 1895 | 1016 |
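For anyone who wants to see the moving parts behind a table like this, here is a minimal Python sketch of the standard SPRT ingredients: the logistic Elo-to-score conversion, a win/draw/loss split for a fixed draw rate, Wald's stopping bounds, and the log-likelihood ratio. This is not the code from estimate.py and will not reproduce its numbers. The function names are made up here, and alpha/beta are treated as error rates (0.05), which is an assumption and may not match how the script interprets its --alpha/--beta flags.

```python
import math


def expected_score(elo_diff: float) -> float:
    """Logistic Elo model: expected score for a given Elo difference."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))


def wdl_probs(elo_diff: float, draw_rate: float) -> tuple[float, float, float]:
    """Split the expected score into (win, draw, loss) for a fixed draw rate.

    Assumption: the draw rate is held constant regardless of Elo difference.
    """
    win = expected_score(elo_diff) - draw_rate / 2.0
    loss = 1.0 - draw_rate - win
    if win < 0.0 or loss < 0.0:
        raise ValueError("draw rate too high for this Elo difference")
    return win, draw_rate, loss


def sprt_bounds(alpha: float = 0.05, beta: float = 0.05) -> tuple[float, float]:
    """Wald's log-likelihood-ratio stopping bounds (alpha, beta are error rates)."""
    lower = math.log(beta / (1.0 - alpha))
    upper = math.log((1.0 - beta) / alpha)
    return lower, upper


def llr(wins: int, draws: int, losses: int,
        elo0: float, elo1: float, draw_rate: float) -> float:
    """Log-likelihood ratio of H1 (elo1) against H0 (elo0) for observed results."""
    w0, d0, l0 = wdl_probs(elo0, draw_rate)
    w1, d1, l1 = wdl_probs(elo1, draw_rate)
    return (wins * math.log(w1 / w0)
            + draws * math.log(d1 / d0)
            + losses * math.log(l1 / l0))
```

The test stops as soon as the running LLR crosses either bound: above the upper bound H1 is accepted, below the lower bound H0 is accepted. The closer the true Elo difference sits to the middle of the [lower, upper] Elo window, the slower the LLR drifts, which is why the 2.5 row is the most expensive one in the table.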
Some extra instructions.
pip install bs4 lxml
mkdir Networks
Instructions should say Client, not client. The Networks folder should exist.
Auto-download of networks file is broken if the Networks folder does not exist in Client.
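Until the auto-download is fixed, the workaround is simply to create the folder first. A minimal Python sketch of that check, assuming the folder is named Networks and lives under the Client directory as described above (the helper name is hypothetical and not part of OpenBench):

```python
from pathlib import Path


def ensure_networks_dir(client_dir: str = "Client") -> Path:
    """Create <client_dir>/Networks if it is missing and return its path."""
    networks = Path(client_dir) / "Networks"
    # parents=True also creates client_dir if needed;
    # exist_ok=True makes repeated calls harmless.
    networks.mkdir(parents=True, exist_ok=True)
    return networks
```

Calling it once before starting the client has the same effect as the manual mkdir.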
./cutechess: error while loading shared libraries: libQtCore.so.4: cannot open shared object file: No such file or directory
Probably because I'm using ssh. Should cutechess-cli be used when there's no terminal? Or does it not matter if no display is available?
@roy7 run apt install libqtcore4; that will work on Debian-based systems.
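To see every missing library at once rather than one error per run, the output of ldd can be scanned for "not found" entries. A small Python sketch, assuming the usual glibc ldd output format (`name => not found`); the function names are made up for illustration:

```python
import subprocess


def missing_libs(ldd_output: str) -> list[str]:
    """Return the library names that ldd reports as 'not found'."""
    missing = []
    for line in ldd_output.splitlines():
        name, sep, target = line.partition("=>")
        if sep and target.strip() == "not found":
            missing.append(name.strip())
    return missing


def check_binary(path: str) -> list[str]:
    """Run ldd on a binary (Linux only) and list its unresolved libraries."""
    result = subprocess.run(["ldd", path], capture_output=True, text=True)
    return missing_libs(result.stdout)
```

For the case above, `check_binary("./cutechess")` would be the call to try; any name it returns still has to be mapped to a package by hand.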
mps19 reports:
That's a 90 Elo difference between results vs A/B engines and results in self-play. I'd like to replicate this in OpenBench.
# PLAYER : RATING ERROR POINTS PLAYED (%) CFS(%)
1 Stockfish 9 : 3446 26 135.0 214 63.1 95
2 lc0 10583 : 3411 37 114.5 181 63.3 82
3 Stockfish 8 : 3393 25 123.5 224 55.1 61
4 lc0_tcec 10520 : 3385 45 57.5 94 61.2 55
5 lc0 10520 : 3382 35 74.5 124 60.1 55
6 lc0 10751 : 3379 41 78.0 130 60.0 51
7 lc0 10663 : 3378 54 55.5 102 54.4 62
8 lc0 10852 : 3365 54 41.0 70 58.6 66
9 lc0 10780 : 3353 33 122.5 216 56.7 54
10 lc0 10965 : 3351 41 79.5 140 56.8 86
11 lc0 10925 : 3322 40 54.5 104 52.4 83
12 lc0 594 : 3296 46 50.5 104 48.6 77
13 Stockfish 5 : 3278 25 76.0 185 41.1 66
14 Ethereal 10.55 : 3271 30 79.0 208 38.0 55
15 Ethereal 10.81 : 3268 41 60.0 160 37.5 57
16 Andscacs 0.94 : 3264 38 73.5 194 37.9 78
17 lc0 10161 : 3244 47 46.0 104 44.2 98
18 Laser 1.6 : 3188 35 48.0 184 26.1 ---
# PLAYER : RATING ERROR POINTS PLAYED (%) CFS(%)
1 lc0 10965 : 163 74 46.0 80 57.5 78
2 lc0 10751 : 139 70 60.0 112 53.6 59
3 lc0 10583 : 133 58 97.5 177 55.1 73
4 lc0 10520 : 118 54 88.0 166 53.0 55
5 lc0 10663 : 114 65 52.0 102 51.0 59
6 lc0 10780 : 108 71 49.5 98 50.5 57
7 lc0 10925 : 101 93 19.5 40 48.8 54
8 lc0 10852 : 97 81 24.5 50 49.0 62
9 lc0_tcec 10520 : 82 87 15.0 30 50.0 80
10 lc0 594 : 45 55 62.5 151 41.4 95
11 lc0 10161 : 0 ---- 29.5 82 36.0 ---
Test conditions:
TC: 30+1 seconds
Machine: 8 core Xeon 2.2 GHz with P100 GPU
Leela ratio: 0.78
Book: Perfect2017, 8 ply, No TB
Resign rule: 5 moves above 700 cp for both engines
Draw rule: 5 moves within 5 cp after move 40 for both engines
Result computed with ordo, where AB programs are given their CCRL 40/40 ratings with error bars as loose anchors
Head to head stats:
2) lc0 10583 3411 : 181 (+65,=99,-17), 63.3 %
vs. : games ( +, =, -), (%) : Diff, SD, CFS (%)
Stockfish 9 : 26 ( 5, 18, 3), 53.8 : -35, 21, 5.0
Stockfish 8 : 38 ( 7, 23, 8), 48.7 : +18, 19, 82.3
Stockfish 5 : 23 ( 8, 14, 1), 65.2 : +133, 20, 100.0
Ethereal 10.55 : 26 ( 10, 13, 3), 63.5 : +140, 22, 100.0
Ethereal 10.81 : 20 ( 5, 14, 1), 60.0 : +143, 24, 100.0
Andscacs 0.94 : 26 ( 13, 12, 1), 73.1 : +147, 22, 100.0
Laser 1.6 : 22 ( 17, 5, 0), 88.6 : +223, 22, 100.0
10) lc0 10965 3351 : 140 (+44,=71,-25), 56.8 %
vs. : games ( +, =, -), (%) : Diff, SD, CFS (%)
Stockfish 9 : 20 ( 3, 8, 9), 35.0 : -95, 20, 0.0
Stockfish 8 : 20 ( 3, 12, 5), 45.0 : -42, 22, 2.9
Stockfish 5 : 20 ( 4, 13, 3), 52.5 : +72, 21, 100.0
Ethereal 10.55 : 20 ( 11, 6, 3), 70.0 : +79, 23, 100.0
Ethereal 10.81 : 20 ( 6, 13, 1), 62.5 : +82, 25, 100.0
Andscacs 0.94 : 20 ( 6, 11, 3), 57.5 : +87, 22, 100.0
Laser 1.6 : 20 ( 11, 8, 1), 75.0 : +163, 23, 100.0
1) lc0 10965 163 : 80 (+17,=58,-5), 57.5 %
vs. : games ( +, =, -), (%) : Diff, SD, CFS (%)
lc0 10751 : 20 ( 1, 17, 2), 47.5 : +24, 32, 77.7
lc0 10583 : 20 ( 6, 14, 0), 65.0 : +30, 30, 83.9
lc0 10520 : 20 ( 3, 15, 2), 52.5 : +45, 26, 95.5
lc0 594 : 20 ( 7, 12, 1), 65.0 : +118, 33, 100.0
3) lc0 10583 133 : 177 (+40,=115,-22), 55.1 %
vs. : games ( +, =, -), (%) : Diff, SD, CFS (%)
lc0 10965 : 20 ( 0, 14, 6), 35.0 : -30, 30, 16.1
lc0 10751 : 21 ( 4, 16, 1), 57.1 : -6, 27, 41.4
lc0 10520 : 26 ( 2, 18, 6), 42.3 : +15, 25, 72.8
lc0 10663 : 16 ( 2, 13, 1), 53.1 : +19, 29, 74.3
lc0 10780 : 26 ( 2, 22, 2), 50.0 : +25, 27, 82.2
lc0 10925 : 10 ( 4, 3, 3), 55.0 : +33, 44, 77.0
lc0 10852 : 10 ( 3, 6, 1), 60.0 : +37, 37, 83.9
lc0_tcec 10520 : 6 ( 1, 5, 0), 58.3 : +51, 42, 88.8
lc0 594 : 16 ( 8, 6, 2), 68.8 : +88, 28, 99.9
lc0 10161 : 26 ( 14, 12, 0), 76.9 : +133, 30, 100.0
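The percentages in these head-to-head rows follow directly from the (+, =, -) counts, and a score can be turned into a rough Elo difference with the logistic formula. A Python sketch for sanity-checking a row; ordo fits all games jointly, so its Diff column need not match this pairwise number, and the helper names are illustrative only:

```python
import math


def score(wins: int, draws: int, losses: int) -> float:
    """Fraction of points scored, counting a draw as half a point."""
    games = wins + draws + losses
    return (wins + 0.5 * draws) / games


def elo_from_score(s: float) -> float:
    """Elo difference implied by a score under the logistic model (0 < s < 1)."""
    return 400.0 * math.log10(s / (1.0 - s))


# lc0 10965 self-play total from the listing above: +17 =58 -5
s = score(17, 58, 5)
print(f"{100 * s:.1f}% -> {elo_from_score(s):+.1f} Elo")
```

With only 20 games per pairing the error bars on any single row are large, so this is mainly useful for checking totals.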
Some tests are getting timeouts. It looks like most of them are from Google Colab machines (Raincloud and xyzzy). I set up a master vs master test with no changes, and Raincloud got some timeouts: http://ec2-34-217-73-2.us-west-2.compute.amazonaws.com:8000/viewTest/10/
My machine is also getting some timeouts:
http://ec2-34-217-73-2.us-west-2.compute.amazonaws.com:8000/viewTest/8/ - 10/100
http://ec2-34-217-73-2.us-west-2.compute.amazonaws.com:8000/viewTest/7/ - 48/920
http://ec2-34-217-73-2.us-west-2.compute.amazonaws.com:8000/viewTest/4/ - 1/140
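Put as rates, the counts above are easier to compare. A quick Python sketch, assuming each pair reads as timeouts/games (the dictionary keys are just the test numbers from the URLs):

```python
# Timeouts and games for each test, as reported above.
# Assumption: "10/100" means 10 timeouts in 100 games.
timeouts = {8: (10, 100), 7: (48, 920), 4: (1, 140)}


def timeout_rate(timed_out: int, games: int) -> float:
    """Fraction of games that ended in a timeout."""
    return timed_out / games


for test_id, (timed_out, games) in timeouts.items():
    rate = timeout_rate(timed_out, games)
    print(f"viewTest/{test_id}: {timed_out}/{games} = {rate:.1%}")
```

That works out to roughly 10%, 5% and under 1%, so the rate varies a lot between tests even on the same machine.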
These are some related PRs:
https://github.com/LeelaChessZero/lc0/pull/277
https://github.com/LeelaChessZero/lc0/pull/243
https://github.com/LeelaChessZero/lc0/pull/276
See also https://github.com/LeelaChessZero/lc0/issues/105
I couldn't reproduce timeouts on my machine. But I did see it using up to 68 ms of the 100 ms move-overhead buffer, so there's a good chance the buffer was exceeded in a few games. Google Colab machines may be more vulnerable.
Edit: OK, I have now reproduced timeouts by setting "Aversion to search if change unlikely"=0 aka --futile-search-aversion aka FMA. With that setting every move uses the full search time without exiting early. Since that could also happen at default settings for basically any move where 2 or more moves are likely, it's just random luck that default FMA and default --move-overhead haven't timed out on my machine.
No near term plans to work on this. Closing for now.
@DanielUranga aka fersberry has let me set up an Alpha test of OpenBench for lc0. This will allow us to formalize the testing of changes to code and/or parameters. @mooskagh is working on a longer term solution, but until then we can use Andy Grant's OpenBench with some small changes.
Working:
Limitations for now:
We need a few Alpha testers with Linux and Nvidia GPU to test. To help, create an account on the server:
Next (note the temporary bug workaround: mkdir Engines)
The client will upload results every 10 games. To stop the client, just Ctrl-C it.
The first test is of @DanielUranga's minimaxtake2 branch: https://github.com/LeelaChessZero/lc0/pull/243