So far all my tests use a 100ms increment and base times between 100ms and 10s (to avoid repeated games). I will probably switch to always using a 10s base time as soon as I have opening books for all variants. I think that using conditions similar to the ones on fishtest is reasonable.
I am open to suggestions on time control, SPRT bounds, etc., but we should always keep in mind our very limited resources compared to fishtest.
We could ask the SF dev team to let us use their base at http://tests.stockfishchess.org/tests ... half joke
@ianfab writes:
In the following I am going to summarize my current thinking on testing and the way I have been doing tests so far. Feedback and suggestions are very welcome.
Just like on fishtest, I think there should be a first test on STC, and then a second test to show the scaling to LTC, since we have already seen that not doing LTC tests can cause regressions to pass undetected.
Using the same time controls as fishtest (STC 10s+0.1s, LTC 60s+0.1s) would probably take too long for LTC, so it could make sense to halve or otherwise reduce the time controls. Reducing both STC and LTC could restrict the validity of STC tests, whereas only reducing LTC would make it more difficult to see scaling effects, so I am currently not sure about that. Since I did not have opening books for all variants until two days ago, I have varied the base time (0.1-10s) to avoid repeated games, but used the same increment as STC on fishtest (0.1s).
Since we are testing many variants, I think we need an automatic way to generate the opening books. Since Stockfish might be the only (strong) engine for some variants, it makes sense to use it for generation. Two days ago I started writing such a book generator based on Stockfish, but of course it is still rather experimental. Using it, I generated a set of EPD opening books for the variants supported by Stockfish (excluding Relay chess, since I have not dealt with it yet).
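For illustration, here is a minimal sketch of the general idea (not the actual generator, which is built on Stockfish itself and handles the variants): play a couple of randomized plies followed by a few engine plies and record the resulting positions as EPD lines. The use of python-chess, the engine path, and the ply counts are my own assumptions, and this sketch only covers standard chess.

```python
import random
import chess
import chess.engine

def generate_book(engine_path="./stockfish", lines=100, random_plies=2, engine_plies=6):
    """Sketch of an EPD book generator: randomized opening plies, then engine plies."""
    positions = set()
    engine = chess.engine.SimpleEngine.popen_uci(engine_path)
    try:
        for _ in range(lines):
            board = chess.Board()
            for _ in range(random_plies):          # a little randomness for variety
                if board.is_game_over():
                    break
                board.push(random.choice(list(board.legal_moves)))
            for _ in range(engine_plies):          # then let the engine play sensible moves
                if board.is_game_over():
                    break
                result = engine.play(board, chess.engine.Limit(time=0.05))
                board.push(result.move)
            if not board.is_game_over():
                positions.add(board.epd())         # record the final opening position
    finally:
        engine.quit()
    return sorted(positions)

if __name__ == "__main__":
    with open("book.epd", "w") as f:
        f.write("\n".join(generate_book()) + "\n")
```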
Since I do not know of another way to perform SPRT tests for variant engines, I use my very basic, far-from-perfect testing script. Nevertheless, it at least seems to work quite well in practice so far, since my patches tested with this script improved Stockfish by several hundred Elo in several variants.
So far I have always used [0,20] Elo bounds for SPRT tests. I arrived at those empirically by considering the typical Elo gains of patches and the number of games necessary for such patches to pass an SPRT test. In the beginning almost all patches added new code/ideas and the Elo gain typically was huge (>50 Elo), so this was fine, but since more and more patches are going to be parameter tweaks and simplifications with only small Elo differences, and since Elo differences are decreasing as Stockfish is improving, we have to think about the SPRT bounds. Maybe we could simply use a multiple (3-5?) of the bounds used on fishtest (general [0,5], tweaks [0,4], simplifications [-3,1]).
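For reference, a minimal sketch of how such an SPRT works, using the GSPRT approximation linked further down in this thread. The function names are mine and the actual testing script may compute this differently.

```python
import math

def expected_score(elo):
    # logistic model: Elo difference -> expected score per game
    return 1.0 / (1.0 + 10.0 ** (-elo / 400.0))

def sprt_llr(wins, draws, losses, elo0, elo1):
    """Approximate log-likelihood ratio of H1 (elo >= elo1) vs H0 (elo <= elo0)
    for a W/D/L record, following the GSPRT approximation."""
    n = wins + draws + losses
    if wins == 0 or losses == 0:
        return 0.0
    w, d = wins / n, draws / n
    s = w + 0.5 * d                      # observed mean score per game
    var = w + 0.25 * d - s * s           # observed per-game score variance
    s0, s1 = expected_score(elo0), expected_score(elo1)
    return n * (s1 - s0) * (2 * s - s0 - s1) / (2 * var)

# stopping bounds for error probabilities alpha = beta = 0.05,
# i.e. the familiar (-2.94, 2.94)
alpha = beta = 0.05
lower = math.log(beta / (1 - alpha))     # about -2.94: accept H0
upper = math.log((1 - beta) / alpha)     # about +2.94: accept H1
```

Plugging in the 0.1-10+0.1 result reported further down (W 388, L 317, D 35) with the [0,20] bounds gives an LLR of roughly 3, in line with the reported value; with fishtest's [0,5] bounds the same record gives an LLR of only about 1, far from either stopping bound, which is why narrower bounds need many more games.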
Since LTC tests could take a long time, we could think about using a fixed number of games for LTC to only regression test the changes. Probably time will tell whether lengthy tests on LTC are feasible.
I have written a slightly modified version of Stockfish's SPSA tuner to add variant support. It should be more or less self-explanatory if you know how to use the SPSA tuner for official Stockfish.
Thanks; although I am unfamiliar with SPRT, I believe that is excellent guidance. I studied statistics both at university and in high school; I am simply unfamiliar with this particular heuristic/experiment.
Due to official-stockfish/Stockfish#603 and official-stockfish playing in competitions with "Move Overhead=1000", I think LTC (and possibly STC?) tests should be performed with "Move Overhead=1000" and a minimum increment of 1s. It baffles me that fishtest would use different conditions, although perhaps official-stockfish/Stockfish is less prone to timeouts than this fork?
I have started experimenting with your modified SPSA tuner and submitted a PR addressing most of the confusion I encountered trying to use it.
My ideas, as an Ideas Guy(TM): it is essential to keep results per side (if playing without a book) or results of game pairs with the same start position (if using a large set of start positions). Not doing that results in an overestimation of the statistical error (see http://talkchess.com/forum/viewtopic.php?t=61105&highlight=gsprt ) and a waste of testing resources. Why doesn't mainline Stockfish testing do that? Well, they found enough dupes to contribute their resources.
@sf-x In principle, this is a good point. However, if we use balanced positions and large opening books, the effect should be rather small, I guess. On fishtest, you additionally have to take into account that results might differ on different machines. Furthermore, I am not sure whether you could absorb (an expectation value of) the differences into the LLR bounds or other parameters and simply reinterpret these parameters. So I do not understand enough of it yet to be able to judge whether this would change much in practice.
@ddugovic So far there have been only very few time losses in my tests (reported at the end of a test), probably because my testing script takes the times from the output of the engine and does not measure the thinking time itself. This of course assumes the engine to be honest, but as long as a patch does not change the output of thinking time, this should not make a difference.
I have not done much testing with CuteChess or the like, so I cannot say much about whether there would be time losses.
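To illustrate the mechanism described above (this is only a sketch of the idea, not the actual testing script): the clock can be kept from the engine's own UCI "info ... time <ms>" reports instead of being measured on the tester's side, so only the reported search time is charged.

```python
import re

# The last "time" field in the engine's UCI info output is its own report of
# how long the search took, in milliseconds.
INFO_TIME = re.compile(r"\btime (\d+)\b")

def apply_move_time(remaining_ms, increment_ms, info_lines):
    """Charge the engine for its last self-reported search time, then add the increment."""
    spent_ms = 0
    for line in info_lines:
        match = INFO_TIME.search(line)
        if match:
            spent_ms = int(match.group(1))     # the last report wins
    remaining_ms -= spent_ms
    time_loss = remaining_ms < 0               # flag fell, according to the engine's own report
    return remaining_ms + increment_ms, time_loss
```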
> However, if we use balanced positions and large opening books, the effect should be rather small, I guess.
In the linked thread, for allegedly balanced positions in regular chess it was found that error estimated with the common (broken) method was 1.2x bigger than estimate from game pairs. Is that small?
> On fishtest, you additionally have to take into account that results might differ on different machines.
How can this be taken into account?
> Furthermore, I am not sure whether you could absorb (an expectation value of) the differences into the LLR bounds or other parameters and simply reinterpret these parameters.
???
>> However, if we use balanced positions and large opening books, the effect should be rather small, I guess.
> In the linked thread, for allegedly balanced positions in regular chess it was found that error estimated with the common (broken) method was 1.2x bigger than estimate from game pairs. Is that small?
Well, 20% is something, but it is not a huge difference, and I would be cautious about calling a method that works well "broken", even if it is not entirely correct from a theoretical point of view.
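To make the comparison concrete, here is a minimal sketch of the two error estimates being discussed, assuming game pairs that share a start position with colors reversed (the function name is mine):

```python
import math

def score_error_estimates(pair_scores):
    """Compare the naive per-game error estimate with the paired one.

    pair_scores: list of (score_game1, score_game2) tuples, each score being
    1, 0.5 or 0 from the tested engine's point of view, where both games of a
    pair start from the same opening with colors reversed."""
    games = [s for pair in pair_scores for s in pair]
    n = len(games)
    mean = sum(games) / n

    # naive estimate: treat every single game as independent
    var_game = sum((s - mean) ** 2 for s in games) / (n - 1)
    se_naive = math.sqrt(var_game / n)

    # paired estimate: treat the average score of each game pair as independent
    pair_means = [(a + b) / 2 for a, b in pair_scores]
    m = len(pair_means)
    var_pair = sum((p - mean) ** 2 for p in pair_means) / (m - 1)
    se_paired = math.sqrt(var_pair / m)

    return se_naive, se_paired
```

If the two results within a pair are negatively correlated (for example because an unbalanced opening tends to produce one win and one loss per pair), se_paired comes out smaller than se_naive; the roughly 1.2x ratio quoted above is exactly se_naive / se_paired.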
>> On fishtest, you additionally have to take into account that results might differ on different machines.
> How can this be taken into account?
If you want to take into account correlations, you should consider factors that influence the correlations, right?
>> Furthermore, I am not sure whether you could absorb (an expectation value of) the differences into the LLR bounds or other parameters and simply reinterpret these parameters.
> ???
I do not know whether this applies in this case, but if you can get the same or similar results with the current model by simply redefining existing parameters, a new model does not really make a difference. Then only the interpretation of the parameters, but not the calculation itself would be wrong.
Currently this is my best LTC test command using cutechess-cli:
cutechess-cli -variant crazyhouse -engine cmd=./stockfish-x86_64-bmi2-kingsafety -engine cmd=./stockfish-x86_64-bmi2-master -each proto=uci tc=240/60+10 "option.Move Overhead=1000" -rounds 200 -concurrency 4 -openings file=books/crazyhouse.epd format=epd order=random -repeat -pgnout test.pgn min
@ddugovic While long time controls are desirable, I do not think it is practical to use this time control, because it often requires much more than 200 games to get a statistically significant result.
Because of the recently generated opening books, I have switched to using constant base times. I am currently running tests on #145 with different time controls (5+0.05, 10+0.1, 30+0.3, 60+0.6, 120+1.2) and the new opening book to get an idea of which STC and LTC could be used to test for scaling. Of course, one patch is not that much data, but the patch seems to perform well at short time controls while scaling is very bad, so it is a great test case.
Here are the results of tests I started yesterday with the old settings of varying base times:
0.1-10+0.1
LLR: 3.00 (-2.94,2.94) [0.00,20.00]
Total: 740 W: 388 L: 317 D: 35
0.6-60+0.6
LLR: -3.00 (-2.94,2.94) [0.00,20.00]
Total: 298 W: 118 L: 158 D: 22
I will gradually add other results as soon as tests finish.
5+0.05
LLR: 3.01 (-2.94,2.94) [0.00,20.00]
Total: 766 W: 402 L: 330 D: 34
10+0.1
LLR: -2.99 (-2.94,2.94) [0.00,20.00]
Total: 4652 W: 2250 L: 2166 D: 236
30+0.3
LLR: -2.95 (-2.94,2.94) [0.00,20.00]
Total: 210 W: 79 L: 122 D: 9
60+0.6
LLR: -2.95 (-2.94,2.94) [0.00,20.00]
Total: 276 W: 111 L: 152 D: 13
120+1.2
LLR: -3.05 (-2.94,2.94) [0.00,20.00]
Total: 302 W: 123 L: 165 D: 14
Thanks and that makes sense, and now I'm using that book. (On my own machine in the future I will test for scaling at 30+1 before 60+1; certainly I could test at 30+0.3 etc. to make tests finish slightly faster but I prefer +1 for my own testing.)
> If you want to take into account correlations, you should consider factors that influence the correlations, right?
I have no idea how to measure them. It seems to me that the effect can be adequately explained by assuming that the game results are not identically distributed. The proposed refinement is to assume (still wrongly, but a lot of data would need to be collected to correct for it!) that the SUMS of scores in a game pair are identically distributed.
> However, if we use balanced positions and large opening books, the effect should be rather small, I guess.
I have played 32 game pairs from 32 different positions with @ianfab 's new book (one engine with #153 and other without), and 22 (IIRC) were won by the same side (two White wins or two Black wins). Does anybody else have data for that?
Wouldn't it be possible to fork fishtest and run it "for free" (only the electricity bill)? That way we could run a few different machines for each patch.
@arbolis If I remember correctly fishtest uses cutechess, and cutechess does not support many of the variants yet, but this is changing. It would of course be awesome to have such a testing platform.
@sf-x Sorry, I only skimmed the discussion on talkchess and misunderstood it. I do not have any data for that.
Note that WinBoard supports a 'Monte-Carlo book mode', which can be used with a randomizing engine, or a group of equally strong (or time-handicapped to become equally strong) engines. In this mode it decides the frequency with which moves should be played based on the statistics of their prior results, selects the book move that is most under-played, and lets the engine think (in the hope it will generate a new move) when all moves are played (approximately) with the frequency they deserve. (Note that the latter will always be the case if there is only one move for the position.) Just play a tournament with the additional option -mcBookMode true. You can later convert the PGN file of the tourney to a book. Instead of a randomizing engine, you can also do a number of self-play matches at various time controls. Because WinBoard would be feeding moves that were generated by the engine at one TC to engines playing at another TC it even makes sense to do the same TC several times (with other TCs in between).
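A rough sketch of the selection rule described above. The assumption that a move's "deserved" frequency is proportional to its average score so far is mine; WinBoard's actual rule may differ.

```python
def pick_book_move(stats, total_played):
    """stats: {move: (times_played, score_sum)} for the current position.
    Returns the most under-played move, or None to let the engine think
    (e.g. when every known move is already at or above its deserved share)."""
    if not stats:
        return None
    # assumed rule: deserved share proportional to the move's average score so far
    weights = {m: (s / p if p else 0.5) for m, (p, s) in stats.items()}
    total_weight = sum(weights.values()) or 1.0
    best_move, best_deficit = None, 0.0
    for move, (played, _) in stats.items():
        deserved = weights[move] / total_weight      # target playing frequency
        actual = played / total_played if total_played else 0.0
        deficit = deserved - actual                  # how under-played the move is
        if deficit > best_deficit:
            best_move, best_deficit = move, deficit
    return best_move                                 # None -> let the engine search for a new move
```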
I'll leave this for future reference: http://hardy.uhasselt.be/Toga/GSPRT_approximation.pdf
Should we do some regression tests from time to time? Just to make sure we're heading in the right direction (i.e., Elo increase).
Sure, it makes sense to do regression tests. The main questions are how often and with which time control. Regression tests should probably be based on release versions.
I see. Maybe after X functional patches? Where X could be 10 or so.
I think this is difficult to count, since there are patches for several variants and also upstream changes. I will probably simply use the release versions something like every two or four weeks and go for 1000 games per variant at 10+0.1, where it should take about one or two days to do regression tests for all variants on one machine. Regression tests at longer time controls take much longer, so I will do them less often. If I have time to write an automation script soon, first results should be there in a few days.
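A sketch of what such an automation script could look like. The variant list, binary names, and book paths are placeholders, and the cutechess-cli options simply mirror the command quoted earlier in this thread; the exact game count per round depends on the cutechess-cli version.

```python
import subprocess

# Placeholder lists; the real script would cover all supported variants and books.
VARIANTS = ["crazyhouse", "atomic", "horde"]
GAMES_PER_VARIANT = 1000           # 1000 games per variant, as proposed above
TC = "10+0.1"

def regression_test(new_engine, base_engine):
    for variant in VARIANTS:
        cmd = [
            "cutechess-cli",
            "-variant", variant,
            "-engine", "cmd=" + new_engine,
            "-engine", "cmd=" + base_engine,
            "-each", "proto=uci", "tc=" + TC,
            "-rounds", str(GAMES_PER_VARIANT // 2),   # assuming two games per round with -repeat
            "-repeat",
            "-concurrency", "4",
            "-openings", "file=books/%s.epd" % variant, "format=epd", "order=random",
            "-pgnout", "regression_%s.pgn" % variant,
        ]
        print("Running regression test for", variant)
        subprocess.run(cmd, check=True)

if __name__ == "__main__":
    regression_test("./stockfish-new", "./stockfish-base")
```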
Update: I am now running regression tests for fishnet-091216 vs. fishnet-301116.
Periodically running the SPSA Tuner is probably a good idea as well. I am tuning razor_margin at the moment...
Yes, e.g. I have re-tuned piece values several times (especially in the case of antichess and crazyhouse, where two more tuning sessions are currently running) when there had been significant changes to the evaluation function. It of course also makes sense for search parameters, especially since the razoring margin has not been tuned yet (or to be precise only hand-tuned) for crazyhouse.
@arbolis First regression test results are there now.
I'll run a big regression test (161130 vs 161209) in Crazyhouse on my 6 cores@4GHz : 55s+2s 50s+2s 45s+2s 40s+2s 35s+2s 30s+2s
I like big increments because crazyhouse is less linear than chess (the evaluation can change a lot from one move to the next), so each move needs fresh, deeper consideration.
To be clear @Vinvin20 is referring to my comment on #131 where I referenced #161 and asked him to test on the Windows platform.
It just now occurred to me that SPSA Tuner-based tests (of Stockfish) can be kept in the repository! I intend to add one test per variant.
@ddugovic What do you mean by that?
Sorry, I was unclear. So far I have added crazyhouse.conf and crazyhouse.var. It now occurs to me that having one .conf and one .var file per variant (which tune all the parameters) seems like the most efficient way to use the tuner.
Last year I attempted a fishtest install. I will attempt it again.
@ddugovic This would be great. With small changes this could at least be used for the variants supported by CuteChess and probably for all variants regarding tuning.
> I'll run a big regression test (161130 vs 161209) in Crazyhouse on my 6 cores@4GHz :
Results are in "Regression test results" : https://github.com/ddugovic/Stockfish/issues/170#issuecomment-266518756
> Last year I attempted a fishtest install. I will attempt it again.
@ddugovic Here is my partly successful attempt on a fishtest server and worker: http://35.161.250.236:6543/tests. It is only a first attempt, so many things do not work and/or have not been tested yet, but at least some basic functionalities are supported for variants (an intentionally bad patch for atomic chess indeed shows bad results, so it seems to be working^^). I have not added opening books for variants yet, so only variants with the same FEN format as standard chess are supported, except for giveaway, which is not supported by cutechess.
The currently running worker is an AWS EC2 c3.large spot instance (with Ubuntu 16.04), initialized with a script based on https://github.com/glinscott/fishtest/wiki/Running-the-worker-in-the-Amazon-AWS-EC2-cloud, but adjusted to be able to use a cheap instance for testing purposes. In case you want to try to add another instance, you can find the script below. You only have to replace USERNAME and PASSWORD by an account you created on http://35.161.250.236:6543/signup (and optionally adjust the number of cores in the for loop).
#!/bin/bash
# replace USERNAME and PASSWORD with your fishtest username and password
# the double quotes deal with symbols in username or password: don't delete them
username="USERNAME"
password="PASSWORD"
# update software
apt update -y
#apt full-upgrade -y
apt install -y python build-essential libqtcore4 libqt5core5a unzip
useradd -m fishtest
sudo -i -u fishtest wget https://github.com/ianfab/fishtest/archive/master.zip
# disable hyperthreads
for cpunum in $(cat /sys/devices/system/cpu/cpu*/topology/thread_siblings_list | cut -s -d, -f2- | tr ',' '\n' | sort -un); do
echo 0 > /sys/devices/system/cpu/cpu$cpunum/online
done
# generate a script for running fishtest
cat << EOF > runscript.sh
#!/bin/bash
partitionid=0
# partition the available cores in independent fishtest workers
# the sum of the partitions should be the number of physical cores -1
for partition in 1
do
partitionid=\$((partitionid+1))
mkdir part_\$partitionid
cd part_\$partitionid
unzip ../master.zip
echo "make build ARCH=x86-64 COMP=gcc -j \$partition" > fishtest-master/worker/custom_make.txt
(
# put fishtest in a loop restarting every 4h to deal with occasional hangs or crashes
while true; do
# some random delay to avoid starting too many fishtest workers at the same time
sleep "\$((RANDOM%20)).\$((RANDOM%100))"
timeout 4h python fishtest-master/worker/worker.py --concurrency \$partition "$username" "$password" >& out
done
) &
cd ..
done
wait
EOF
chmod a+rx runscript.sh
# execute script
sudo -i -u fishtest /bin/bash `pwd`/runscript.sh
# if we come here, terminate
#poweroff
So far most things are working on my fishtest instance http://35.161.250.236:6543/tests, so I invite you to create tests and/or add workers. Do not hesitate to ask in case of questions, especially if you want to contribute a worker.
What is not working yet:
What is working:
Great achievement, @ianfab ! Thank you !
Congratulations on setting up fishtesting! I tried to add my computer as a worker, changed the server IP and port in fishtest (config files and worker.py), and created a username and password on your network, but was not successful because the updater gets into a loop trying to update from version 59 to 61. My OS is Windows 10.
@lantonov Thanks for your support. So far I have only added AWS EC2 Linux machines as workers, but I will try to help you get it working on Windows.
The first important point is that I had to update the cutechess compiles, because the old version used for fishtest does not support the UCI_Variant option (and many of the variants). So far I have only updated the Linux 64-bit compile, so it would be nice if you could compile the latest cutechess-cli on Windows and open a pull request with an updated cutechess-cli-win.zip in https://github.com/ianfab/FishCooking.
Furthermore, you have to use the worker version from https://github.com/ianfab/fishtest, because I had to apply some changes to get it working for variants and I adjusted URLs (FishCooking repo, etc.), IP and port.
With these two changes, I think it should work. Please only add a Windows machine once the cutechess compile in https://github.com/ianfab/FishCooking is updated, because otherwise the worker will only play standard chess regardless of which variant is stated in the test (since the UCI_Variant option command is not sent).
@ianfab Done! See https://github.com/ianfab/FishCooking/pull/1 EDIT: it is 32-bit so it will work for all Windows versions.
Thanks, Theo! I tried, but cutechess gave me trouble compiling.
I am on fishtest now, playing horde at the moment. However, I entered with 3 cores on one computer, and although it registered them as 3 computers of 1 core each, only 1 core is playing and the other 2 cores stand idle. P.S. Got it right at last. The problem was the location of my fishtest.cfg.
For the losers variant, cutechess reports illegal moves. The crazyhouse and horde variants play normally, though 5/100 games were lost on time in crazyhouse. Although there is a warning for illegal moves in losers, the games are played to the end, and the PGN looks normal.
If possible, maybe try configuring Move Overhead=1000 like official-stockfish does for competitions.
Thank you @theo77186 for providing the windows binary, I have just merged your PR.
And thank you @lantonov for the feedback. Let's wait for more games to finish so we have better statistics on the time losses. If the rate remains that high, the move overhead might have to be increased as suggested by @ddugovic, though I think increasing it from 30 to something like 100 should be enough to reduce the rate of time losses to a manageable level.
"No tasks available at this time, waiting..." although there are pending tasks. Also, the previous worker sessions are not closed. These may be connected.
@lantonov: as there are 4 reported (not correctly terminated) connections, the limit kicks in and prevents further connections. EDIT: unfortunately, these ghost workers cannot be killed. There is a bug in the fishtest implementation, as workers are normally killed after about 1-2 hours.
For now I increased the limit of workers to 8. I will look into it later to see how to fix this.
I found out that there is a script to remove inactive workers (on the server side), so the problem should be solved for now.
On Windows 10 I installed the Windows Subsystem for Linux, and under this Linux I observe:
Step 4/4. Deleting profile data ...
make ARCH=x86-64-modern COMP=gcc profileclean
make[1]: Entering directory `/tmp/tmp5E0t2D/ddugovic-Stockfish-1d5e15a/src'
make[1]: Leaving directory `/tmp/tmp5E0t2D/ddugovic-Stockfish-1d5e15a/src'
Verifying signature of stockfish ...
Verifying signature of base ...
CPU factor : 0.808452 - tc adjusted to 8.08+0.08
Running tune_atomic_close_enemies vs tune_atomic_close_enemies
['/mnt/c/Users/Gaming/Desktop/fishtest/worker/testing/cutechess-cli', '-repeat', '-rounds', '2', '-tournament', 'gauntlet', '-srand', '1144930363', '-resign', 'movecount=8', 'score=800', '-draw', 'movenumber=34', 'movecount=8', 'score=20', '-concurrency', '1', '-openings', u'file=atomic.epd', u'format=epd', 'order=random', 'plies=16', '-variant', u'atomic', '-engine', 'name=stockfish', 'cmd=stockfish', u'option.Hash=4', u'option.mCloseEnemies[ATOMIC_VARIANT]=15', '-engine', 'name=base', 'cmd=base', u'option.Hash=4', u'option.mCloseEnemies[ATOMIC_VARIANT]=19', '-each', 'proto=uci', 'tc=8.08+0.08', 'option.Threads=1']
/mnt/c/Users/Gaming/Desktop/fishtest/worker/testing/cutechess-cli: error while loading shared libraries: libQt5Core.so.5: ('TC limit', 80.84523695233669, 'End time:', datetime.datetime(2017, 1, 31, 20, 39, 47, 413730))
cannot open shared object file: No such file or directory
['/mnt/c/Users/Gaming/Desktop/fishtest/worker/testing/cutechess-cli', '-repeat', '-rounds', '2', '-tournament', 'gauntlet', '-srand', '4192311519', '-resign', 'movecount=8', 'score=800', '-draw', 'movenumber=34', 'movecount=8', 'score=20', '-concurrency', '1', '-openings', u'file=atomic.epd', u'format=epd', 'order=random', 'plies=16', '-variant', u'atomic', '-engine', 'name=stockfish', 'cmd=stockfish', u'option.Hash=4', u'option.mCloseEnemies[ATOMIC_VARIANT]=15', '-engine', 'name=base', 'cmd=base', u'option.Hash=4', u'option.mCloseEnemies[ATOMIC_VARIANT]=19', '-each', 'proto=uci', 'tc=8.08+0.08', 'option.Threads=1']
/mnt/c/Users/Gaming/Desktop/fishtest/worker/testing/cutechess-cli: error while loading shared libraries: libQt5Core.so.5: cannot open shared object file: No such file or directory('TC limit', 80.84523695233669, 'End time:', datetime.datetime(2017, 1, 31, 20, 39, 48, 650214))
Seeking input from others as testing isn't yet my strong suit...