sf-x closed this issue 4 years ago
@ddugovic @ianfab It seems that the Linux subsystem's linux-ld.so isn't working properly. Changing `LD_LIBRARY_PATH` doesn't help at all. Installing `libqt5core5a` doesn't work either, as the cutechess executable seems to be binary-incompatible with the package.
PS: does anyone have issues with my executables for cutechess?
Edit: formatting
Edit 2: I have verified the illegal move issue on Linux; it still occurs. It's either SF sending bad PVs or cutechess having faulty illegal move detection.
For Losers, I get a warning on each move that it is illegal but otherwise it plays as it should. I think this is a problem in the code, not in the executable.
The losers implementation is rather new and has not been heavily tested, since it is not used on lichess, but it seemed to work fine, at least I have not observed an illegal move yet. So I suspect that it is an issue with cutechess, but I am not entirely sure. If you observe an illegal PV move, please report it.
There was a hard-coded value in fishtest I was not aware of when changing the chunk size of tasks, which caused an incorrect calculation of the internal priority. This is why new tests had a much higher internal priority than older tests, which caused old tests to remain in the queue for a long time. It should be fixed now, but feedback on the distribution/priority of tasks is still welcome.
Chunk size increased from 100 to 200. Is it intentional?
Yes, it is. Due to the small chunk size, machines were switching frequently between tests, so I thought it would make sense to increase the chunk size to reduce the time spent on compiling.
Right, compilation takes up a big part of machine time. But you can greatly reduce the compilation time if you use `build` instead of `profile-build`. In my opinion, supported by some observations of performance on fishtest, using `profile-build` for testing has no particular advantage over a plain `build`. It has the disadvantage of increasing variation between machines, as each machine uses a slightly different compile adjusted to its own parameters.

There is another, bigger disadvantage of `profile-build` for some machines (e.g. mine, which runs Windows 8). The OS (more precisely, the anti-virus program F-Secure) has a defence that I can't remove (changing its settings requires an admin password, which I don't have) and that demands user confirmation before starting each new program loaded onto the HD. This means that before each new compile starts, a popup box appears requiring me to check "Run the new program on my computer" and then click OK. With a plain `build` this popup does not appear. That's why I can't leave my computer running overnight or for longer periods with `profile-build`.
I agree, and I have also been using `build` instead of `profile-build`.
The illegal move topic might well be a cutechess problem. It would be nice to get a few samples of this issue, especially for Losers.
For Giveaway and Suicide I observed termination of games with "illegal move" adjudication in 25%-30% of my test games (15 s + 150 ms/move). I then created a pull request [WIP] for cutechess with a work-around for Giveaway and Suicide: for these variants the PV is no longer converted to SAN and checked for legality. The illegal move rate then dropped to 6 out of 2000 games, and each of those 6 cases was still a false positive. So Stockfish was fine.
I think that we should use wider bounds than the currently used [-6, 2] for testing simplifications, since simplifications are more important for this project than for official Stockfish and small regressions are less problematic. I suggest using [-10, 5]. What do you think?
I like the motivation but I'm confused, isn't SPRT based on equality null hypotheses such that using [-10, 5] would take longer than using [-6, 2]?
I am not an expert on this, but the LLR should be zero at a -2 Elo performance in the case of [-6, 2] and at -2.5 Elo in the case of [-10, 5], so small regressions are more likely to pass the latter. And since the hypotheses are further apart, the LLR should increase more quickly if the performance is above this threshold.
In order to have some numbers to support this reasoning, I have run this SPRT simulator for both choices of bounds:
```
[-6, 2]
BayesElo     Elo   %Pass     Avg
 -10.00    -6.20  0.0027    5540
  -9.00    -5.58  0.0060    6280
  -8.00    -4.96  0.0118    7237
  -7.00    -4.34  0.0248    8460
  -6.00    -3.72  0.0498   10010
  -5.00    -3.10  0.0977   11962
  -4.00    -2.48  0.1867   14003
  -3.00    -1.86  0.3239   15734
  -2.00    -1.24  0.4992   16481
  -1.00    -0.62  0.6770   15813
   0.00    -0.00  0.8143   14013
   1.00     0.62  0.9010   11913
   2.00     1.24  0.9498   10060
```
```
[-10, 5]
BayesElo     Elo   %Pass     Avg
 -10.00    -6.20  0.0473    2892
  -9.00    -5.58  0.0701    3163
  -8.00    -4.96  0.1033    3460
  -7.00    -4.34  0.1454    3789
  -6.00    -3.72  0.2023    4090
  -5.00    -3.10  0.2720    4371
  -4.00    -2.48  0.3562    4635
  -3.00    -1.86  0.4524    4692
  -2.00    -1.24  0.5503    4740
  -1.00    -0.62  0.6463    4642
   0.00    -0.00  0.7291    4385
   1.00     0.62  0.7998    4080
   2.00     1.24  0.8569    3779
```
As you can see, the average number of games is much lower for the latter, but the probability that a regression passes increases.
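For anyone who wants to reproduce or extend numbers like these, here is a minimal SPRT simulator sketch. It is my own code, not the simulator used above: it assumes the BayesElo model with a fixed draw_elo (240 here, an assumed value; the real one depends on the time control) and alpha = beta = 0.05, and all function names are illustrative.

```python
# Minimal SPRT simulator sketch (illustrative, not fishtest's code):
# simulates games under the BayesElo model and stops when the
# log-likelihood ratio (LLR) crosses a bound.
import math
import random

def bayeselo_probs(bayes_elo, draw_elo):
    """Win/draw/loss probabilities for a given BayesElo strength."""
    win = 1.0 / (1.0 + 10 ** ((-bayes_elo + draw_elo) / 400.0))
    loss = 1.0 / (1.0 + 10 ** ((bayes_elo + draw_elo) / 400.0))
    return win, 1.0 - win - loss, loss

def sprt_run(true_elo, elo0, elo1, draw_elo=240.0,
             alpha=0.05, beta=0.05, max_games=400000, rng=random):
    """Run one simulated SPRT; return (passed, games_played)."""
    lower = math.log(beta / (1.0 - alpha))    # ~ -2.94
    upper = math.log((1.0 - beta) / alpha)    # ~ +2.94
    p_true = bayeselo_probs(true_elo, draw_elo)
    p0 = bayeselo_probs(elo0, draw_elo)       # H0: performance = elo0
    p1 = bayeselo_probs(elo1, draw_elo)       # H1: performance = elo1
    # Per-game LLR increment for a win, draw, or loss.
    inc = [math.log(a / b) for a, b in zip(p1, p0)]
    llr, games = 0.0, 0
    while lower < llr < upper and games < max_games:
        r = rng.random()
        outcome = 0 if r < p_true[0] else 1 if r < p_true[0] + p_true[1] else 2
        llr += inc[outcome]
        games += 1
    return llr >= upper, games
```

Looping `sprt_run` over many trials at each true BayesElo value yields %Pass and average-games columns comparable to the tables above.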
@ddugovic If you agree on the new bounds, I would consider this STC test as passed (LLR >> 3 with the new bounds) and suggest you stop the STC and submit an LTC with the new bounds.
OK, the bounds [-10, 5] make sense to me (small regressions are permissible) and I'll stop that STC test.
You can now filter tests by variant if you click on a variant name, e.g. http://35.161.250.236:6543/tests/variant/crazyhouse, and the variant is also displayed in the list of test results. I hope this improves clarity.
Excellent! One minor note -- if it's convenient to implement, could the page title (Stockfish Testing Queue) indicate the variant filter?
@ddugovic Good idea. It now displays the value of variant, username, and/or success_only below the title if a filter is applied.
I have updated the regression test result page. It now labels the versions by date, adds up relative Elo differences of test results, and propagates the errors. I think the graph and table are clearer now.
@ddugovic I added a static cutechess-cli.exe for Windows and a cutechess-cli for Linux built with static Qt5. Now fishtest runs fine on several platforms: Windows MinGW, Ubuntu 14.04 and newer, CentOS 7, and the Windows Subsystem for Linux (WSL). On Windows 10, WSL is the fastest way.
The queue has been running empty more and more often over the last few days, because I am busy with other things. It would be great to have more people contributing tests and tuning attempts.
@ianfab @ddugovic: chapeau, you two have done extraordinary work. To try to involve other developers, perhaps some marketing posts on lichess or talkchess? IMO it would be useful to have a space to ask questions about the project.
In general the Lichess Feedback forum has served as a useful space... though it sounds like people aren't satisfied with that and want a chat room?
Perhaps for that purpose the Lichess discord will suffice until people who aren't busy create and moderate a dedicated place for discussion...
I already posted some advertising on the SF forum: https://groups.google.com/forum/?fromgroups=#!topic/fishcooking/lvp7FNMRWII
Right, although I assume advertisements attract users more so than developers, and they encourage tens of thousands of active Lichess users to bypass the Lichess Feedback forum when reporting issues. That forum is helpful, since I don't have an iPad, iPhone, MacBook, etc., and I can't keep up with all the hardware and browser issues like Lichess can.
The other problem is that for new developers there is a learning curve and skilled people tend to be busy; it's not clear what a new developer would do... maybe parameter tuning & testing? But people tend to avoid such tedious work.
I started writing some wiki pages about launching the worker on different OSes.
@ppigazzini Thanks for the kind words and your efforts in extending the wiki.
@ddugovic There is a learning curve, and developing a strong engine can be tedious, but at least in my experience it is possible to get familiar with small parts of the Stockfish code without having to understand much of the rest. When I contributed my first evaluation patches to this repository, I had basically only looked at `evaluate.cpp` and not much else, and I did not have (much) prior knowledge of the Stockfish code. Furthermore, I think developing official Stockfish is much more tedious, since it is incredibly difficult to find improvements there, whereas the history of my fishtest instance shows that ~40% of our tests are successful (without regression and LTC tests it is probably less, but anyway).
@ianfab I wrote some wiki pages for official Stockfish, so it's simply copy-and-paste work with some text reviewing. I will open a couple of PRs for mv fishtest to use MSYS2 Python, simplifying the worker installation on Windows.
@ddugovic @ianfab please review the wiki page Creating my first test. I updated the SPRT bounds by checking the test history, but I don't know all the changes made with respect to the official fishtest. It would be useful to have some background/reference info on the Home page; if I'm not wrong, Multi-Variant Stockfish is used in these lichess.org projects (by @niklasf):
Thanks, looks great! I have updated the Contributor Guidelines section.
Added Installing a development multi variant fishtest server on Ubuntu with the updated installation script.
In terms of strength I am most concerned about popular variants:
- Blitz: 80,496 players
- Crazyhouse: 3,446 players
- Antichess: 2,526 players (!)
- Atomic: 1,757 players
- Others: < 1,200 players
In light of this and of issue #278, perhaps we should require that future PRs/tests -- other than for standard, crazyhouse, antichess, and atomic -- not cause more than a 1% slowdown in `make build` benchmarks. (No current PR is in danger of this; however, by this standard #281 seems infeasible.)
I have slightly adjusted the test submission page, so that the bounds and the select boxes are not reset when changing time control or when rescheduling tests, since I found this behavior a bit annoying. Rescheduling for LTC should now be doable with two or three clicks. Please let me know if it does not work properly.
I am considering updating the cutechess compiles for fishtest with the PR https://github.com/cutechess/cutechess/pull/200 in order to be able to test giveaway and suicide chess on fishtest.
@ppigazzini If it is not too much trouble, could you please compile cutechess with the PR included and open a PR in my fishtest repo?
@ianfab PR opened on your FishCooking repo and your fishtest repo: close the wrong one.
I use linux containers to build and test on different distro/versions of linux.
@ppigazzini Thanks a lot. I'll merge it, adjust fishtest for the new variants, and submit some tests to verify that fishtest is working for losing chess variants. I think I might have to bump the worker version to force the workers to download the new cutechess compiles.
Enforcing the update of cutechess compiles on all workers without manually accessing them required multiple updates of the worker version. Sorry if this caused any issues. It should work fine now, and similar changes should cause fewer problems in the future thanks to adjustments to the worker code.
I am doing some verification tests for giveaway. The results are looking good so far.
For ease of use I have added a few buttons to fishtest to set SPRT bounds and the opening book name.
FYI, although I am no longer handling merges from upstream, until someone merges upstream changes future PRs will remain open (unmerged).
I can try to do some upstream merges, even though I do not like that this sounds like blackmail when you could have simply asked me. I cannot promise that I will have time to do all merges in addition to filling the queue with tests and maintaining fishtest.
If you are busy, I understand that you do not have enough time to do all the upstream merges, but I am surprised that you announce you will not do any upstream merges any more. May I ask what led you to this decision?
I am busy, but also the repository keeps growing as gaining Elo is the only consideration and it is difficult to control for complexity, let alone for quality (in terms of blatantly incorrect perft and search results). I'm not saying anyone has to help, just that the project is growing and as I become busier I don't know what to do with it.
> I am busy, but also the repository keeps growing as gaining Elo is the only consideration
I would claim that my considerations are much more nuanced than that. Just because I evaluate the complexity of the code in a different way than you (not claiming that mine is the "right" way, this simply is subjective), that does not mean that I do not consider it. If this is about losers chess, I can just repeat myself once more that I was mainly concerned about the workflow. Testing guidelines have to be a bit inflexible, because they are mainly there to control the unavoidable bias of developers. Of course one can make exceptions, but only after very careful reasoning in order to retain control of that bias.
> ... quality (in terms of blatantly incorrect perft and search results).
It has a lot of weaknesses and even some bugs, but I am wondering what you mean by "blatantly incorrect perft and search results".
Fishtest stopped working a few hours ago due to a crash of mongodb. It should be up again now.
Sorry, there still seem to be some issues and I will not be able to access the server during the next few hours, but I will try to fix it as soon as possible.
Unfortunately, neither automatic repairing nor manually deleting corrupted data nor merging a backup helped, so I finally had to completely restore the database from a backup. This means that some data was lost, but at least it seems to be working again. Sorry for the inconvenience.
Here is a backup of the HTML page, where you can see the test results that have been lost in the database:
When I tuned PSQTs for several variants, I wanted to avoid the tedious and error-prone work of manually copying hundreds of tuning results into the code, so I created a script that automatically replaces the values in the code given the tuning results, which might also be useful for others: https://github.com/ianfab/fishutils. It still has several limitations, since it does not actually parse C++ code but only searches for regular expression patterns, but so far it has worked well for me in most cases.
Ah, interesting... that means it isn't necessary to rebase from the `tune_variant` branch (which I periodically rebase upon `master`).
@ddugovic I am not sure I understand your comment. What the script does is take an input like `param: mLever[4], best: 27.00, start: 17.00, min: -100.00, max: 200.00, c 41.714131, a 223.069148` (copied from tuning results in fishtest), find the array `Lever` in `pawns.cpp`, and change the middlegame value of the array entry with index 4 in the code from 17 to 27, so you do not have to make this code change manually (which can be tedious if it is not just one parameter as in the example, but maybe hundreds).
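The idea can be sketched roughly like this (my own illustration, not the actual fishutils code; it assumes array entries are written as `S(mg, eg)` pairs):

```python
# Hypothetical sketch of the substitution idea: locate a named C++ array
# initializer via regex and rewrite the middlegame value of one
# S(mg, eg) entry. Not the real fishutils implementation.
import re

def patch_mg_value(code, array, index, new_mg):
    """Set the middlegame value of entry `index` of `array` in C++ source."""
    # Find the initializer list, e.g. "Lever[] = { S(0, 0), ... }".
    decl = re.search(re.escape(array) + r"\[\s*\d*\s*\]\s*=\s*\{(.*?)\}",
                     code, re.S)
    if decl is None:
        raise ValueError("array %s not found" % array)
    body = decl.group(1)
    # Enumerate the S(mg, eg) entries and rewrite the chosen one,
    # keeping its endgame value unchanged.
    entries = list(re.finditer(r"S\(\s*(-?\d+)\s*,\s*(-?\d+)\s*\)", body))
    e = entries[index]
    new_body = (body[:e.start()] + "S(%d, %s)" % (new_mg, e.group(2))
                + body[e.end():])
    return code[:decl.start(1)] + new_body + code[decl.end(1):]
```

For instance, `patch_mg_value(source, "Lever", 4, 27)` would perform the `mLever[4]` change from the example above.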
Oh! Then I entirely misunderstood (I thought we were referring to the tuning instrumentation code - I overlooked "tuning results"). Now that I think more about it, that code is in `tune.h` and `tune.cpp` respectively...
Thanks @ianfab and I very much look forward to using your tool!
@ddugovic I think there are several ways to fix this. If `LD_LIBRARY_PATH` is set to the directory where `libQt5Core.so.5` can be found (it is delivered with cutechess-cli-linux-64.zip), either manually by you or inside the fishtest worker code, it should work. Alternatively, installing `libqt5core5a` should solve it, too. I am however not sure whether there are side effects if I set `LD_LIBRARY_PATH` inside the worker code; at the very least it would probably have to check first whether it is running under Linux.
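If the worker were to set it itself, the idea could look roughly like this (a sketch under my own assumptions; the function names and directory layout are illustrative, not actual fishtest worker code):

```python
# Sketch: prepend the bundled Qt directory to the dynamic loader search
# path before launching cutechess-cli, but only on Linux.
import os
import platform
import subprocess

def with_qt_lib_path(env, lib_dir):
    """Return a copy of env with lib_dir prepended to LD_LIBRARY_PATH."""
    env = dict(env)
    old = env.get("LD_LIBRARY_PATH", "")
    env["LD_LIBRARY_PATH"] = lib_dir + (os.pathsep + old if old else "")
    return env

def run_cutechess(cutechess_dir, args):
    """Launch cutechess-cli so it can find the bundled libQt5Core.so.5."""
    env = os.environ
    if platform.system() == "Linux":  # no side effects on other platforms
        env = with_qt_lib_path(os.environ, cutechess_dir)
    return subprocess.Popen(
        [os.path.join(cutechess_dir, "cutechess-cli")] + list(args), env=env)
```

Passing a modified `env` to the child process avoids mutating the worker's own environment, which sidesteps most of the side-effect concerns.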