ddugovic / Stockfish

Retired multi-variant fork of popular UCI chess engine; please use Fairy-Stockfish instead
https://github.com/ianfab/Fairy-Stockfish
GNU General Public License v3.0
133 stars 44 forks source link

What conditions should be used for testing changes? #149

Closed sf-x closed 4 years ago

ianfab commented 7 years ago

@ddugovic I think there are several ways to fix this. If LD_LIBRARY_PATH is set to the directory where libQt5Core.so.5 can be found (delivered with cutechess-cli-linux-64.zip), either manually by you or inside the fishtest worker code, it should work. Alternatively, installing libqt5core5a should solve it, too. I am however not sure whether there are side effects if I set LD_LIBRARY_PATH inside the worker code, at least it would probably have to check first whether it is under linux.

theo77186 commented 7 years ago

@ddugovic @ianfab It seems that the Linux subsystem's linux-ld.so isn't working proprely. Changing LD_LIBRARY_PATH won't get any better at all. Installing libqt5core5a doesn't work either as the cutechess executable seems to be binary-incompatible with the package. PS: does anyone have issues with my executables for cutechess? Edit: formating Edit 2: I have verified on Linux for illegal move issue, it still occurs. It's either SF sending bad PVs or cutechess that have wrong illegal move detection.

lantonov commented 7 years ago

For Losers, I get a warning on each move that it is illegal but otherwise it plays as it should. I think this is a problem in the code, not in the executable.

ianfab commented 7 years ago

The losers implementation is rather new and has not been heavily tested, since it is not used on lichess, but it seemed to work fine, at least I have not observed an illegal move yet. So I suspect that it is an issue with cutechess, but I am not entirely sure. If you observe an illegal PV move, please report it.

ianfab commented 7 years ago

There was a hard-coded value in fishtest I was not aware of when changing the chunk size of tasks, which caused an incorrect calculation of the internal priority. This is why new tests had a much higher internal priority than older tests, which caused old tests to remain in the queue for a long time. It should be fixed now, but feedback on the distribution/priority of tasks is still welcome.

lantonov commented 7 years ago

Chunk size increased from 100 to 200. Is it intentional?

ianfab commented 7 years ago

Yes, it is. Due to the small chunk size, machines were switching frequently between tests, so I thought it would make sense to increase chunk size to reduce the time spent on compiling.

lantonov commented 7 years ago

Right, compilation takes a big part of machine time. But you can greatly reduce the compilation time if you use "build" instead of "profile-build". In my opinion, supported by some observations on performance in Fishtest, using profile-build in testing has no particular advantage over plain build. It has the disadvantage that increases variation between machines, as each machine uses a slightly different compile adjusted to its parameters. There is another, bigger disadvantage of profile-build for some machines (e.g., mine which is Windows 8). The OS (more properly the anti-virus program F-secure) has a defence that I can't remove (to change its settings it requires admin password which I don't have), which requires a user confirmation before starting each new program loaded on the HD. This means that before starting each new compile, a popup box appears that requires me to check "Run the new program on my computer" and then click OK. On simple build this popup does not appear. That's why I can't leave my computer overnight or for longer periods of time with profile-build.

ianfab commented 7 years ago

I agree, and I have also been using build instead of profile-build.

alwey commented 7 years ago

The illegal move topic might well be a cutechess problem. It would be nice to get a few samples of this issue especially for Losers.

For Giveaway and Suicide I observed termination of games with "illegal move" adjudication in 25%-30% of my test games (15 s +150 ms/move). I then created a pull request [WIP] for cutechess with a work-around for Giveaway and Suicide. For these variants the PV will not be converted to SAN and tested for legality any more. The illegal move rate then dropped to 6 out of 2000 and any of these 6 cases still was a false positive. So Stockfish was fine.

ianfab commented 7 years ago

I think that we should use wider bounds than the currently used [-6,2] for testing simplifications, since simplifications are more important compared to official stockfish and small regressions are less problematic. I suggest to use [-10, 5]. What do you think?

ddugovic commented 7 years ago

I like the motivation but I'm confused, isn't SPRT based on equality null hypotheses such that using [-10, 5] would take longer than using [-6, 2]?

ianfab commented 7 years ago

I am not an expert on this, but LLR should be zero for a -2 Elo performance in the case of [-6, 2] and for -2.5 Elo in the case of [-10, 5], so small regressions are more likely to pass the latter and since the hypotheses are more different, the LLR should increase more quickly if the performance is above this threshold.

In order to have some numbers to support this reasoning, I have run this SPRT simulator for both choices of bounds:

[-6, 2]

BayesElo    Elo %Pass   Avg
-10.00  -6.20   0.0027  5540
-9.00   -5.58   0.0060  6280
-8.00   -4.96   0.0118  7237
-7.00   -4.34   0.0248  8460
-6.00   -3.72   0.0498  10010
-5.00   -3.10   0.0977  11962
-4.00   -2.48   0.1867  14003
-3.00   -1.86   0.3239  15734
-2.00   -1.24   0.4992  16481
-1.00   -0.62   0.6770  15813
0.00    -0.00   0.8143  14013
1.00    0.62    0.9010  11913
2.00    1.24    0.9498  10060

[-10, 5]

BayesElo    Elo %Pass   Avg
-10.00  -6.20   0.0473  2892
-9.00   -5.58   0.0701  3163
-8.00   -4.96   0.1033  3460
-7.00   -4.34   0.1454  3789
-6.00   -3.72   0.2023  4090
-5.00   -3.10   0.2720  4371
-4.00   -2.48   0.3562  4635
-3.00   -1.86   0.4524  4692
-2.00   -1.24   0.5503  4740
-1.00   -0.62   0.6463  4642
0.00    -0.00   0.7291  4385
1.00    0.62    0.7998  4080
2.00    1.24    0.8569  3779

As you can see the average number of games is much lower for the latter, but the probability of a regression to pass increases.

ianfab commented 7 years ago

@ddugovic If you agree on the new bounds I would consider this STC test as passed (LLR >> 3 with the new bounds) and suggest you to stop STC and submit LTC with the new bounds.

ddugovic commented 7 years ago

OK, the bounds [-10, 5] make sense to me (small regressions are permissible) and I'll stop that STC test.

ianfab commented 7 years ago

You can now filter tests by variant if you click on a variant name, e.g. http://35.161.250.236:6543/tests/variant/crazyhouse, and the variant is also displayed in the list of test results. I hope this improves clarity.

ddugovic commented 7 years ago

Excellent! One minor note -- if it's convenient to implement, could the page title (Stockfish Testing Queue) indicate the variant filter?

ianfab commented 7 years ago

@ddugovic Good idea. It now displays the value of variant, username and/or success_only below the tittle if a filter is applied.

ianfab commented 7 years ago

I have updated the regression test result page. It now labels the versions by date, adds up relative elo differences of test results, and propagates the errors. I think the graph and table are more clear now.

ppigazzini commented 7 years ago

@ddugovic I added a static cutechess-cli.exe for Windows and a cutechess-cli for Linux built with Qt5 static. Now fishtest runs fine on several platforms: Windows MinGW, Ubuntu 14.04 and newer, CentOS 7, Linux Subsystem for Windows. On Windows 10 LSW is the faster way.

ianfab commented 7 years ago

The queue is running empty more and more often in the last days, because I am busy with other things. It would be great to have more people contributing tests and tuning attempts.

ppigazzini commented 7 years ago

@ianfab @ddugovic : chapeaux, you two guys have done an extraordinary work. In order to try to involve other developers, perhaps some marketing post on lichess, talkchess? IMO it will be useful to have a space where ask questions about the project.

ddugovic commented 7 years ago

In general the Lichess Feedback forum has served as a useful space... though it sounds like people aren't satisfied with that & people want a chat room?

Perhaps for that purpose the Lichess discord will suffice until people who aren't busy create and moderate a dedicated place for discussion...

Vinvin20 commented 7 years ago

I already posted some advertising on SF forum : https://groups.google.com/forum/?fromgroups=#!topic/fishcooking/lvp7FNMRWII

ddugovic commented 7 years ago

Right, although I assume advertisements attract users moreso than developers, and they encourage tens of thousands of active Lichess users reporting issues to bypass the Lichess Feedback forum. That forum is helpful since I don't have an iPad, iPhone, Macbook, etc. and I can't keep up with all the hardware and browser issues like Lichess can.

The other problem is that for new developers there is a learning curve and skilled people tend to be busy; it's not clear what a new developer would do... maybe parameter tuning & testing? But people tend to avoid such tedious work.

ppigazzini commented 7 years ago

I started writing some pages in the Wiki to launching the worker on different OS.

ianfab commented 7 years ago

@ppigazzini Thanks for the kind words and your efforts in extending the wiki.

@ddugovic There is a learning curve, and developing a strong engine can be tedious, but at least in my experience it is possible to get familiar with small parts of the Stockfish code without having to understand much of the rest. When I contributed my first evaluation patches to this repository, I had basically only looked at evaluate.cpp and not much else and I did not have (much) prior knowledge of the Stockfish code. Furthermore, I think developing official stockfish is much more tedious, since it is incredibly difficult to find improvements there, whereas the history of my fishtest instance shows that ~40% (without regression and LTC tests it probably is less, but anyway) of our tests are succesful.

ppigazzini commented 7 years ago

@ianfab I wrote some wiki pages of Official Stockfish so it's a simply cut&paste work with some text reviewing. I will open a couple of PR for mv fishtest in order to use MSYS2 python, simplifying the worker installation on Windows.

ppigazzini commented 7 years ago

@ddugovic @ianfab please review the wiki page Creating my first test. I updated the SPRT bounds checking the tests history, but I don't know all the changes done respect to the official Fistest. Should be useful to have some background/reference info in Home; if I'm not wrong, Multi Variant Stockfish is used in these lichess.org projects (by @niklasf):

ddugovic commented 7 years ago

Thanks, looks great! I have updated the Contributor Guidelines section.

ppigazzini commented 7 years ago

Added Installing a development multi variant fishtest server on Ubuntu with the updated installation script.

ddugovic commented 7 years ago

In terms of strength I am most concerned about popular variants:

Blitz: 80,496 players Crazyhouse: 3,446 players Antichess: 2,526 players(!) Atomic: 1,757 players Others: < 1,200 players

In light of this and of issue #278 , perhaps we should consider that for future PRs/tests -- other than standard, crazyhouse, antichess, and atomic -- should also not cause more than a 1% slowdown for make build benchmarks. (No current PR is in danger of this, however by this standard #281 seems infeasible.)

ianfab commented 7 years ago

I have slightly adjusted the test submission page, so that the bounds and the select boxes are not reset when changing time control or when rescheduling tests, since I found this behavior a bit annoying. Rescheduling for LTC should now be doable with two or three clicks. Please let me know if it does not work properly.

ianfab commented 7 years ago

I consider updating the cutechess compiles for fishtest with the PR https://github.com/cutechess/cutechess/pull/200 in order to be able to test giveaway and suicide chess on fishtest.

@ppigazzini If it is not too much trouble, could you please compile cutechess with the PR included and open a PR in my fishtest repo?

ppigazzini commented 7 years ago

@ianfab PR opened on your FishCooking repo and your fishtest repo: close the wrong one.

I use linux containers to build and test on different distro/versions of linux.

ianfab commented 7 years ago

@ppigazzini Thanks a lot. I'll merge it, adjust fishtest for the new variants, and submit some tests to verify that fishtest is working for losing chess variants. I think I might have to bump the worker version to force the workers to download the new cutechess compiles.

ianfab commented 7 years ago

Enforcing the update of cutechess compiles on all workers without manually accessing them required multiple updates of the worker version. Sorry if this caused any issues. It should work fine now, and similar changes should cause less problems in the future because of adjustments of the worker code.

I am doing some verification tests for giveaway. The results are looking good so far.

ianfab commented 7 years ago

For ease of use I have added a few buttons to fishtest to set SPRT bounds and the opening book name.

ddugovic commented 7 years ago

FYI, although I am no longer handling merges from upstream, until someone merges upstream changes future PRs will remain open (unmerged).

ianfab commented 7 years ago

FYI, although I am no longer handling merges from upstream, until someone merges upstream changes future PRs will remain open (unmerged).

I can try to do some upstream merges even though I do not like that it sounds like blackmailing, where you could have simply asked me. I can not promise that I will have time to do all merges in addition to filling the queue with tests and maintaining fishtest.

If you are busy, I understand that you do not have enough time to do all the upstream merges, but I am surprised that you announce not to do any upstream merges any more. May I ask what lead you to this decision?

ddugovic commented 7 years ago

I am busy, but also the repository keeps growing as gaining Elo is the only consideration and it is difficult to control for complexity, let alone for quality (in terms of blatantly incorrect perft and search results). I'm not saying anyone has to help, just that the project is growing and as I become busier I don't know what to do with it.

ianfab commented 7 years ago

I am busy, but also the repository keeps growing as gaining Elo is the only consideration

I would claim that my considerations are much more nuanced than that. Just because I evaluate the complexity of the code in a different way than you (not claiming that mine is the "right" way, this simply is subjective), that does not mean that I do not consider it. If this is about losers chess, I can just repeat myself once more that I was mainly concerned about the workflow. Testing guidelines have to be a bit unflexible, because they are mainly there to control the unavoidable bias of developers. Of course one can make exceptions, but only after very careful reasoning in order to retain control of the bias.

... quality (in terms of blatantly incorrect perft and search results).

It has a lot of weaknesses and even some bugs, but I am wondering what you mean by "blatantly incorrect perft and search results".

ianfab commented 7 years ago

Fishtest stopped working a few hours ago due to a crash of mongodb. It should be up again now.

ianfab commented 7 years ago

Sorry, there still seem to be some issues and I will not be able to access the server during the next few hours, but I will try to fix it as soon as possible.

ianfab commented 7 years ago

Unfortunately, neither automatic repairing nor manually deleting corrupted data nor merging a backup helped, so I finally had to completely restore the database from a backup. This means that some data was lost, but at least it seems to be working again. Sorry for the inconvenience.

ianfab commented 7 years ago

Here is a backup of the HTML page, where you can see the test results that have been lost in the database:

Stockfish Testing Framework.html.txt

ianfab commented 7 years ago

When I tuned PSQTs for several variants I wanted to avoid the tedious and error prone work of manually copying hundreds of tuning results into the code, so as a result of this I created a script that automatically replaces the values in the code given the tuning results, which might also be useful for others: https://github.com/ianfab/fishutils. It still has several limitations since it does not actually parse C++ code but only searches for regular expression patterns, but so far it has worked well for me in most cases.

ddugovic commented 7 years ago

Ah, interesting... that means it isn't necessary to rebase from the tune_variant branch (which I periodically rebase upon master).

ianfab commented 7 years ago

@ddugovic I am not sure I understand your comment. What the script does is that it gets an input like param: mLever[4], best: 27.00, start: 17.00, min: -100.00, max: 200.00, c 41.714131, a 223.069148 (copied from tuning results in fishtest), then finds the array Lever in pawns.cpp, and changes the middlegame value of the array entry with index 4 in the code from 17 to 27, so you do not have to do this code change manually (which can be tedious if it is not only one parameter as in the example but maybe hundreds).

ddugovic commented 7 years ago

Oh! Then I entirely misunderstood (I thought we were referring to the tuning instrumentation code - I overlooked "tuning results"). Now that I think more about it, that code is in tune.h and tune.cpp respectively...

Thanks @ianfab and I very much look forward to using your tool!