fairy-stockfish / Fairy-Stockfish

chess variant engine supporting Xiangqi, Shogi, Janggi, Makruk, S-Chess, Crazyhouse, Bughouse, and many more
https://fairy-stockfish.github.io/
GNU General Public License v3.0

How to use Fairy-Stockfish to evaluate board position. #193

Closed Trainzack closed 3 years ago

Trainzack commented 3 years ago

For a school project, we are using Fairy-Stockfish to evaluate the quality of arbitrary chess variants. To do this, we intend to have Fairy-Stockfish play hundreds of games against itself, and analyze the resulting games.

We planned on using the python binding pyffish to achieve this, but after looking through what access it grants, I realized that it seemingly does not have any commands to evaluate the board position. Am I wrong in my assessment, or is that not the right way to go about this?

ianfab commented 3 years ago

Yes, the python binding only exposes the game state, not search and evaluation. It basically targets use cases similar to python-chess, just with a much larger variety of games and less functionality. A typical way to automate engine access would be to use pyffish to maintain the game state and run the native engine in a subprocess, e.g., as in https://github.com/gbtami/fairyfishnet or https://github.com/ianfab/fairyfishtest. Since pyffish so far does not provide UCI/CECP protocol support out of the box (in contrast to e.g. python-chess), you need to implement the protocol communication with the engine process yourself, as in the aforementioned projects.
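
For illustration, a minimal sketch of that setup (not code from either of those projects; the engine path, variant, and time control are arbitrary assumptions):

```python
# pyffish keeps the game state; the native Fairy-Stockfish binary is driven over UCI
# in a subprocess. Engine path, variant and movetime are assumptions to adapt.
import subprocess
import pyffish as sf

VARIANT = "capablanca"             # any variant name known to pyffish / Fairy-Stockfish
ENGINE_PATH = "./fairy-stockfish"  # path to the compiled engine binary (assumption)

engine = subprocess.Popen(ENGINE_PATH, stdin=subprocess.PIPE, stdout=subprocess.PIPE,
                          universal_newlines=True, bufsize=1)

def send(cmd):
    engine.stdin.write(cmd + "\n")
    engine.stdin.flush()

def read_until(prefix):
    # read engine output until a line starting with the given token appears
    while True:
        line = engine.stdout.readline().strip()
        if line.startswith(prefix):
            return line

send("uci"); read_until("uciok")
send(f"setoption name UCI_Variant value {VARIANT}")
send("isready"); read_until("readyok")

start_fen = sf.start_fen(VARIANT)
moves = []
# play one self-play game: stop when no legal moves remain or after an arbitrary ply cap
while sf.legal_moves(VARIANT, start_fen, moves) and len(moves) < 200:
    cmd = f"position fen {start_fen}"
    if moves:
        cmd += " moves " + " ".join(moves)
    send(cmd)
    send("go movetime 100")        # 100 ms per move, just for illustration
    moves.append(read_until("bestmove").split()[1])

print(sf.get_fen(VARIANT, start_fen, moves))
send("quit")
```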

Sounds interesting. Will you be hand-picking the variants or (to some extent) randomly selecting the sets of rules to be tested? Have you been inspired by https://arxiv.org/abs/2009.04374?

Trainzack commented 3 years ago

Thanks for the clarification.

We intend to select the variants via a genetic algorithm, using self-play matches as part of the fitness function. We've seen that paper, but we actually only found it after we started researching for this project. I think it's most directly inspired by http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.693.4901.

ianfab commented 3 years ago

I like the idea. Designing the fitness function well should be key, I guess, since at least game balance, strategic depth/complexity, and rule complexity should all be quite important factors for a board game. Good luck with your project!

HGMuller commented 3 years ago

Why would you need to analyze the board position at all? It is the game result that is decisive, right? You cannot expect alpha-beta searches to produce reliable evaluations by analyzing a position if they are not extensively tuned for best performance in that variant. Just plugging in some randomized rules will not make it play the variant well. An engine needs more than just the game rules for that (e.g. piece values).

A practical example: the inventor of Gothic Chess has for a long time maintained that Capablanca Chess (a similar 10x8 variant) was flawed because of an unprotected Pawn in the initial position, which led to a forced winning line for white. My Capablanca engine refutes that line, though, by trading his Chancellor for an Archbishop. The assumption was that this would give white a winning advantage, because of a misconception about the Archbishop value. Self-play of the engine shows that white actually has the advantage in this line, though, because in reality Archbishop and Chancellor are of nearly equal value, and the half-open i-file white gets in the process is more than enough compensation.

Another problem I see is that when you make Stockfish play hundreds of games without an opening book, it would play (nearly) the same game hundreds of times. To do what you want requires some randomization by the engine. Although multi-threading effects cause some randomization, this will in general not be enough to make the games diverge rapidly enough. The results are then dominated by errors the engine makes in the common part. A variant can look totally unbalanced just because the engine has the tendency to play a strategically flawed move in the initial position, while plenty of sound moves are available. At the very least you would need to do some MCTS for the first few moves of each game to guard against that. (E.g. similar to XBoard's Monte-Carlo book mode.)
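
For illustration, a crude sketch of one way to get that divergence (uniformly random legal moves for the first few plies via pyffish, which is simpler than, and no substitute for, a proper MCTS book; the ply count is an arbitrary choice):

```python
# Diversify self-play openings by playing a few uniformly random legal plies
# via pyffish before handing the position to the engine.
import random
import pyffish as sf

def random_opening(variant, plies=6, seed=None):
    rng = random.Random(seed)
    start = sf.start_fen(variant)
    moves = []
    for _ in range(plies):
        legal = sf.legal_moves(variant, start, moves)
        if not legal:          # game already over before the opening finished
            break
        moves.append(rng.choice(legal))
    return start, moves        # hand these to the engine loop for the rest of the game

start_fen, opening = random_opening("capablanca", plies=6)
print(opening)
```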

musketeerchess commented 3 years ago

Hi HG

Regarding the forced win for white from the Capablanca Chess starting position claimed by the Gothic Chess inventor, I'm curious to see this position (moves and analysis). Do you have a link?

Best regards Zied

HGMuller commented 3 years ago

Hi Zied, long time no see! No, I don't remember the link, I just remember having seen it. But that was more than 10 years ago, when I was working on Joker80 (around 2008).

Trainzack commented 3 years ago

HG, you raise some valid points. Thank you for bringing them to our attention!

We could evaluate the value of the pieces ourselves using some kind of evolutionary algorithm (there has been some research done on that topic), although unfortunately that might be outside of the scope of this project due to limited time. We'll have to see just how quickly we can make progress.

As for the lack of game diversity, that's an excellent point. We hadn't taken into consideration that Stockfish would likely make the same moves each time, so it would have probably taken us quite a while to figure that out. We'll put in MCTS or something similar to more thoroughly explore the game space.

HGMuller commented 3 years ago

The most efficient way to determine piece values is by direct measurement (games starting from materially imbalanced positions), rather than tweaking the engine's idea of the value (by whatever means) and hoping it will have an effect on game results. E.g. if you tweak the Queen value in an orthodox Chess engine, it doesn't have much effect on games. If two engines with unequal Queen value play each other, they will still each think trading Queens is neutral. The fraction of games in which the possibility occurs to trade a Queen for 2 Rooks (let alone three minors, or R + minor + Pawn), and where the engines might thus take a different decision, is rather small. So in perhaps 90% of the games Q will be traded for Q, and the fact that they were valued differently has no effect at all on the result. Only the remaining 10% of the games will show the effect of the different decision, which dilutes the resulting score imbalance by a factor 10. To measure it with the same relative accuracy then takes 100 times as many games (since the statistical noise only goes down with the square root of the number of games). When your start position already has a Queen versus other material, you are no longer dependent on the small probability that the imbalance is created by chance. All games are then relevant for seeing how a Queen performs against the other material.
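
To make the idea concrete, a hypothetical sketch (the FEN is standard chess with black's Queen and white's Queen-side Rook and Knight removed, i.e. a white Q versus black R + N imbalance; play_game() is a placeholder for a self-play helper such as the UCI loop sketched earlier in this thread):

```python
# Direct measurement of a material imbalance: play many self-play games from an
# imbalanced start position and look at the overall score, instead of tweaking
# the engine's internal piece values. play_game() is a hypothetical helper and
# the game count is arbitrary.
IMBALANCED_FEN = "rnb1kbnr/pppppppp/8/8/8/8/PPPPPPPP/2BQKBNR w Kkq - 0 1"

def measure_imbalance(play_game, fen, games=200):
    score = 0.0
    for _ in range(games):
        result = play_game(fen)                       # "1-0", "0-1" or "1/2-1/2"
        score += {"1-0": 1.0, "1/2-1/2": 0.5, "0-1": 0.0}[result]
    return score / games                              # > 0.5: the Queen side outperforms R + N
```

In practice you would also play each position with colors reversed, to cancel out the first-move advantage.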

And material is just one evaluation parameter; other eval terms are usually much smaller (and thus even harder to measure), but can nevertheless be quite decisive when the material term is accurate. E.g. if you are able to recognize passed Pawns (and give a bonus for them), you will get a pretty good score against an opponent that doesn't. As they say, "the Pawn is the soul of Chess". But if your rule variations also allow changing the Pawn move, you have no idea what pattern to test for, let alone how much it is worth when it occurs.

You should also be aware that in Chess variants piece values pretty much lose their meaning in the late end-game. Having an extra Bishop doesn't mean you are ahead in KBK or KBKP. In a variant with only minor pieces, engine self-play might create the impression that it is very drawish when an engine doesn't know they are minors, while if all 2-vs-1 end-games are won, it might not be drawish at all, provided the engine knows it should avoid trading in such a situation. Note that wrong knowledge is often more harmful to engine strength than no knowledge at all.

I am not trying to discourage you; your project is very interesting. But it is also pretty hard; there will be many pitfalls. I am just pointing out some I am aware of, so you can avoid them. I would advise that for every change in piece moves you at least attempt to get somewhat realistic play before taking any game-play statistics at all (and never trust evaluation scores): gauge the piece values through some materially imbalanced self-play games, and build some end-game tables to determine which material combinations offer no winning prospects (and make sure the evaluation scores those as drawish). And perhaps it would be wise to stay away from modifying Pawns, so that you can stick to a well-developed standard evaluation for Pawn structures, rather than having to develop your own (e.g. through training neural nets).

ianfab commented 3 years ago

Considering that new piece types are currently not configurable yet (https://github.com/ianfab/Fairy-Stockfish/issues/58), my assumption was that they would just use all existing options for configurability (with a limited set of predefined pieces to choose from), and for all of those pieces a piece value is already defined anyway, which in most cases should be quite reasonable. The piece values also get dynamically adjusted based on a few severe rule changes, such as losing/anti or drop games, as well as board size modifications.

Of course it is obvious that the further away the rules are from the explicitly supported variants, the weaker the engine will play on average. After all, Fairy-Stockfish does not claim to be, and never was supposed to be, a real GGP (general game playing) engine; it is just a very pragmatic generalization to a more or less arbitrary space of chess-related games. However, as long as the rules at least roughly follow concepts of existing chess variants, the engine should play quite well, as can be seen, e.g., from several user-defined variants on pychess that Fairy-SF had never seen during its development, where even the time-handicapped version on the server is hard to beat for strong human players. But in the end, any imperfect player can at best give an indication of the true game-theoretical value of a position/game, no matter its strength.

The example of Capablanca/Gothic is a bit moot in this context, since the person who claimed that (and still does so) obviously is strongly biased and avoids any fact-based discussion. In contrast, Fairy-Stockfish has proven quite reliable when indicating forced wins in small-board variants, see https://github.com/ianfab/Fairy-Stockfish/wiki/Solved-games, giving results consistent with other work solving some of those games. However, finding a forced win from the starting position is of course far out of reach for almost all even moderately sized board variants.

HGMuller commented 3 years ago

It might be possible for Kyoto Shogi. My opening book for that gives sente an advantage of about 3 pieces against every possible defense, and when I let CrazyWa analyze one of the leaves for a day or so it often finds a forced mate.

As to the Capablanca issue: this cannot be blamed on the personality of the Gothic-Chess inventor. The assumption A << C was not an unreasonable one, and in fact the consensus in those days. After all, everyone knows that B << R, so why would BN ~ RN? I mentioned this as an example to show how little engine evaluation means; several engines, among which the one made by the Gothic-Chess inventor himself, would confirm the conclusion by showing a large positive score for white, because they all based the score on this flawed consensus value of the Archbishop. There is no hope (certainly not in those days) that an engine can search to checkmate from an early middle-game position in Capablanca Chess. And as long as you are dependent on a heuristic evaluation, it is garbage in, garbage out. The point I wanted to make is that a wrong piece value can easily lead to a wrong judgement on the variant if you trust evaluation scores. (In this case that Capablanca Chess was unplayable.)

Even nowadays Stockfish (the regular version) has much difficulty recognizing that 3 Queens vs 7 Knights is badly lost for the Queens; it starts with a +8 score in favor of the Queens, I think. But if you let it play out the games, the Knights of course always win. That shows that even for orthodox pieces with uncontroversial values an engine evaluation can be completely wrong.

Beating human players, even strong ones, is not too much of a feat anymore. Even engines rated around 2400 are often already too strong for the best humans, certainly in variants (where true grandmasters usually do not exist), and even more so in especially tactical variants.

[Edit] Ah, OK, I see Kyoto Shogi is already in your list. I didn't know Fairy-Stockfish played that too. It doesn't surprise me. The given move might not be the only winning one; it might not even be the fastest win. In analysing mini-Shogi with SF I noticed that SF has the tendency to stick to a winning move once it finds one, rather than switching to an even faster winning move, because it searches so much deeper on the best line than on the alternatives, and scores tend to rise along winning lines the deeper you get. So the deeply searched winning line will always look better than the less deeply searched better alternative.

ianfab commented 3 years ago

I think the initial question was answered, so I am closing this for now. If you have further questions or news regarding your project, feel free to comment and reopen this issue.