Open Algorhythm-sxv opened 7 months ago
So there are some technical limitations to this, in that I don't see an immediate way to tie together a PGN with any particular stderr output. So at best I would be uploading the entire contents of stderr.
What exactly is the issue with the go nodes approach, other than the fact that you don't have such a script readily made? I can't see how having a reproducible crash could fail to be better than just having stderr, which may or may not have anything useful in it.
My initial thought was indeed to include the entire stderr output, since that's where any language-specific crash errors will be sent. However, if pumping GBs of data over stderr and messing with the server is of concern, then a more limited amount of stderr could be saved. cutechess-cli streams data into the file as it comes, so the data can then be manipulated with normal file system utilities/code.
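As an illustration of the "more limited amount of stderr" idea, here is a minimal sketch that keeps only the tail of a log file. The function name `read_stderr_tail` and the 64 KiB default are assumptions made for the example, not anything OpenBench or cutechess-cli defines.

```python
import os


def read_stderr_tail(path, max_bytes=64 * 1024):
    """Return at most the last `max_bytes` bytes of a log file as text.

    Hypothetical helper: the name and the default cap are illustrative.
    """
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        # Skip everything except the final max_bytes of the file.
        f.seek(max(0, size - max_bytes))
        data = f.read()
    # A crashing engine may emit partial or non-UTF-8 output, so decode leniently.
    return data.decode("utf-8", errors="replace")
```

Keeping the tail rather than the head is a deliberate choice here: a panic or exception message is normally the last thing an engine writes before it dies.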
For me currently, the primary issue with the go nodes approach is that the requirement for go nodes and reported nodes per move to align exactly is quite a strong one, with some subtleties around what counts as a node, abort/timeout reports and off-by-1 errors. I have not invested effort into this previously, so my engine will certainly not be able to use this approach as-is, and I'm sure I'm not alone in that regard.
My engine is written in Rust, which like many other languages outputs detailed messages over stderr when a panic/exception occurs. This alone has been enough to solve most crashes I have had, only rarely needing to resort to the debugger.
Of course perfect reproducibility is more powerful than a potentially unhelpful error message, but there is nothing to be lost from the programmer's perspective from having both. Including stderr as a PGN comment could even have everything in one file.
If implementation effort is a concern I can try it myself and make a PR, but given my general lack of Python/Django experience I can't promise a high-quality implementation.
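The suggestion of carrying stderr inside the PGN as a comment could look roughly like the sketch below. It assumes the stderr text is already available as a string; `stderr_as_pgn_comment` and the 4000-character cap are hypothetical names and values chosen for the example. The one real constraint it handles is that a PGN brace comment ends at the first closing brace, so braces in the engine output have to be replaced.

```python
def stderr_as_pgn_comment(stderr_text, max_chars=4000):
    """Wrap the tail of some stderr output in a PGN brace comment.

    Hypothetical helper: name and size cap are illustrative only.
    """
    # Keep only the end of the output, where a crash message would be.
    tail = stderr_text[-max_chars:]
    # A brace comment terminates at the first '}', so braces are swapped out.
    safe = tail.replace("{", "(").replace("}", ")")
    return "{ stderr: " + safe.strip() + " }"
```

Where the comment is placed in the game text (for example, after the final move) is left open here, since that depends on how the PGN is written out.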
I'm mildly concerned about engines printing a ton of information to stderr. But I don't know if that is the case. And frankly it would probably be an issue if it were.
What I can do is launch cutechess, and redirect stderr to cutechess.stderr.dev.log and cutechess.stderr.base.log. Currently, when a workload finishes, the PGN is parsed. If any crashes, time losses, stalls, disconnects, or illegal moves appear in the PGN, the game is pulled out, and uploaded to the server as an error. I can attach the entire stderr log to that as well.
This will ultimately lead to additional information, i.e. if you managed 10 crashes on a single workload, you'll be getting 10 crashes' worth of stderr in every PGN of every error report made to the server. But presumably errors are rare enough that this is not a concern.
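The flow described above (scan the finished PGN, pull out the games that ended in an error, attach the whole stderr log to each) can be sketched as follows. This is not OpenBench's actual code: the marker strings in `ERROR_MARKERS`, the function names, and the report shape are all assumptions, and the exact wording cutechess uses for crashes, time losses, stalls, disconnects, and illegal moves should be checked against real output.

```python
import re

# Assumed substrings that identify an errored game; verify against real PGNs.
ERROR_MARKERS = ("disconnect", "stall", "illegal move", "on time", "abandoned")


def split_games(pgn_text):
    """Split a multi-game PGN into one string per game."""
    # Each game is assumed to start with an [Event ...] tag at the start of a line.
    parts = re.split(r"(?m)^(?=\[Event )", pgn_text)
    return [p.strip() for p in parts if p.strip()]


def error_reports(pgn_text, stderr_text):
    """Pair every errored game with the full stderr log of the workload."""
    reports = []
    for game in split_games(pgn_text):
        lowered = game.lower()
        if any(marker in lowered for marker in ERROR_MARKERS):
            # The same stderr text is attached to every report, as described above.
            reports.append({"pgn": game, "stderr": stderr_text})
    return reports
```

With this shape, a workload with several crashes produces several reports that each carry the complete log, which matches the trade-off noted above.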
I would be curious though to see the 1 in 10000 PGN you have, and the branch version of your engine that it came from. I would be surprised if I could not reproduce it, even if I can't type "go nodes".
In my case the crashes are rare enough that I can't reproduce them locally, even running 20000 games so far. In my experience crashes have either been extremely rare or extremely common. The common ones are of course easy to reproduce locally but this particular one has evaded me.
Well, I mean I would like to try to reproduce the game directly from the PGN, if you have it. I will also need the branch name, network (if any), and the engine options like hash/threads.
Here is the PGN of the most recent crash:
This is from this test running this commit
With previous errors I have tried using pgn-extract to set the engine up with the final position, but didn't see any errors searching there.
Based on that PGN, it's nmp_tweaks ee672c3944c738ab8686c5d3f71945a96f96083c that had the error.
Yes you're right, I copied the wrong hash
Managed to reproduce the first 53 moves. After that the lack of fixed-node play is rough. After adding all the needed stops to get actual
So while I could reproduce a future one, I'm not going to invest in this current one. But you can brute-force your way to the answer if you really care.
I appreciate the effort to try. Correcting fixed-node play is now definitely something I will invest time in, but that discussion is separate to this issue about stderr.
I have now updated my engine to accurately play to node counts, and caught another crash on this test.
Using a python-chess script I can perfectly recreate the game from the PGN using node counts, but get no crash (on my system) once I search the final position.
I am fairly confident that my implementation of accurate node counting lines up between time-based and node-based aborts, as time aborts are checked when nodes % 2048 == 2047 and go movetime always results in a node count that is one less than a multiple of 2048.
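To make the alignment argument concrete, here is a toy model of such a search loop. It is not the engine's code (the engine is written in Rust): the `search` function is hypothetical, and the clock is faked by counting how many times it is consulted. It only shows that if the clock is checked when nodes % 2048 == 2047, a time abort stops on a count one less than a multiple of 2048, and replaying with that count as a node limit stops at the same node.

```python
CHECK_INTERVAL = 2048


def search(node_limit=None, time_up_after_checks=None):
    """Toy search loop; returns the node count at which it stopped.

    `time_up_after_checks` stands in for a real clock: the search is treated
    as out of time on the N-th clock check (an assumption of this sketch).
    """
    if node_limit is None and time_up_after_checks is None:
        raise ValueError("need a node limit or a time limit")
    nodes = 0
    checks = 0
    while True:
        nodes += 1  # one node visited
        if node_limit is not None and nodes >= node_limit:
            return nodes  # node-based abort, exact
        if nodes % CHECK_INTERVAL == CHECK_INTERVAL - 1:
            checks += 1  # the clock is only consulted here
            if time_up_after_checks is not None and checks >= time_up_after_checks:
                return nodes  # time-based abort


timed = search(time_up_after_checks=3)  # stands in for a `go movetime` search
replayed = search(node_limit=timed)     # stands in for a `go nodes` replay
```

In this model `timed` and `replayed` are equal. A real engine additionally has to agree on what counts as a node in the first place, which this sketch does not address.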
I am completely at a loss as to what might be causing this crash, and access to the stderr output would help immensely.
Here is the python script that I am using to recreate the game:
import logging

import chess
import chess.engine
import chess.pgn

logging.basicConfig(level=logging.DEBUG)

# Load the game that ended in a crash.
with open("error_base.pgn") as pgn:
    game = chess.pgn.read_game(pgn)

dev = chess.engine.SimpleEngine.popen_uci("dev.exe", debug=True)
dev.configure({"Threads": 1, "Hash": 8})
base = chess.engine.SimpleEngine.popen_uci("base.exe", debug=True)
base.configure({"Threads": 1, "Hash": 8})

board = game.board()
for node in game.mainline():
    # The node count is the fourth space-separated field of the move comment.
    comment = node.comment
    node_count = int(comment.split(' ')[3].rstrip(','))

    # node.turn() is the side to move after this move, so negate it to get the mover.
    color = not node.turn()
    engine_to_play = dev
    if color == chess.WHITE:
        engine_to_play = base

    # Replay the move with the recorded node count and check that it matches the PGN.
    result = engine_to_play.play(board, chess.engine.Limit(nodes=node_count))
    print(node_count, node.move, result.move)
    assert result.move == node.move
    board.push(result.move)

# Search the final position, where the crash was reported.
result = base.play(board, chess.engine.Limit(time=0.01))
print(result)

base.quit()
dev.quit()
I currently have an issue with my engine that causes rare crashes in 50-move draws (~1/10000 games). OB reports the crash and the PGN of the offending game, but it completely discards the stderr output of the crashed engine, which is a very important diagnostic tool that can significantly speed up the tracking down of bugs.
Currently the only option is to ensure your engine is perfectly accurate in both the reporting of the node count and the playing of the node count with go nodes, and even then a separate script must be made that calls go nodes repeatedly and leaves the engine in the state where the crash occurred.
A lot of hassle in this regard could be avoided by saving stderr with cutechess-cli's built-in stderr option and showing that in the Errors view of the dashboard.
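For illustration, a cutechess-cli invocation that uses the per-engine stderr option might be assembled like this. The engine names, time control, game count, and output paths are placeholders, the helper functions are hypothetical, and the log file names are simply the ones mentioned elsewhere in this thread. It is a sketch of the idea, not how OpenBench actually launches cutechess.

```python
def engine_args(name, cmd, stderr_log):
    """Arguments for one engine, redirecting its stderr to a file."""
    return ["-engine", f"name={name}", f"cmd={cmd}", f"stderr={stderr_log}"]


def cutechess_command(dev_cmd, base_cmd):
    """Build an argument list for a short dev-vs-base match (placeholder settings)."""
    args = ["cutechess-cli"]
    args += engine_args("dev", dev_cmd, "cutechess.stderr.dev.log")
    args += engine_args("base", base_cmd, "cutechess.stderr.base.log")
    args += ["-each", "proto=uci", "tc=10+0.1"]
    args += ["-games", "2", "-pgnout", "games.pgn"]
    return args
```

The resulting list could be passed to `subprocess.run`; keeping one log per engine makes it clear which side produced a given crash message.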