Timmoth / Sapling

A strong dotnet UCI Chess engine [v1.1.1] My leaf nodes are growing
https://iblunder.com
Apache License 2.0

keep on playing : 132 games & Gauntlet #7

Open tissatussa opened 3 days ago

tissatussa commented 3 days ago

Fascinated ..

While testing the recent Sapling versions (up to v1.0.5, see also my #4 and #6) I estimate its current rating at 3100. At first I picked opponent engines rated from 2500 and up, but it became clear Sapling is much stronger: it won almost all those games. So here are 132 games at 5m+3s (13 with Black and 119 with White) against engines rated 2900 up to 3400: here Sapling will show its limits ..

Download PGN: sapling-132-games-5m+3s.zip

(when 'ECO' is missing, the game had a custom starting position)

[image: sapling-13-games-black]


[image: sapling-119-games-white]

Most engines are 3000+. I also let a few really weak ones play; they obscure this list a bit ..

Ratings aren't shown in this table, although that info would make the list more relevant. I use the CuteChess GUI for tournaments (and normally just single games), and this program lacks the feature to give engines a rating; at least, this is unclear to me: when doing a tournament in CuteChess (GUI) a Result List is shown which has a column called 'Elo' .. how is that value calculated? Also see my recent issue https://github.com/cutechess/cutechess/issues/824 on their GitHub page; at the moment there's no reaction there. Am I missing something?

Anyhow, here's the result of a 5m+3s Gauntlet tournament: Sapling v1.0.5 playing White against 30 (mostly) equal and stronger opponents, from the start position. I added their ratings to the Result List. (Sapling won where Points is 0.0.)

Download PGN: Sapling-v1.0.5-Gauntlet-30x-5m3s.zip

Name                          Rating   Elo     +/-   Games   Points   Score    Draw 
----------------------------------------------------------------------------------------
Sapling v1.0.5 NNUE             ....    95     113      30     19.0   63.3%   26.7% 

RubiChess 2024 NNUE             3565     0       0       1      0.5   50.0%  100.0% 
RubiChess 2024 red NNUE         3565     0       0       1      0.5   50.0%  100.0% 
Renegade v1.1.9 dev NN          3552   inf     nan       1      1.0  100.0%    0.0% 
Minic v3.38 NNUE                3546   inf     nan       1      1.0  100.0%    0.0% 
rofChade v3.0 NN                3530   inf     nan       1      1.0  100.0%    0.0% 

Midnight v9 NN                  3395     0       0       1      0.5   50.0%  100.0% 
Reckless v0.7.0 NNUE            3385   inf     nan       1      1.0  100.0%    0.0% 
Marvin v6.3.0 NNUE              3376   inf     nan       1      1.0  100.0%    0.0% 
Marvin v6.2.0 NNUE              3376     0       0       1      0.5   50.0%  100.0% 
Marvin v6.2.0 noNNUE            3376  -inf     nan       1      0.0    0.0%    0.0% 
Saturn v1.3.0 NNUE              3374     0       0       1      0.5   50.0%  100.0% 

Nemorino v6.11 NNUE             3297     0       0       1      0.5   50.0%  100.0% 
Nemorino v6.11 noNNUE           3297  -inf     nan       1      0.0    0.0%    0.0% 
RukChess v3.0.19 NNUE           3273   inf     nan       1      1.0  100.0%    0.0% 
Quanticade v0.8 Aurora NN       3250  -inf     nan       1      0.0    0.0%    0.0% 
Molybdenum v4.0 NNUE            3235   inf     nan       1      1.0  100.0%    0.0% 

Mr Bob v1.3.0 NNUE              3166     0       0       1      0.5   50.0%  100.0% 
Monty v1.0.0 MCTS PR60          3150     0       0       1      0.5   50.0%  100.0% 

Pirarucu v3.3.5                 3082  -inf     nan       1      0.0    0.0%    0.0% 
Rodent-IV v0.32 Petrosian       3033  -inf     nan       1      0.0    0.0%    0.0% 
Rodent-IV v0.32 Spassky         3033  -inf     nan       1      0.0    0.0%    0.0% 
Mayhem v8.3 noNNUE              3000  -inf     nan       1      0.0    0.0%    0.0% 
Mayhem v8.3 red NNUE            3000  -inf     nan       1      0.0    0.0%    0.0% 
Mayhem v8.3 NNUE                3000  -inf     nan       1      0.0    0.0%    0.0% 

NGplay v9.90                    2194  -inf     nan       1      0.0    0.0%    0.0% 
Nicarao v0.3.1                  2100  -inf     nan       1      0.0    0.0%    0.0% 

Satana v2.4.19                  1411  -inf     nan       1      0.0    0.0%    0.0% 

neuroGrape v1.1 HCE             ????  -inf     nan       1      0.0    0.0%    0.0% 
neuroGrape v1.1 NN              ????  -inf     nan       1      0.0    0.0%    0.0% 
NiCim v3.4.1 NNUE               ????  -inf     nan       1      0.0    0.0%    0.0% 

I remember some calculation exists to determine the Gauntlet engine's rating from such a result list, when all opponent ratings are known? I guess v1.0.5 is 3100+.
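For reference, one common way to estimate such a performance rating is to find the rating at which the sum of Elo expected scores against the listed opponents equals the score actually achieved. A minimal Python sketch, using the opponent ratings from the table above and leaving out the three '????' engines (all beaten), so the score against the 27 rated opponents is 19.0 - 3.0 = 16.0:

```python
# Minimal sketch: estimate a performance rating by finding the rating at which
# the sum of Elo expected scores equals the score actually achieved.
# Opponent ratings are the 27 rated engines from the table above; the three
# '????' engines (all beaten) are left out, so the score used is 16.0.

def expected_score(own_rating, opp_rating):
    """Standard Elo expected score against a single opponent."""
    return 1.0 / (1.0 + 10 ** ((opp_rating - own_rating) / 400.0))

def performance_rating(opp_ratings, actual_score, lo=0.0, hi=4000.0):
    """Bisect for the rating whose expected total matches the actual score."""
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if sum(expected_score(mid, r) for r in opp_ratings) < actual_score:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

opponents = [3565, 3565, 3552, 3546, 3530, 3395, 3385, 3376, 3376, 3376,
             3374, 3297, 3297, 3273, 3250, 3235, 3166, 3150, 3082, 3033,
             3033, 3000, 3000, 3000, 2194, 2100, 1411]
print(round(performance_rating(opponents, 16.0)))
```

With only one game per opponent this is obviously a very rough number; tools like Ordo or BayesElo do the same job more carefully over many games.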

What about v1.1.0? Is it much stronger? Does it play differently?

Timmoth commented 3 days ago

Amazing! Thank you :) I'm actually developing my own tournament runner and plan on automatically pulling the latest version of each chess engine, constantly updating its Elo, and recording loads of useful information (average thinking time, nodes per second, depth searched etc.). It's going to be up on osccel.com (Open Source Computer Chess Engine League), pronounced /oʊ sɛl/ (oh-sell), when it's done, hopefully over the weekend.
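Most of those statistics are already present in the standard UCI 'info' lines an engine prints while searching, so collecting them is mostly a matter of parsing those lines. A rough sketch; the field names (depth, nodes, nps, time, score cp/mate, pv) are from the UCI protocol, while the sample line itself is invented for illustration:

```python
# Rough sketch: pull search statistics out of standard UCI "info" lines.

def parse_uci_info(line):
    tokens = line.split()
    if not tokens or tokens[0] != "info":
        return {}
    stats = {}
    i = 1
    while i < len(tokens):
        key = tokens[i]
        if key in ("depth", "seldepth", "nodes", "nps", "time", "hashfull"):
            stats[key] = int(tokens[i + 1])
            i += 2
        elif key == "score":
            stats["score_" + tokens[i + 1]] = int(tokens[i + 2])  # cp or mate
            i += 3
        elif key == "pv":
            stats["pv"] = tokens[i + 1:]  # the rest of the line is the PV
            break
        else:
            i += 1  # skip tokens we don't track (multipv, currmove, ...)
    return stats

# Invented sample line, just to show the shape of the output dict.
print(parse_uci_info("info depth 24 nodes 18600000 nps 3100000 time 6000 score cp 35 pv e2e4 e7e5"))
```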

Maybe you'd like to use that for your own experiments; if it works for you, I'd be able to add the features you'd like.

V1.1.0 is about 30 Elo stronger; the only major change was the increased network size. I hope that with the next training round tonight it'll gain even more, since there is so much more data! ~1.5bn positions

tissatussa commented 3 days ago

> Amazing! Thank you :)

It's rare I spend that much 'effort' on one engine .. but in this case I also updated my engine archive a bit, using Sapling as a kind of reference .. a number as high as 132 games I didn't expect; must be fun ..

> I'm actually developing my own tournament runner and plan on automatically pulling the latest version of each chess engine, constantly updating its Elo, and recording loads of useful information (average thinking time, nodes per second, depth searched etc.). It's going to be up on osccel.com (Open Source Computer Chess Engine League), pronounced /oʊ sɛl/ (oh-sell), when it's done, hopefully over the weekend.

I once created some terminal scripts in Python to display search data when solving STS bm & am puzzles. You're creating all code in dotnet / C#? Pity, I'm only on Linux, and have trouble compiling such M$ code ..

> Maybe you'd like to use that for your own experiments; if it works for you, I'd be able to add the features you'd like.

Your assets give no problems here!

> V1.1.0 is about 30 Elo stronger; the only major change was the increased network size. I hope that with the next training round tonight it'll gain even more, since there is so much more data! ~1.5bn positions

How do you create / select the training data? And what about those many STS bm / am puzzles? Are they valuable? How? Can we prove anything, or is it all statistics & self-play? I tend to look for positions that prove something, like "find the move within X seconds".


Timmoth commented 3 days ago

> It's rare I spend that much 'effort' on one engine .. but in this case I also updated my engine archive a bit, using Sapling as a kind of reference .. a number as high as 132 games I didn't expect; must be fun ..

Honestly I've been blown away by your support, you've really energized me to make the engine the best it can be!

> I once created some terminal scripts in Python to display search data when solving STS bm & am puzzles. You're creating all code in dotnet / C#? Pity, I'm only on Linux, and have trouble compiling such M$ code ..

Yeah, though the new system will probably have a decent amount of Python too. If you're ever interested I'd be happy to help you get dotnet compilation working on Linux! It should be fully supported, so it's probably just a build option you're missing.

> How do you create / select the training data?

There is a 'datagen' function you can run in Sapling which outputs a set of training data in the bullet format. In a nutshell, it plays the first 9 moves randomly, then plays itself and records the evaluation and result of all quiet positions, searching to a fixed number of nodes. I usually generate around 1.5 billion positions to train a new network at the moment. I actually had to buy some new hardware for this because it's a lengthy process, usually taking 6 servers around 2 days for a new net! Though it's getting good enough now that I can start to re-use data from the previous net.

I then feed that data into a program called 'bullet trainer' which runs on the GPU to generate the weights. From what I can tell most people don't bother with this; they just use the 'Leela' data set, which is absolutely amazing, but it doesn't give your engine a unique play style.
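Not Sapling's actual implementation (the real datagen lives in the C# engine), but the loop described above could be sketched roughly like this; python-chess handles the board, and search() here is only a toy stand-in for the real fixed-node engine search:

```python
# Rough sketch of the self-play data generation loop described above.
# search() is a toy stand-in (random move + crude material count); the real
# generator would call the engine's own search and write bullet-format records.
import random
import chess

RANDOM_PLIES = 9        # the first moves are played randomly
NODES_PER_MOVE = 5000   # fixed node budget per search (illustrative value)

def search(board, nodes):
    """Stand-in for the engine's fixed-node search: random move plus a crude
    material count in centipawns, from the side to move's point of view."""
    values = {chess.PAWN: 100, chess.KNIGHT: 300, chess.BISHOP: 300,
              chess.ROOK: 500, chess.QUEEN: 900}
    score = sum(v * (len(board.pieces(p, board.turn)) -
                     len(board.pieces(p, not board.turn)))
                for p, v in values.items())
    return random.choice(list(board.legal_moves)), score

def is_quiet(board, move):
    """Very rough 'quiet position' filter: no capture, not in check."""
    return not board.is_capture(move) and not board.is_check()

def generate_game():
    board = chess.Board()
    records = []  # (fen, eval_cp) pairs, labelled with the result afterwards

    # 1. Randomize the opening so games don't repeat.
    for _ in range(RANDOM_PLIES):
        if board.is_game_over():
            return []
        board.push(random.choice(list(board.legal_moves)))

    # 2. Self-play with a fixed node count, recording quiet positions.
    while not board.is_game_over():
        best_move, eval_cp = search(board, NODES_PER_MOVE)
        if is_quiet(board, best_move):
            records.append((board.fen(), eval_cp))
        board.push(best_move)

    # 3. Attach the game result (1.0 / 0.5 / 0.0 from White's view) to every record.
    result = {"1-0": 1.0, "1/2-1/2": 0.5, "0-1": 0.0}[board.result()]
    return [(fen, eval_cp, result) for fen, eval_cp in records]
```

The real generator would of course use the engine's own evaluation and write the records in the trainer's expected binary format rather than plain FEN strings.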

> And what about those many STS bm / am puzzles? Are they valuable? How? Can we prove anything, or is it all statistics & self-play? I tend to look for positions that prove something, like "find the move within X seconds".

Can you explain what you mean by this? What are 'STS bm / am puzzles'?

tissatussa commented 3 days ago

Nice info, thanks.

> Can you explain what you mean by this? What are 'STS bm / am puzzles'?

See https://www.chessprogramming.org/Strategic_Test_Suite . The 'WAC' file is famous; here's a newer version I once found: wac.zip. 'am' is Avoid Move and 'bm' means Best Move. Look at the EPD syntax.
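For context: an EPD record is the first four FEN fields (piece placement, side to move, castling, en passant) followed by semicolon-terminated opcodes, where 'bm' / 'am' list the move(s) to find or avoid and 'id' names the position. A tiny parsing sketch; the sample line is constructed for illustration, not taken from WAC:

```python
# Tiny EPD sketch: the first four fields are the FEN board part, the rest are
# semicolon-terminated opcodes such as bm (best move), am (avoid move), id.

def parse_epd(line):
    fields = line.split()
    position = " ".join(fields[:4])  # placement, side to move, castling, ep
    opcodes = {}
    for op in " ".join(fields[4:]).split(";"):
        op = op.strip()
        if op:
            name, _, value = op.partition(" ")
            opcodes[name] = value.strip('"')
    return position, opcodes

# Constructed example (Ruy Lopez-like position), not a real WAC entry.
sample = 'r1bqkbnr/pppp1ppp/2n5/4p3/4P3/5N2/PPPP1PPP/RNBQKB1R w KQkq - bm Bb5; id "example.001";'
print(parse_epd(sample))
```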

tissatussa commented 3 days ago

> If you're ever interested I'd be happy to help you get dotnet compilation working on Linux! It should be fully supported, so it's probably just a build option you're missing.

Yes, I would really like to be able to use dotnet / C# compilation here .. indeed I must be missing something simple .. it's about conflicting dotnet versions, versions not in the PATH, or both .. I don't know .. and 'msbuild' is executed but doesn't exist .. I wish M$ had never set foot on our base :-)

[image: linux-rules]

tissatussa commented 3 days ago

About STS: https://github.com/fsmosca/STS-Rating

Timmoth commented 2 days ago

> See https://www.chessprogramming.org/Strategic_Test_Suite . The 'WAC' file is famous; here's a newer version I once found: wac.zip. 'am' is Avoid Move and 'bm' means Best Move. Look at the EPD syntax.

That's awesome! I didn't know that existed, but I'll give the suite a go tonight. Anything else you know like that, please share; it's really useful.

> Yes, I would really like to be able to use dotnet / C# compilation here

Well, I don't know if you're on Discord, but I'd be happy to jump on a call with you to help get it working; it should take 10 mins max.

tissatussa commented 2 days ago

> Anything else you know like that, please share; it's really useful.

Well, I wrote about 'Patricia', see https://github.com/Adam-Kulju/Patricia , a new and very aggressive engine .. the author mentions Stefan Pohl's EAS tool at https://www.sp-cc.de/eas-ratinglist.htm , which was a great help for him .. I'm not familiar with this tool but I guess you'll appreciate it.

tissatussa commented 2 days ago

About STS: there are many .epd files with such 'puzzles' .. not all are relevant any more, because many are from an older age when computers were much slower, so I guess you should select and judge them .. I gathered many of those files; I can ZIP some for you.

tissatussa commented 2 days ago

And I will mention the tool 'analyse-pgn' which I once found at https://github.com/mrdcvlsc/analyse-pgn .. see also an issue there from me (I'm tissatussa) .. I remember this tool is OK, but I changed some code to suit my needs .. it's not for those .epd 'puzzles' but just to let an engine analyse a game, like the Lichess graphs (but analyse-pgn only gives textual output).

tissatussa commented 2 days ago

> Yes, I would really like to be able to use dotnet / C# compilation here
>
> Well, I don't know if you're on Discord, but I'd be happy to jump on a call with you to help get it working; it should take 10 mins max.

About my problems compiling dotnet code: I can use Visual Studio Code on Linux, but I don't understand that program .. I like to use terminal-based scripts and commands to compile .. I have several dotnet versions (I remember 6, 7 and 8) but they're scattered around my OS file tree, probably from wrong installations / PATHs etc. .. some C# packages having a .sln file require a specific dotnet version to compile, and I often get many (different) errors and warnings I can't solve .. I'm not new to Linux, I dare and can do extensive hacks under the hood, but this chapter is a hard one for me. And I don't like M$, I prefer (your) assets. I appreciate your help, but let's freeze this subject for now. I know Discord, but I'm not "on it" regularly .. though sometimes I indeed got some great help there.

tissatussa commented 2 days ago

Another one (you got me triggered) is my rather old (2021) issue regarding the Bagatur engine, 'develop NN', see https://github.com/bagaturchess/Bagatur/issues/16 .. it's a long one, with much info, not all of which may be relevant to you, but it's a nice read with lots of info & questions & ideas .. then and now Bagatur is a special creation, written in Java, but I no longer bother to compile & use it, because it needs a bunch of files and settings just to run; the worst config I've seen so far ..

That 4N position is interesting; read my description and comments in that issue .. I just took that FEN and did a quick test with Sapling v1.1.1: it has no problem winning with Black, even against the strongest engines, like SF 17.

tissatussa commented 2 days ago

And I just saw https://github.com/KierenP/Halogen/pull/573/commits/a224b9e9ced3f48a2bea96d7a772528a8673f04a .. it's just one of the many updates of the Halogen engine (NN, rating 3460) .. it uses 'fastchess', which seems to be a well-known and respected tool for engine development .. here an .epd file is processed .. I didn't work with fastchess though .. if you didn't know this tool exists, it might be another help.