ASPP / pelita

Actor-based Toolkit for Interactive Language Education in Python
https://github.com/ASPP/pelita_template

Add error count to contrib/ci_engine #802

Closed Debilski closed 2 weeks ago

Debilski commented 1 month ago

The CI engine should take note when a bot fails with a fatal error to help with debugging. (Also the seed should be stored with the game info.)

otizonaizit commented 1 month ago

well yes, the CI engine needs a serious revamp... should I work on it or are you already doing stuff? I'd like for the thing to at least read the same conf file as the pelita-server, so that we don't have to specify the list of players twice.

Debilski commented 1 month ago

Yeah, I’ll do some minor refactorings later to make it more useful. Scores so far:

                # name matches score (1/0/-1)
                aspp2021_4  219   0.88
                aspp2023_2  217   0.75
                aspp2022_2  218   0.64
                aspp2021_3  218   0.61
                aspp2021_1  217   0.45
                aspp2019_3  218   0.37
                aspp2022_0  217   0.31
                aspp2023_4  217   0.18
                aspp2021_0  217  -0.02
                aspp2021_2  218  -0.32
                aspp2019_1  218  -0.37
                aspp2022_1  217  -0.41
                aspp2019_4  217  -0.47
                aspp2022_4  218  -0.49
                aspp2022_3  217  -0.58
                aspp2019_2  217  -0.59
                aspp2019_0  218  -0.94

otizonaizit commented 1 month ago

oh, and the TU players are not performing at all? Wouldn't it be easier to interpret the results if instead of score one would show percent-win? Otherwise it is difficult to distinguish a bot who draws all the time versus a bot who either wins or loses with 50% probability. Also percent-win will then be more independent from the number of matches than score is...
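To illustrate the point about draws, here is a minimal sketch. It assumes the score is the mean per-match result with win = 1, draw = 0, loss = -1, as the "(1/0/-1)" in the table header suggests. The function names are hypothetical and not taken from the CI engine.

```python
def score(wins, draws, losses):
    """Mean result per match, counting win=1, draw=0, loss=-1 (assumed)."""
    matches = wins + draws + losses
    return (wins - losses) / matches if matches else 0.0


def win_percentage(wins, draws, losses):
    """Share of matches won, in percent."""
    matches = wins + draws + losses
    return 100.0 * wins / matches if matches else 0.0


# Two made-up bots: one always draws, one wins or loses with equal frequency.
always_draws = (0, 100, 0)
coin_flip = (50, 0, 50)

for name, record in [("always_draws", always_draws), ("coin_flip", coin_flip)]:
    print(name, score(*record), win_percentage(*record))
```

Both bots get the same score, while the win percentage separates them.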

Debilski commented 1 month ago

(I forgot to add the TU players to the config)

Yeah, the output has a much bigger table with all this info (more or less). But it is 2d and needs to be shrunk :)

Debilski commented 1 month ago

┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┓
┃ Name                         ┃ # Matches ┃ # Wins ┃ # Draws ┃ # Losses ┃ Score                ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━┩
│ aspp2021_4                   │ 262       │ 233    │ 6       │ 23       │ 0.8015267175572519   │
│ tube2024_0                   │ 139       │ 118    │ 8       │ 13       │ 0.7553956834532374   │
│ bayes_avengers               │ 139       │ 116    │ 8       │ 15       │ 0.7266187050359713   │
│ aspp2023_2                   │ 248       │ 201    │ 8       │ 39       │ 0.6532258064516129   │
│ aspp2021_3                   │ 251       │ 191    │ 3       │ 57       │ 0.5338645418326693   │
│ aspp2022_2                   │ 266       │ 200    │ 1       │ 65       │ 0.5075187969924813   │
│ tube2024_1                   │ 139       │ 99     │ 4       │ 36       │ 0.45323741007194246  │
│ shake_dat_botty              │ 139       │ 95     │ 3       │ 41       │ 0.38848920863309355  │
│ aspp2021_1                   │ 256       │ 174    │ 6       │ 76       │ 0.3828125            │
│ aspp2019_3                   │ 257       │ 169    │ 3       │ 85       │ 0.32684824902723736  │
│ trilobots                    │ 138       │ 80     │ 7       │ 51       │ 0.21014492753623187  │
│ aspp2022_0                   │ 251       │ 146    │ 8       │ 97       │ 0.1952191235059761   │
│ tube2024_3                   │ 139       │ 80     │ 3       │ 56       │ 0.17266187050359713  │
│ aspp2023_4                   │ 258       │ 131    │ 14      │ 113      │ 0.06976744186046512  │
│ too_bot_to_handle            │ 140       │ 72     │ 2       │ 66       │ 0.04285714285714286  │
│ aspp2021_0                   │ 242       │ 97     │ 10      │ 135      │ -0.15702479338842976 │
│ drbabydangers                │ 138       │ 48     │ 18      │ 72       │ -0.17391304347826086 │
│ group4_2022_this_time_moving │ 138       │ 53     │ 6       │ 79       │ -0.18840579710144928 │
│ dogues_de_bordeaux           │ 138       │ 43     │ 4       │ 91       │ -0.34782608695652173 │
│ aspp2021_2                   │ 256       │ 68     │ 29      │ 159      │ -0.35546875          │
│ tube2024_2                   │ 139       │ 41     │ 6       │ 92       │ -0.3669064748201439  │
│ aspp2019_1                   │ 243       │ 35     │ 67      │ 141      │ -0.43621399176954734 │
│ aspp2022_1                   │ 266       │ 53     │ 32      │ 181      │ -0.48120300751879697 │
│ aspp2022_4                   │ 244       │ 29     │ 43      │ 172      │ -0.5860655737704918  │
│ aspp2019_4                   │ 245       │ 48     │ 1       │ 196      │ -0.6040816326530613  │
│ aspp2022_3                   │ 251       │ 44     │ 6       │ 201      │ -0.6254980079681275  │
│ aspp2019_2                   │ 256       │ 35     │ 2       │ 219      │ -0.71875             │
│ aspp2019_0                   │ 139       │ 2      │ 8       │ 129      │ -0.9136690647482014  │
└──────────────────────────────┴───────────┴────────┴─────────┴──────────┴──────────────────────┘

aspp2019_0 is definitely a little underwhelming? (They remove nodes with enemy bots from the graph and simply stop when this means that the graph is disconnected. I’m close to helping them out a little to perform better. :) )

otizonaizit commented 1 month ago

Don't you think it would make sense to change the logic of the CI to try to even out the number of matches played? If a good team enters the CI later, then the ranking is skewed towards the good teams that had a chance to play more matches, and it will take a very long time to even out that effect. Or the ranking should be based on percent-win.
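One possible way to even out match counts could look like the sketch below. This is not the CI engine's actual scheduling logic; `pick_next_match` and `match_counts` are hypothetical names, and the rule (least-played bot first, random opponent) is only an assumption for illustration.

```python
import random


def pick_next_match(match_counts, rng=random):
    """Pick one of the least-played bots and a random opponent for it.

    match_counts maps bot name -> number of matches played so far
    and must contain at least two bots.
    """
    fewest = min(match_counts.values())
    # Sort the names so the choice is reproducible for a seeded rng.
    least_played = sorted(n for n, c in match_counts.items() if c == fewest)
    first = rng.choice(least_played)
    opponents = sorted(n for n in match_counts if n != first)
    second = rng.choice(opponents)
    return first, second


# Made-up counts: a newly added bot has not played yet.
counts = {"old_bot_a": 250, "old_bot_b": 260, "new_bot": 0}
print(pick_next_match(counts, random.Random(0)))
```

With a rule like this, a late entrant is always scheduled until it has caught up with the rest of the field.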


Debilski commented 1 month ago

But the logic already does that. It just takes a while.

┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━┳━━━━━━┓
┃ Name                         ┃ # Matches ┃ # Wins ┃ # Draws ┃ # Losses ┃ Score  ┃ ELO  ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━╇━━━━━━┩
│ aspp2021_4                   │ 669       │ 569    │ 25      │ 75       │  0.738 │ 1965 │
│ tube2024_0                   │ 669       │ 563    │ 35      │ 71       │  0.735 │ 1993 │
│ bayes_avengers               │ 669       │ 561    │ 37      │ 71       │  0.732 │ 1920 │
│ aspp2023_2                   │ 669       │ 524    │ 22      │ 123      │  0.599 │ 1839 │
│ tube2024_1                   │ 668       │ 470    │ 27      │ 171      │  0.448 │ 1858 │
│ aspp2021_3                   │ 670       │ 479    │ 11      │ 180      │  0.446 │ 1722 │
│ shake_dat_botty              │ 670       │ 464    │ 12      │ 194      │  0.403 │ 1680 │
│ aspp2022_2                   │ 670       │ 457    │ 11      │ 202      │  0.381 │ 1767 │
│ aspp2021_1                   │ 670       │ 422    │ 10      │ 238      │  0.275 │ 1645 │
│ aspp2019_3                   │ 669       │ 418    │ 9       │ 242      │  0.263 │ 1616 │
│ trilobots                    │ 668       │ 401    │ 22      │ 245      │  0.234 │ 1697 │
│ aspp2022_0                   │ 670       │ 398    │ 21      │ 251      │  0.219 │ 1528 │
│ tube2024_3                   │ 671       │ 400    │ 10      │ 261      │  0.207 │ 1633 │
│ too_bot_to_handle            │ 668       │ 359    │ 6       │ 303      │  0.084 │ 1566 │
│ aspp2023_4                   │ 669       │ 321    │ 25      │ 323      │ -0.003 │ 1415 │
│ group4_2022_this_time_moving │ 669       │ 304    │ 22      │ 343      │ -0.058 │ 1492 │
│ aspp2021_0                   │ 669       │ 262    │ 19      │ 388      │ -0.188 │ 1362 │
│ drbabydangers                │ 671       │ 222    │ 84      │ 365      │ -0.213 │ 1381 │
│ dogues_de_bordeaux           │ 669       │ 254    │ 12      │ 403      │ -0.223 │ 1454 │
│ aspp2021_2                   │ 669       │ 198    │ 47      │ 424      │ -0.338 │ 1327 │
│ tube2024_2                   │ 668       │ 187    │ 35      │ 446      │ -0.388 │ 1272 │
│ aspp2019_1                   │ 669       │ 103    │ 179     │ 387      │ -0.425 │ 1262 │
│ aspp2022_1                   │ 669       │ 123    │ 97      │ 449      │ -0.487 │ 1228 │
│ aspp2022_4                   │ 669       │ 81     │ 104     │ 484      │ -0.602 │ 1182 │
│ aspp2022_3                   │ 669       │ 121    │ 14      │ 534      │ -0.617 │ 1147 │
│ aspp2019_4                   │ 669       │ 120    │ 8       │ 541      │ -0.629 │ 1146 │
│ aspp2019_2                   │ 670       │ 109    │ 7       │ 554      │ -0.664 │ 1128 │
│ aspp2019_0                   │ 671       │ 10     │ 29      │ 632      │ -0.927 │  778 │
└──────────────────────────────┴───────────┴────────┴─────────┴──────────┴────────┴──────┘

Debilski commented 1 month ago

(Parallelisation is by the way a relative non-issue. Thanks to having a proper database we can just run a bunch of ci_engines at the same time.)

Debilski commented 2 weeks ago

For reference, it is now possible to extract all games with errors from the database:

select * from games where json_extract(final_state, '$.num_errors') != '[0,0]' ;
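
As a rough sketch, the same query can be run from Python with the standard `sqlite3` module. Only the table name `games`, the column `final_state` and the key `num_errors` come from the query above. The `id` and `seed` columns and the sample rows are assumptions for illustration, and the example presumes an SQLite build with the JSON functions available.

```python
import json
import sqlite3

# Hypothetical schema: only `games`, `final_state` and `num_errors`
# appear in the thread; `id` and `seed` are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE games (id INTEGER PRIMARY KEY, seed INTEGER, final_state TEXT)"
)


def compact(state):
    """Serialise without whitespace, so the error array reads '[0,0]'."""
    return json.dumps(state, separators=(",", ":"))


# Made-up sample games: one clean, two with errors on either side.
conn.executemany(
    "INSERT INTO games (id, seed, final_state) VALUES (?, ?, ?)",
    [
        (1, 1001, compact({"num_errors": [0, 0]})),
        (2, 1002, compact({"num_errors": [1, 0]})),
        (3, 1003, compact({"num_errors": [0, 2]})),
    ],
)

QUERY = """
    SELECT id, seed FROM games
    WHERE json_extract(final_state, '$.num_errors') != '[0,0]'
    ORDER BY id
"""
games_with_errors = conn.execute(QUERY).fetchall()
print(games_with_errors)
```

Selecting the seed next to the game id makes it possible to replay a failing game, which is the debugging use case the issue asks for.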