ctm / mb2-doc

Mb2, poker software
https://devctm.com
7 stars 2 forks source link

Table hung in this evening's tournament #1271

Closed ctm closed 7 months ago

ctm commented 7 months ago

Find and fix whatever caused the hang seen with this evening's tournament on table 5563.

I was signed up for the tournament and was assigned to table 5563, but I missed the start due to needing to talk to my step-mother after finding out that she had taken a nasty fall earlier in the day. When I finally did log in I found that my assigned table wasn't running. sudo docker logs -t mb2 didn't show anything, but the various logs in the database will probably tell the story.

I've added the regression label, but this probably isn't a regression, per-se, because it probably only applies to re-entries, which are still relatively new.

Ugh!

In addition, albeit tracked as issue #1272 :

ts4z: the poker bug gods have awared ADB DaVoice a bonus 586k in chips

ts4z commented 7 months ago

Chip count bug may be cosmetic. After the break, our table refreshed with the right players and reasonable chip counts. It was, however, still frozen.

ctm commented 7 months ago

Thank you for the additional info @ts4z. I've spun off a new issue that I'll get to eventually (and will keep in the back of my mind as I work on this one).

ctm commented 7 months ago

It looks like the problem is simply that a re-entry during a pause for redraw hoses the table. Here are the relevant messages:

mb2=> select id, received_at at time zone 'mst', substr(message::text, 0, 80), table_id from public_table_messages where player_id is null and table_id in (5563, 5564) order by received_at;
    id    |          timezone          |                                     substr                                      | table_id 
----------+----------------------------+---------------------------------------------------------------------------------+----------
...
 10282651 | 2024-01-03 18:12:01.459779 | {"Bust": [9, null, 4961, 0, 26, 0, null, true, 1]}                              |     5563
 10282652 | 2024-01-03 18:12:05.252207 | {"Checks": [14, null]}                                                          |     5564
 10282653 | 2024-01-03 18:12:05.256814 | {"BeginRound": [1, null, 1200, 0, false, 99]}                                   |     5564
 10282654 | 2024-01-03 18:12:05.262967 | {"Status": {"board": {"grid": null, "cards": {"cards": [{"Natural": {"rank": 4, |     5564
 10282655 | 2024-01-03 18:12:05.267446 | {"Remind": [20000, 0, "2024-01-04T01:12:25.250953112Z", 13, 20]}                |     5564
 10282656 | 2024-01-03 18:12:08.401969 | "PausedForRedraw"                                                               |     5563
 10282657 | 2024-01-03 18:12:08.549547 | {"Bets": [13, 2400, null, false]}                                               |     5564
 10282658 | 2024-01-03 18:12:08.554145 | {"Remind": [20000, 2400, "2024-01-04T01:12:28.548180508Z", 14, 20]}             |     5564
 10282659 | 2024-01-03 18:12:11.673466 | {"Chat": [37, "gg"]}                                                            |     5563
 10282660 | 2024-01-03 18:12:11.831458 | {"NewPlayersAdded": [4961, [9]]}                                                |     5564
...

after the PausedForRedraw, the only messages that I see that were sent to 5563 were Chat, OnLevel and MovedFrom until table 5564 was finished and everyone was merged back to 5563.

So, logically when Grizz got re-entered to 5564, that made the server think there was no longer a need to have the tables paused for redraw, but it updated its state without actually restarting 5563. That should be trivial to find through code inspection.

ctm commented 7 months ago

The solution is to add a method that checks to see if there are any paused tables that need to be restarting and to restart them and then to call this method after new players are added (either via late registration or re-entry) and to also call it at the end of a hand, although if any paused tables need restarting at the end of a hand, use warn! to let us know something went wrong, because that should never happen.

I'll code this up and deploy it after I get back from my morning run.

Whee!

ctm commented 7 months ago

I chose not to run after all, since I'd just be champing at the bit to fix this bug.

Upon reconsideration, I chose not to put the sanity check in at EndOfHand. I think that would be an anti-pattern use of defensive programming. It either wouldn't change anything or it would fix a real problem while creating a log entry that I could overlook or it could conceivably transform a bug that would be hard to miss into a subtle bug.

So, I've coded up a solution for the actual problem we saw last night and am deploying that now.