ASPP / pelita

Actor-based Toolkit for Interactive Language Education in Python
https://github.com/ASPP/pelita_template

Exception in bot crashes server mode #799

Closed otizonaizit closed 2 months ago

otizonaizit commented 2 months ago

I am reporting this issue here for future reference, because at the moment I cannot provide a simple reproducer.

While running in server mode:

server$ pelita-server remote-server --address XXXXX --port YYYY --config pelita_server_conf.yaml

if one of the configured bots throws an exception during a game, the server appears to keep running (systemd reports the service as running) but stops responding to new remote requests:

client$ pelita demo01_stopping.py pelita://XXXXX
Remote player requested. Scanning server for players.
Server did not reply in 5000 ms.
No remote team selected. Exiting.

The Exception found in the log was the following one:

Traceback (most recent call last):
  File "ZZZZ/pelita_private_bots/sparring_bots/aspp2022_3/bot.py", line 45, in move
    next_pos = check_next_pos( bot, state , next_pos)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "ZZZZ/pelita_private_bots/sparring_bots/aspp2022_3/group3_utils.py", line 42, in check_next_pos
    next_pos = check_legal_status(bot, state, next_pos)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "ZZZZ/pelita_private_bots/sparring_bots/aspp2022_3/group3_utils.py", line 211, in check_legal_status
    next_pos = smart_random_choice( bot, state)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "ZZZZ/pelita_private_bots/sparring_bots/aspp2022_3/group3_utils.py", line 120, in smart_random_choice
    next_pos = bot.random.choice(   bot.legal_positions  )
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/random.py", line 373, in choice
    raise IndexError('Cannot choose from an empty sequence')
IndexError: Cannot choose from an empty sequence

I am assuming the exception and the freeze are connected, but a reproducer would be needed to confirm that one actually causes the other :-)

otizonaizit commented 2 months ago

OK, at least now I can confirm that the exception freezes the server. For proper debugging we still need a simpler reproducer, of course.
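As a starting point for such a reproducer, here is a minimal sketch of a bot whose move function fails the same way as in the traceback above. It assumes the usual `move(bot, state)` entry point seen in the traceback; `TEAM_NAME` value, `fake_bot`, and the `SimpleNamespace` stand-in are hypothetical and only let the snippet run without pelita installed. Whether a bot this small is enough to freeze the server has not been verified.

```python
import random
from types import SimpleNamespace

TEAM_NAME = "Crashing Bot"  # hypothetical team name, for illustration only


def move(bot, state):
    # Same failing call as in the traceback: choosing from an empty
    # sequence raises IndexError inside random.choice.
    return bot.random.choice(bot.legal_positions)


# Hypothetical stand-in for the bot object, exposing only the two
# attributes that move() touches. This is not the real pelita Bot class.
fake_bot = SimpleNamespace(random=random.Random(0), legal_positions=[])

raised = False
try:
    move(fake_bot, {})
except IndexError:
    # This is the exception type reported in the server log.
    raised = True
```

In a real test the empty list would have to come from the bot's own logic, since the stand-in above simply hard-codes it.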

Debilski commented 2 months ago

OK, I can reproduce it. The problem occurs because we hang in handle_known_client when sending to the pair socket of the broken player. That function needs a timeout on send (in fact, every recv and send needs a timeout). Additionally, it might make sense to rewrite the whole server async-style, so that a slow send, for whatever reason, does not block the server.
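To illustrate the send-timeout idea, here is a minimal sketch that uses plain standard-library sockets as a stand-in for the pair socket; it is not the pelita server code. The helper name `send_with_timeout`, the chunk size, and the timeout value are all assumptions made for the example. With pyzmq the analogous knob would presumably be the `zmq.SNDTIMEO` socket option, which is not shown here.

```python
import socket


def send_with_timeout(sock, payload, timeout_s):
    """Try to send payload; return False instead of blocking forever."""
    sock.settimeout(timeout_s)
    try:
        sock.sendall(payload)
        return True
    except socket.timeout:
        # The peer is not draining its side; give up so that the
        # caller can go on serving other clients.
        return False


# Simulate a broken player: the peer end exists but never reads.
server_side, stalled_peer = socket.socketpair()

chunk = b"x" * 65536
sent_chunks = 0
gave_up = False
for _ in range(10_000):
    if send_with_timeout(server_side, chunk, 0.2):
        sent_chunks += 1
    else:
        # The kernel buffer is full and the timeout expired.
        gave_up = True
        break

server_side.close()
stalled_peer.close()
```

Without the timeout, the loop would block indefinitely on the first send that no longer fits into the socket buffer, which matches the hang described above. Note that after a timed-out `sendall` it is unknown how much of the payload was delivered, so the connection to that player should probably be dropped rather than reused.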