Bugfix enet disconnects in scenarios where console.step() is called infrequently

default0 commented 1 year ago

Hi

I've experienced an issue where not calling console.step() frequently enough will lead to console.step() permanently not working. The issue should be reproducible using the following code:

import datetime
import time
import melee

console = melee.Console(fullscreen=False,
                        disable_audio=True,
                        polling_mode=False,
                        blocking_input=True,
                        path="/path/to/dolphin"
                        )
controller = melee.Controller(console=console,
                              port=1,
                              type=melee.ControllerType.STANDARD)
console.run("/path/to/melee/iso")
console.connect()
controller.connect()

while True:
    print(f"{(datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S.%f'))} Calling console.step()")
    gamestate = console.step()
    # step() returns None when the file ends
    if gamestate is None:
        continue
    print(f"{(datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S.%f'))} Frame {gamestate.frame} - Waiting for {gamestate.frame ** 2} seconds")
    time.sleep(gamestate.frame ** 2)

For me, this usually locks up after a few frames (frame ~7 or ~8 on my system), where console.step() will hang for a very long time and then return None (despite having polling_mode=False).

After investigating the issue for a bit, I noticed that libmelee only calls ENet's host.service() method when console.step() is called. According to ENet's documentation, calling this method too infrequently leads to failures where one side of the connection may no longer be able to receive events (see here).

This is vexing because in many ML scenarios being able to control when the environment advances is crucial, and certain types of algorithms (PPO, for example) are often implemented so that, once data has been collected from the environment and the network is being updated, the environment is effectively paused during this computation (which, depending on the hyperparameters, can take a while).

After looking into options for correcting this, I have two proposals for how to go about fixing this, but would like to know which one is preferable before I sink time into an implementation/pull request.

Option 1: Make slippstream fully asynchronous. This involves spawning a background thread in slippstream that calls dispatch() in a loop (and also making dispatch() private) to manually build up an event queue. This event queue can then simply be read/waited for by the console whenever step() is called.
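
To make Option 1 a bit more concrete, here is a minimal, library-agnostic sketch of the pattern. It is not libmelee code: BackgroundPoller and poll_once are hypothetical names, and poll_once stands in for whatever single non-blocking service call slippstream would make on the ENet host. The point is only that the connection keeps getting serviced while the caller is busy, and step() would read from the queue.

```python
import queue
import threading
import time


class BackgroundPoller:
    """Polls a connection from a daemon thread and buffers the events it yields.

    `poll_once` is a hypothetical stand-in for one non-blocking service call
    on the connection (returns an event or None). It is not libmelee API.
    """

    def __init__(self, poll_once, interval=0.001):
        self._poll_once = poll_once
        self._interval = interval
        self._events = queue.Queue()
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def _run(self):
        # Keep servicing the connection no matter how rarely the caller steps.
        while not self._stop.is_set():
            event = self._poll_once()
            if event is not None:
                self._events.put(event)
            else:
                time.sleep(self._interval)

    def next_event(self, timeout=None):
        """Block until a buffered event is available; return None on timeout."""
        try:
            return self._events.get(timeout=timeout)
        except queue.Empty:
            return None

    def stop(self):
        self._stop.set()
        self._thread.join()


# Demo with a fake connection that produces three events.
_pending = iter(["frame 1", "frame 2", "frame 3"])
poller = BackgroundPoller(lambda: next(_pending, None))
poller.start()
time.sleep(0.1)  # the caller is "busy"; polling continues in the background
received = [poller.next_event(timeout=1.0) for _ in range(3)]
poller.stop()
```

In this shape, step() would reduce to a call like next_event(), so a long pause between steps only lets the queue grow instead of starving the connection.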

Option 2: Allow calling step() without flushing controller inputs. This keeps the game frozen and in a predictable state while still updating the network and calling host.service() - however, the user must still make sure that step() is called frequently enough to avoid networking issues.
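
As a rough illustration of Option 2, the toy model below separates "service the network" from "flush inputs". Everything in it is hypothetical (ToyConsole, the flush_inputs parameter, and the assumption that the game only advances once inputs are flushed); it only shows the intended calling pattern, not how libmelee is implemented.

```python
class ToyConsole:
    """Toy model of the proposal. None of these names are libmelee API."""

    def __init__(self):
        self.frame = 0
        self.service_calls = 0

    def _service_network(self):
        # Stand-in for the ENet host.service() call.
        self.service_calls += 1

    def step(self, flush_inputs=True):
        # The network is always serviced, whether or not inputs are flushed.
        self._service_network()
        if flush_inputs:
            # Assumption: only flushed inputs let the game advance a frame.
            self.frame += 1
        return self.frame


console = ToyConsole()
console.step()                        # normal step: the frame advances
for _ in range(5):                    # keep-alive steps during a long update
    console.step(flush_inputs=False)  # network serviced, game stays frozen
```

The caller would still have to sprinkle these keep-alive calls through any long-running computation, which is the drawback noted above.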

A quick answer on which approach would be preferable in a pull request - or an outline of another, better approach - would be welcome.

Cheers

altf4 commented 1 year ago

Yup, that is a known deficiency right now.

The workaround is to not block for very long in the frame loop. If you're making an AI with a neural network, for instance, try shipping your data out to another process / machine to do backprop.

Ideally, we'd tear out the ENet connection from Dolphin completely and replace it with websockets. ENet is really inappropriate for this layer. It's only there since it meant we didn't have to add a new dependency to Dolphin.

altf4 commented 1 year ago

Alternatively, xpilot made a fork that does a multiprocessing thing that keeps polling the enet connection for you. I didn't like it for mainstream libmelee. I THINK it's this branch? https://github.com/vladfi1/libmelee/tree/enet_timeout But you might want to ask xpilot in the Slippi discord.

default0 commented 1 year ago

I've already worked around this problem locally on my end and don't require a fix for myself - if there is no intent of fixing the issue in this repo until changes are made on the Dolphin side, I think it's most appropriate to close this issue.

Cheers

altf4 commented 1 year ago

yea, imo this is a Dolphin problem.

altf4 / libmelee

Bugfix enet disconnects in scenarios where console.step() is called infrequently #91