Mismatch between IMPALA discounting steps vs. actual time steps taken in NetHack

Background

In NetHack, some action key press combinations only count as one time step in the game, while other combinations count as many time steps. As an example of the latter, one can issue a number (e.g. '8') followed by an action like search ('s'). This will make the agent search 8 times and move time forward by the same amount, while requiring only two key presses.

Issue

The discounting logic for the current IMPALA agent (see https://github.com/facebookresearch/nle/blob/main/nle/agent/agent.py#L255) seems to rely only on the number of key presses, so it fails to capture the case described in the background above, where the agent issues a number followed by another command. As a result, issuing a numbered action (i.e. an action preceded by a count) can become preferable to manually repeating the action, because the count avoids part of the discounting penalty that should have been applied: issuing '8' followed by 's' is discounted as if the agent took only two actions, when it should be discounted as if 8 actions were taken. This results in trained agents that regularly take numbered actions, even though there is no advantage to doing so in the real NetHack game.

NOTE: The combination of numbers and actions above is just an example, and there are other times a similar discrepancy happens (e.g. when wielding a weapon, which requires two keypresses but only one time step in NetHack).

Request

Do you have any suggestions for how to get around this issue? For now I was thinking of writing my own discounting logic that applies discounting based on the number of time steps passed in the environment, but I wanted to check here first for any thoughts.
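
To make the idea concrete, here is a minimal sketch of turn-based discounting. It is not the agent's actual code: `per_step_discounts` and `discounted_returns` are hypothetical helper names, and the sketch assumes the in-game turn counter has already been recorded before the first key press and after each subsequent one.

```python
def per_step_discounts(turns, gamma=0.99):
    """Return one discount per key press, based on in-game turns elapsed.

    `turns` holds the turn counter before the first key press and after
    each key press, so it has one more entry than there are key presses.
    (Hypothetical helper, not part of NLE.)
    """
    discounts = []
    for before, after in zip(turns, turns[1:]):
        # Clamp at zero in case the counter resets between episodes.
        elapsed = max(after - before, 0)
        discounts.append(gamma ** elapsed)
    return discounts


def discounted_returns(rewards, discounts, bootstrap=0.0):
    """Backward recursion G_t = r_t + d_t * G_{t+1} with per-step discounts."""
    returns = []
    acc = bootstrap
    for reward, discount in zip(reversed(rewards), reversed(discounts)):
        acc = reward + discount * acc
        returns.append(acc)
    returns.reverse()
    return returns


# Example: '8' (0 turns), 's' (8 turns), then a one-turn move.
turns = [100, 100, 108, 109]
discounts = per_step_discounts(turns, gamma=0.99)
returns = discounted_returns([0.0, 0.0, 1.0], discounts)
```

With this scheme the '8' key press costs no discount and the 's' key press costs `gamma ** 8`, which matches searching eight times by hand. One caveat under these assumptions: key presses that consume no game time are discounted by 1, so the agent is no longer penalised for issuing them at all.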

NOTE

I wasn't sure which category this question falls into (bug, feature, ...) so I went ahead and started with this blank issue. However, if you feel like one of the other templates is more appropriate, let me know and I'm happy to reformat into one of those : )

Hi Jens, thanks for opening this 'issue'. This is an astute observation and indeed one that we have noticed. In general the agent will learn 'shortcuts' for certain actions where possible. This includes using 'move a lot' actions rather than moving one step at a time ('J' vs ['j','j','j']) and repeated searching ('9s'), among other behaviours.

I don't think we see this as a flaw in the agent; if anything, it's probably a more efficient style of play. If you'd like to change discounting to be based on in-game clock time, you could check the 'turns' counter in blstats. I warn you that this too might cause strange behaviours: different actions take different numbers of 'turns', and you might disincentivise actions that take a long time, like engraving or resting.
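
As a rough sketch of reading that counter, the snippet below tracks how many turns each key press consumed. `TurnTracker` is a hypothetical helper, not part of NLE, and the position of the turn counter in blstats (`TIME_INDEX = 20`) is an assumption that should be checked against the NLE version in use.

```python
# Assumption: index of the in-game turn counter within the blstats vector.
# Verify this against your NLE version before relying on it.
TIME_INDEX = 20


class TurnTracker:
    """Tracks in-game turns elapsed between consecutive observations.

    Hypothetical helper; expects observations shaped like dicts with a
    "blstats" entry that supports integer indexing.
    """

    def __init__(self, time_index=TIME_INDEX):
        self.time_index = time_index
        self.last = None

    def reset(self, obs):
        """Call with the first observation of each episode."""
        self.last = int(obs["blstats"][self.time_index])

    def step(self, obs):
        """Return the turns elapsed since the previous observation."""
        now = int(obs["blstats"][self.time_index])
        if self.last is None:
            # No reset was seen yet; treat this observation as the start.
            self.last = now
            return 0
        # Clamp at zero so a counter reset never yields a negative value.
        elapsed = max(now - self.last, 0)
        self.last = now
        return elapsed
```

The elapsed value could then be used as the exponent of the discount factor for that key press, keeping in mind the caveat above about long actions such as engraving or resting.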

It's just a thought; do post back here if you try this, as I'd be curious about the results. Since this isn't really an issue, I will close it for now. Perhaps it might warrant a discussion on the NetHack Challenge Discord: https://discord.gg/zkFWQmSWBA
