facebookresearch / nle

The NetHack Learning Environment

Dump core message in wizard mode after max steps is reached #296

Closed kolbytn closed 2 years ago

kolbytn commented 2 years ago

🐛 Bug

In wizard mode, the message "Dump core? [ynq] (q)" appears when the agent survives past max_steps.

To Reproduce

Steps to reproduce the behavior:

  1. Set wizard=True
  2. Set max_steps=10
  3. Take > 10 steps.

Expected behavior

NLE should be able to quit normally when max_steps is reached in wizard mode.

Environment

NLE version: 0.7.3
PyTorch version: 1.10.0+cu113
Is debug build: No
CUDA used to build PyTorch: 11.3

OS: Ubuntu 20.04.2 LTS
GCC version: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
CMake version: version 3.21.3

Python version: 3.8
Is CUDA available: Yes
CUDA runtime version: Could not collect
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 3090
Nvidia driver version: 495.29.05
cuDNN version: Could not collect

Versions of relevant libraries:
[pip3] numpy==1.21.2
[pip3] torch==1.10.0+cu113
[pip3] torchtext==0.11.0
[conda] Could not collect

Additional context

I am using NLE via MiniHack. I assume the bug is in NLE, but let me know if I should create an issue for MiniHack instead.

Additionally, NetHack terminates with a segfault for me after about 10-15 minutes (~800K steps). I assume the two issues are related, but I haven't been able to reproduce the segfault consistently enough to check.

kolbytn commented 2 years ago

Update: I created a temporary workaround that prevents the Dump core? [ynq] (q) message by manually resetting NetHack before max steps is reached, but I still consistently get segfaults after ~800K steps. If I turn off wizard mode, the segfault goes away.
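The workaround itself is not shown in this thread. A minimal sketch of what such a wrapper might look like follows, assuming the old Gym 4-tuple `step` API used elsewhere in this issue. The class name `ResetBeforeLimit`, the `margin` parameter, and the `truncated_by_wrapper` info key are hypothetical and not part of NLE or MiniHack.

```python
class ResetBeforeLimit:
    """Hypothetical wrapper: report `done` shortly before the env's own step limit.

    Assumes `env` exposes reset() and a step() returning
    (obs, reward, done, info), as in the old Gym API.
    """

    def __init__(self, env, max_steps, margin=1):
        if margin < 1 or max_steps <= margin:
            raise ValueError("need 1 <= margin < max_steps")
        self.env = env
        self.limit = max_steps - margin  # last step we allow per episode
        self.steps = 0

    def reset(self):
        self.steps = 0
        return self.env.reset()

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self.steps += 1
        if not done and self.steps >= self.limit:
            # End the episode ourselves so the caller resets before the
            # underlying env hits its limit and quits the game.
            done = True
            info = dict(info, truncated_by_wrapper=True)
        return obs, reward, done, info
```

With `max_steps=10` and the default `margin=1`, the wrapper reports `done` on the ninth step, so a training loop that resets on `done` never reaches the tenth.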

heiner commented 2 years ago

Hey Kolby,

Thanks for your interest in NLE and your feedback.

It's true that wizard ("debug") mode is used in NLE as more or less a hack, and lots of issues surface. I'll look into the two issues you mentioned soon; I suspect the first one should be an easy fix, while the second one may not be. Stay tuned.

heiner commented 2 years ago

Hey Kolby,

Could you add a more thorough reproduction step, and maybe include a ttyrec? My attempts at reproducing this behavior have failed: https://github.com/facebookresearch/nle/pull/298

The Dump core? question gets auto-declined for me due to this line: https://github.com/facebookresearch/nle/blob/main/nle/env/base.py#L639, which has been there for quite a while.

kolbytn commented 2 years ago

Sorry, it looks like I had allow_all_yn_questions=True. I assume the Dump core? message still shouldn't appear, even with yes/no questions enabled.

I'll create more thorough reproduction steps today if you still have issues reproducing.

heiner commented 2 years ago

Thanks for the reply!

Note that I am using allow_all_yn_questions=True in #298, and the test there seems to pass without failing.

kolbytn commented 2 years ago

It was never an issue of the environment failing. In wizard mode the environment correctly returns done after the max steps have been reached, but the final message from the observation is Dump core? [ynq] (q). This on its own may not be an issue, but I assumed that it was related to the segfaults I was getting after more training.

The code below prints Found 'Dump core? [ynq] (q)' on step 99 when wizard=True, but nothing when wizard=False. I assumed this was evidence of some further bug that was causing the segfaults.

If this is a known instability in wizard mode, I understand.

```python
import gym
import nle  # noqa: F401  (importing nle registers the NetHack environments)

env = gym.make(
    "NetHack-v0",
    wizard=True,
    max_episode_steps=100,
    allow_all_yn_questions=True,
    allow_all_modes=True,
)

obs = env.reset()
for i in range(200):
    # Take a random action each step.
    a = env.action_space.sample()
    obs, reward, done, info = env.step(a)

    # Read the raw message buffer from the last observation and decode it.
    idx = env._observation_keys.index("message")
    message = env.last_observation[idx].tobytes().decode("utf-8")
    if "Dump core?" in message:
        print("Found '{}' on step {}".format(message, i))

    if done:
        break
env.close()
```

heiner commented 2 years ago

Ah, I'm sorry about misunderstanding.

The Dump core? [ynq] (q) question appears for every wizard mode NetHack game that ends via #quit and does not indicate that an actual issue has occurred.
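Since the prompt is benign, a training loop could recognize and ignore it rather than treating it as a failure. A minimal sketch, assuming the message observation is a fixed-length byte buffer padded with NUL bytes. The helper names `decode_message` and `is_benign_quit_prompt` are hypothetical and not part of NLE.

```python
DUMP_CORE_PROMPT = "Dump core?"


def decode_message(raw):
    """Decode a NUL-padded byte buffer into text.

    `raw` is assumed to be bytes-like (bytes, bytearray, or a uint8 array
    that supports the buffer protocol).
    """
    data = bytes(raw)
    # Keep only the part before the first NUL byte (the padding).
    return data.split(b"\x00", 1)[0].decode("utf-8", errors="replace")


def is_benign_quit_prompt(message, wizard):
    """True if the message is the wizard-mode quit prompt described above."""
    return bool(wizard) and DUMP_CORE_PROMPT in message
```

In the reproduction script above, `decode_message(env.last_observation[idx])` could replace the `tobytes().decode("utf-8")` call so that the printed message does not carry trailing NUL characters.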

kolbytn commented 2 years ago

Awesome thank you for your help! I'll comment here if I find a better way to reproduce the segfault.

heiner commented 2 years ago

> I'll comment here if I find a better way to reproduce the segfault.

Awesome, please re-open this issue at that point.